Re: Metadata and FullText, indexed at different times - looking for best approach
Thank you. Re-indexing does look like a real option then. I am now looking at storing the text/files in MongoDB or the like and indexing into SOLR from that. Initially, I was going to skip the DB part for as long as possible.

Regarding the use case: yes, it does make sense to have just the metadata. It is rich, curated metadata that works without the files (there are several per record, each in its own language). So, before the files show up, the search is against title/subject/etc. When the files show up, one by one, they get added to the index for additional/enhanced results.

Again, thank you for walking through this with me.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Jul 17, 2012 at 9:12 AM, Erick Erickson wrote:
> In that case, I think your best option is to re-index the entire document
> when you have the text available, metadata and all.
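For concreteness, here is a minimal sketch of the per-file extraction step in that one-by-one flow, using the Tika facade directly rather than DIH. The file path, naming scheme, and record id below are hypothetical stand-ins:

import java.io.File;
import org.apache.tika.Tika;

public class ExtractOnArrival {
    public static void main(String[] args) throws Exception {
        // One file per language, matched back to its record by file name
        // (the path and naming convention are made up for this sketch).
        File incoming = new File("/incoming/record-42_fr.pdf");

        // Tika auto-detects the format and returns plain text, the same
        // extraction DIH's TikaEntityProcessor would perform at ingestion time.
        String text = new Tika().parseToString(incoming);

        // ... hand "text" plus the record id to whatever rebuilds the Solr doc ...
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}

The extracted text would then feed whichever of the indexing routes discussed in the rest of the thread is chosen.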
Re: Metadata and FullText, indexed at different times - looking for best approach
In that case, I think your best option is to re-index the entire document when you have the text available, metadata and all. Which actually raises the question of whether you want to index the bare metadata at all. Does the user actually get value from the use case where there is no text? If not, forget DIH and just index the metadata as a result of the text becoming available.

Best,
Erick

On Mon, Jul 16, 2012 at 1:43 PM, Alexandre Rafalovitch wrote:
> What about something built around multiple cores? Could I have the
> full-text fields stored in separate cores and somehow (again, with
> minimum hand-coding) search against all those cores and get back a
> combined list of document IDs?
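A minimal SolrJ sketch of that re-index-on-arrival approach, assuming the metadata can be fetched again from whatever backs the DIH import (the core URL and field names here are hypothetical):

import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReindexOnTextArrival {

    private final SolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    /** Called whenever the full-text file for a metadata record shows up. */
    public void reindex(String id, Map<String, Object> metadata, String extractedText)
            throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id); // the uniqueKey
        for (Map.Entry<String, Object> field : metadata.entrySet()) {
            doc.addField(field.getKey(), field.getValue()); // re-add every metadata field
        }
        doc.addField("text_en", extractedText); // plus the newly extracted text

        // Adding a document whose uniqueKey already exists replaces the old
        // version wholesale, so nothing needs to be stored for this to work.
        solr.add(doc);
        solr.commit();
    }
}

Because the whole document is re-added under its uniqueKey, this route does not require the metadata fields to be stored, unlike the atomic-update route described later in the thread.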
Re: Metadata and FullText, indexed at different times - looking for best approach
Thank you,

I am already on 4.0-alpha. The patch feels a little too unstable for my needs/familiarity with the code.

What about something built around multiple cores? Could I have the full-text fields stored in separate cores and somehow (again, with minimum hand-coding) search against all those cores and get back a combined list of document IDs? Or would that make comparative ranking/sorting impossible?

Regards,
   Alex.

On Sun, Jul 15, 2012 at 12:08 PM, Erick Erickson wrote:
> You've got a couple of choices. There's a new patch in town
> https://issues.apache.org/jira/browse/SOLR-139
> that allows you to update individual fields in a doc if (and only if)
> all the fields in the original document were stored (actually, all the
> non-copy fields).
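For what it's worth, Solr's distributed search can fan a single query out over several cores via the shards parameter and merge the results by score. A hedged sketch of what that would look like (host, core, and field names are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CrossCoreSearchSketch {
    public static void main(String[] args) throws Exception {
        // Any core can act as the aggregator for a distributed request.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/metadata");

        SolrQuery q = new SolrQuery("title:ecology OR text_en:ecology");
        // Fan the query out over both cores; Solr merges the results by score.
        q.set("shards", "localhost:8983/solr/metadata,localhost:8983/solr/fulltext");
        q.setFields("id", "score");

        QueryResponse rsp = solr.query(q);
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("id") + "  " + d.getFieldValue("score"));
        }
    }
}

Two caveats bear directly on the ranking question: in 4.0, scores are computed with per-shard IDF, so relevance across a metadata core and a full-text core is not strictly comparable, and a uniqueKey that appears in more than one shard gives undefined results.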
Re: Metadata and FullText, indexed at different times - looking for best approach
You've got a couple of choices. There's a new patch in town, https://issues.apache.org/jira/browse/SOLR-139, that allows you to update individual fields in a doc if (and only if) all the fields in the original document were stored (actually, all the non-copy fields).

So if you're storing (stored="true") all your metadata information, you can just update the document when the text becomes available, assuming you know the uniqueKey when you update.

Under the covers, this will find the old document, get all the fields, add the new fields to it, and re-index the whole thing.

Otherwise, your fallback idea is a good one.

Best,
Erick

On Sat, Jul 14, 2012 at 11:05 PM, Alexandre Rafalovitch wrote:
> Hello,
>
> I have a database of metadata and I can inject it into SOLR with DIH
> just fine. But I also have documents to extract full text from, which
> I want to add to the same records as additional fields. I think DIH
> allows running Tika at ingestion time, but I may not have the
> full-text files at that point (they could arrive days later). I can
> match a file to its metadata record by the file name matching a field
> value.
>
> What is the best approach to do that staggered indexing with minimum
> custom code? I guess my fallback position is a custom full-text
> indexer agent that re-adds the metadata fields when the file is being
> indexed. Is there anything better?
>
> I am a newbie using v4.0-alpha of SOLR (and loving it).
>
> Thank you,
> Alex.
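For illustration, the update Erick describes can be expressed in SolrJ with the atomic-update syntax that ended up in Solr 4.0 (the core URL and field names below are hypothetical):

import java.util.Collections;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "record-42"); // uniqueKey of the already-indexed metadata doc

        // A Map value with a "set" key tells Solr to replace just this field;
        // every other field is rebuilt from its stored value, which is why
        // all non-copy fields must be stored="true" for this to be safe.
        doc.addField("text_en",
                Collections.singletonMap("set", "full text extracted from the file"));

        solr.add(doc);
        solr.commit();
    }
}

The Map value with the "set" key is what distinguishes this from a plain re-add, and it is the reason for the stored-fields requirement described above.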