You didn't open-source it, by any chance? :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, May 26, 2013 at 8:23 PM, Yury Kats <yuryk...@yahoo.com> wrote:
> That's exactly what happens. Each stream goes into a separate document.
> If all streams share the same unique id parameter, the last stream
> will overwrite everything.
>
> I asked this same question last year, got no responses, and ended up
> writing my own UpdateRequestProcessor.
>
> See http://tinyurl.com/phhqsb4
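>
> For the archives, the rough shape of the idea (just a sketch; the class
> name and details here are mine and may not match the processor behind
> that link; it leans on Solr 4.x atomic updates, so it needs all fields
> stored and a _version_ field):
>
> import java.io.IOException;
> import java.util.Collections;
>
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.solr.common.SolrInputField;
> import org.apache.solr.update.AddUpdateCommand;
> import org.apache.solr.update.processor.UpdateRequestProcessor;
>
> // Rewrites every non-uniqueKey field as an atomic "add", so repeated
> // adds with the same id get merged instead of replaced. Must run
> // before DistributedUpdateProcessor in the update chain.
> public class MergeFieldsProcessor extends UpdateRequestProcessor {
>
>     public MergeFieldsProcessor(UpdateRequestProcessor next) {
>         super(next);
>     }
>
>     @Override
>     public void processAdd(AddUpdateCommand cmd) throws IOException {
>         SolrInputDocument doc = cmd.getSolrInputDocument();
>         for (SolrInputField field : doc) {
>             if ("id".equals(field.getName())) {
>                 continue; // leave the uniqueKey alone
>             }
>             // {"add": value} is the atomic-update syntax; it appends
>             // to multivalued fields on the already-indexed document
>             field.setValue(Collections.singletonMap("add", field.getValue()), 1.0f);
>         }
>         super.processAdd(cmd);
>     }
> }
>
> You'd still need the usual UpdateRequestProcessorFactory wrapper and an
> entry in an updateRequestProcessorChain in solrconfig.xml.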
>
> On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote:
>> If I understand correctly, the issue is:
>> 1) The client provides multiple content streams and expects Tika to
>> parse all of them and stick all the extracted content into one big
>> SolrDoc.
>> 2) Tika (looking at the load() method of ExtractingDocumentLoader.java,
>> Github link: http://bit.ly/12GsDl9 ) does not actually expect that
>> its load() method may be called multiple times, and therefore happily
>> submits the document at the end of each call. It probably submits a
>> new document for each content source, which probably means it just
>> overwrites the same doc over and over again.
>>
>> If I am right, then we have a bug in the Tika handler's expectation
>> of a single load() call. The next step would be to put together a very
>> simple use case and open a Jira issue with it.
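>>
>> Something like this should be enough as the repro (untested sketch; it
>> assumes the stock 4.x example core with /update/extract enabled,
>> uniqueKey "id", and two arbitrary PDFs on disk):
>>
>> import java.io.File;
>>
>> import org.apache.solr.client.solrj.SolrServer;
>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
>> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
>>
>> public class MultiStreamRepro {
>>     public static void main(String[] args) throws Exception {
>>         SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
>>
>>         // one request, two content streams, one literal.id
>>         ContentStreamUpdateRequest req =
>>             new ContentStreamUpdateRequest("/update/extract");
>>         req.addFile(new File("a.pdf"), "application/pdf");
>>         req.addFile(new File("b.pdf"), "application/pdf");
>>         req.setParam("literal.id", "doc1");
>>         req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>         server.request(req);
>>
>>         // expected: one doc containing content from both files;
>>         // observed: only one file's content survives
>>     }
>> }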
>>
>> Regards,
>>    Alex.
>> P.S. I am not a Solr code wrangler, so this MAY be completely wrong.
>>
>> Personal blog: http://blog.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>
>>
>> On Sun, May 26, 2013 at 10:46 AM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>> I'm still not quite getting the issue. Separate requests (i.e. any
>>> addition of a SolrInputDocument) are treated as separate documents.
>>> There's no notion of "append the contents of one doc to another based
>>> on ID", unless you're doing atomic updates.
>>>
>>> And Tika takes some care to index separate files as separate documents.
>>>
>>> Now, if you don't need these to have the same uniqueKey, you might
>>> index them as separate documents and include a field that lets you
>>> associate these documents somehow (see the group/field collapsing Wiki
>>> page).
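>>>
>>> I.e. something like this at query time (sketch; "groupid" is a made-up
>>> association field, "server" an existing SolrServer):
>>>
>>> import org.apache.solr.client.solrj.SolrQuery;
>>> import org.apache.solr.client.solrj.response.QueryResponse;
>>> import org.apache.solr.common.params.GroupParams;
>>>
>>> SolrQuery q = new SolrQuery("fulltext:whatever");
>>> q.set(GroupParams.GROUP, true);            // &group=true
>>> q.set(GroupParams.GROUP_FIELD, "groupid"); // &group.field=groupid
>>> QueryResponse rsp = server.query(q);
>>> // each group now bundles all the docs (files) sharing one groupid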
>>>
>>> But otherwise, I think I need a higher-level view of what you're
>>> trying to accomplish to make an intelligent comment.
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, May 23, 2013 at 9:05 AM,  <mark.ka...@t-systems.com> wrote:
>>>> Hello Erick,
>>>> Thank you for your fast answer.
>>>>
>>>> Maybe I didn't express my question clearly.
>>>> I want to index many files into one index entity. I would like the same
>>>> behavior as with any other multivalued field, which can be indexed
>>>> under one unique id.
>>>> So I think every ContentStreamUpdateRequest represents one index
>>>> entity, doesn't it? And with each addContentStream I add one file to
>>>> this entity.
>>>>
>>>> Thank you and with best Regards
>>>> Mark
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>>> Sent: Thursday, May 23, 2013 2:11 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: index multiple files into one index entity
>>>>
>>>> I just skimmed your post, but I'm responding to the last bit.
>>>>
>>>> If you have <uniqueKey> defined as "id" in schema.xml then no, you
>>>> cannot have multiple documents with the same ID.
>>>> Whenever a new doc comes in, it replaces the old doc with that ID.
>>>>
>>>> You can remove the <uniqueKey> definition and do what you want, but
>>>> there are very few Solr installations with no <uniqueKey>, and it's
>>>> probably a better idea to make your ids truly unique.
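>>>>
>>>> In your loop below, that would mean one request (and thus one Solr
>>>> document) per file, something like this sketch (reusing your class
>>>> names; "entityid" is a made-up field for tying the docs together):
>>>>
>>>> for (Content content : contentSet.getContentList()) {
>>>>     ContentStreamUpdateRequest req =
>>>>         new ContentStreamUpdateRequest(confStore.getExtractUrl());
>>>>     req.addContentStream(new ImaContentStream(content));
>>>>     // unique per file: parent id plus content id
>>>>     req.setParam("literal.id", indexId + "_" + content.getBinaryObjectId());
>>>>     req.setParam("literal.entityid", indexId);
>>>>     server.request(req);
>>>> }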
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Thu, May 23, 2013 at 6:14 AM,  <mark.ka...@t-systems.com> wrote:
>>>>> Hello solr team,
>>>>>
>>>>> I want to index multiple files into one Solr index entity, with the
>>>>> same id. We are using Solr 4.1.
>>>>>
>>>>>
>>>>> I tried it with the following source fragment:
>>>>>
>>>>>     public void addContentSet(ContentSet contentSet) throws SearchProviderException {
>>>>>
>>>>>                                 ...
>>>>>
>>>>>             ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
>>>>>             String indexId = contentSet.getIndexId();
>>>>>
>>>>>             ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
>>>>>             server.request(csur);
>>>>>
>>>>>                                 ...
>>>>>     }
>>>>>
>>>>>     private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
>>>>>             throws IOException {
>>>>>         ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());
>>>>>
>>>>>         ModifiableSolrParams parameters = csur.getParams();
>>>>>         if (parameters == null) {
>>>>>             parameters = new ModifiableSolrParams();
>>>>>         }
>>>>>
>>>>>         parameters.set("literalsOverride", "false");
>>>>>
>>>>>         // map the Tika default content attribute to the attribute named 'fulltext'
>>>>>         parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
>>>>>         // create an empty content stream; this seems necessary for ContentStreamUpdateRequest
>>>>>         csur.addContentStream(new ImaContentStream());
>>>>>
>>>>>         for (Content content : contentSet.getContentList()) {
>>>>>             csur.addContentStream(new ImaContentStream(content));
>>>>>             // for each content stream, add its attributes as literals
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(), content.getBinaryObjectId().toString());
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(), content.getContentKey());
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(), content.getContentName());
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(), content.getMimeType());
>>>>>         }
>>>>>
>>>>>         parameters.set("literal.id", indexId);
>>>>>
>>>>>         // adding some other attributes
>>>>>         ...
>>>>>
>>>>>         csur.setParams(parameters);
>>>>>
>>>>>         return csur;
>>>>>     }
>>>>>
>>>>> During debugging I can see that the method 'server.request(csur)'
>>>>> reads the buffer of each ImaContentStream.
>>>>> When I look at the Solr Catalina log I see that the attached files
>>>>> reach the Solr servlet.
>>>>>
>>>>> INFO: Releasing directory:/data/V-4-1/master0/data/index
>>>>> Apr 25, 2013 5:48:07 AM
>>>>> org.apache.solr.update.processor.LogUpdateProcessor finish
>>>>> INFO: [master0] webapp=/solr-4-1 path=/update/extract 
>>>>> params={literal.searchconnectortest15_c8150e41_cc49_4a ...... 
>>>>> &literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1& .....
>>>>> {add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58
>>>>>
>>>>>
>>>>> But only the last file in the content list gets indexed.
>>>>>
>>>>>
>>>>> My schema.xml has the following field definitions:
>>>>>
>>>>>     <field name="id" type="string" indexed="true" stored="true" 
>>>>> required="true" />
>>>>>     <field name="content" type="text_general" indexed="false"
>>>>> stored="true" multiValued="true"/>
>>>>>
>>>>>     <field name="contentkey" type="string" indexed="true" stored="true" 
>>>>> multiValued="true"/>
>>>>>     <field name="contentid" type="string" indexed="true" stored="true" 
>>>>> multiValued="true"/>
>>>>>     <field name="contentfilename " type="string" indexed="true" 
>>>>> stored="true" multiValued="true"/>
>>>>>     <field name="contentmimetype" type="string" indexed="true"
>>>>> stored="true" multiValued="true"/>
>>>>>
>>>>>     <field name="fulltext" type="text_general" indexed="true"
>>>>> stored="true" multiValued="true"/>
>>>>>
>>>>>
>>>>> I'm using the Tika ExtractingRequestHandler, which can extract binary
>>>>> files.
>>>>>
>>>>>
>>>>>
>>>>>   <requestHandler name="/update/extract"
>>>>>                   startup="lazy"
>>>>>                   class="solr.extraction.ExtractingRequestHandler" >
>>>>>     <lst name="defaults">
>>>>>       <str name="lowernames">true</str>
>>>>>       <str name="uprefix">ignored_</str>
>>>>>
>>>>>       <!-- capture link hrefs but ignore div attributes -->
>>>>>       <str name="captureAttr">true</str>
>>>>>       <str name="fmap.a">links</str>
>>>>>       <str name="fmap.div">ignored_</str>
>>>>>
>>>>>     </lst>
>>>>>   </requestHandler>
>>>>>
>>>>> Is it possible to index multiple files with the same id?
>>>>> Is it necessary to implement my own RequestHandler?
>>>>>
>>>>> With best regards Mark
>>>>>
>>>>>
>>>>>
>>
>
