You did not open source it by any chance? :-)

Personal blog:
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Sun, May 26, 2013 at 8:23 PM, Yury Kats <> wrote:
> That's exactly what happens. Each stream goes into a separate document.
> If all streams share the same unique id parameter, the last stream
> will overwrite everything.
> I asked this same question last year. Got no responses and ended up
> writing my own UpdateRequestProcessor.
> See
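To make the overwrite-versus-merge distinction concrete, here is a minimal plain-Java sketch. It uses no Solr classes and is not the processor mentioned above (that code is not shown in this thread). The class MergeByIdSketch, its methods, and the sample values are all hypothetical. It only models the two behaviors: the last document with a given id wins, versus all documents with the same id being folded into one multivalued document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model, no Solr classes involved. A "document" here is just a
// map from field name to its list of values. All names are hypothetical.
final class MergeByIdSketch {

    // Models what the thread describes: each add with an existing id
    // replaces the previous document, so only the last one survives.
    static Map<String, Map<String, List<String>>> overwrite(
            List<Map<String, List<String>>> docs, String keyField) {
        Map<String, Map<String, List<String>>> index = new LinkedHashMap<>();
        for (Map<String, List<String>> doc : docs) {
            index.put(doc.get(keyField).get(0), doc);
        }
        return index;
    }

    // Models the desired behavior: documents sharing an id are folded
    // into one, with every non-key field collecting all values in order.
    static Map<String, Map<String, List<String>>> merge(
            List<Map<String, List<String>>> docs, String keyField) {
        Map<String, Map<String, List<String>>> index = new LinkedHashMap<>();
        for (Map<String, List<String>> doc : docs) {
            String id = doc.get(keyField).get(0);
            Map<String, List<String>> target =
                    index.computeIfAbsent(id, k -> new LinkedHashMap<>());
            for (Map.Entry<String, List<String>> e : doc.entrySet()) {
                if (e.getKey().equals(keyField)) {
                    // Keep the key single-valued.
                    target.putIfAbsent(keyField, new ArrayList<>(e.getValue()));
                } else {
                    target.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                          .addAll(e.getValue());
                }
            }
        }
        return index;
    }

    // Helper that builds one extracted "document" for the demo.
    static Map<String, List<String>> doc(String id, String fileName, String text) {
        Map<String, List<String>> d = new LinkedHashMap<>();
        d.put("id", List.of(id));
        d.put("contentfilename", List.of(fileName));
        d.put("fulltext", List.of(text));
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> docs = List.of(
                doc("42", "a.pdf", "text of a"),
                doc("42", "b.doc", "text of b"));
        System.out.println(overwrite(docs, "id").get("42").get("fulltext"));
        System.out.println(merge(docs, "id").get("42").get("fulltext"));
    }
}
```

A real update processor would have to do this folding on Solr's own document objects before they reach the index. How it would collect the per-stream documents is an open design question that this sketch does not answer.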
> On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote:
>> If I understand correctly, the issue is:
>> 1) The client provides multiple content streams and expects Tika to
>> parse all of them and stick all the extracted content into one big
>> SolrDoc
>> 2) Tika (looking at the load() method of:
>> (Github link: ) does not actually suspect that
>> its load() method may be called multiple times and therefore happily
>> submits the document at the end of that call. It probably submits a new
>> document for each content source, which probably means it just
>> overwrites the same doc over and over again.
>> If I am right, then we have a bug in Tika handler's expectations (of
>> single load() call). The next step would be to put together a very
>> simple use case and open a Jira case with it.
>> Regards,
>>    Alex.
>> P.s. I am not a Solr code wrangler, so this MAY be completely wrong.
>> On Sun, May 26, 2013 at 10:46 AM, Erick Erickson
>> <> wrote:
>>> I'm still not quite getting the issue. Separate requests (i.e. any
>>> addition of a SolrInputDocument) are treated as a separate document.
>>> There's no notion of "append the contents of one doc to another based
>>> on ID", unless you're doing atomic updates.
>>> And Tika takes some care to index separate files as separate documents.
>>> Now, if you don't need these to share the same uniqueKey, you might
>>> index them as separate documents and include a field that lets you
>>> associate these documents somehow (see the group/field collapsing Wiki
>>> page).
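For the atomic-update route mentioned above, the following is a small sketch of what the request body could look like. It builds a JSON update that uses the "add" modifier to append one value to a multivalued field of an existing document. The assumptions are mine, not the thread's: the uniqueKey field is called "id", the text has already been extracted on the client, and the core is set up for atomic updates (as far as I know that means an update log and stored fields). The class AtomicAddSketch and its methods are hypothetical helpers, and the sketch only builds the string without sending anything.

```java
// Builds the JSON body for an atomic "add" update with plain string handling.
// Hypothetical helper; no Solr or JSON library is used.
final class AtomicAddSketch {

    // Minimal JSON string escaping (quotes, backslashes, control characters).
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Appends one value to a multivalued field of the document with the
    // given id. Assumes the uniqueKey field is named "id".
    static String atomicAdd(String id, String field, String value) {
        return "[{\"id\":" + quote(id) + ","
                + quote(field) + ":{\"add\":" + quote(value) + "}}]";
    }

    public static void main(String[] args) {
        System.out.println(atomicAdd("doc-1", "fulltext", "text of the second file"));
    }
}
```

One such body per file, posted to the regular update handler as JSON, would grow the same document step by step instead of replacing it. Whether the extract handler itself can issue updates like this is not something this thread establishes.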
>>> But otherwise, I think I need a higher-level view of what you're
>>> trying to accomplish to make an intelligent comment.
>>> Best
>>> Erick
>>> On Thu, May 23, 2013 at 9:05 AM,  <> wrote:
>>>> Hello Erick,
>>>> Thank you for your fast answer.
>>>> Maybe I didn't explain my question clearly.
>>>> I want to index many files into one index entity. I want the same
>>>> behavior as for any other multivalued field, where several values can
>>>> be indexed under one unique id.
>>>> So I think every ContentStreamUpdateRequest represents one index entity,
>>>> doesn't it? And with each addContentStream I would add one file to this
>>>> entity.
>>>> Thank you and with best Regards
>>>> Mark
>>>> -----Original Message-----
>>>> From: Erick Erickson []
>>>> Sent: Thursday, May 23, 2013 14:11
>>>> To:
>>>> Subject: Re: index multiple files into one index entity
>>>> I just skimmed your post, but I'm responding to the last bit.
>>>> If you have <uniqueKey> defined as "id" in schema.xml then no, you cannot 
>>>> have multiple documents with the same ID.
>>>> Whenever a new doc comes in it replaces the old doc with that ID.
>>>> You can remove the <uniqueKey> definition and do what you want, but there 
>>>> are very few Solr installations with no <uniqueKey> and it's probably a 
>>>> better idea to make your id's truly unique.
>>>> Best
>>>> Erick
>>>> On Thu, May 23, 2013 at 6:14 AM,  <> wrote:
>>>>> Hello solr team,
>>>>> I want to index multiple files into one Solr index entity, with the
>>>>> same id. We are using Solr 4.1.
>>>>> I tried it with the following source fragment:
>>>>>     public void addContentSet(ContentSet contentSet) throws SearchProviderException {
>>>>>             ...
>>>>>             ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
>>>>>             String indexId = contentSet.getIndexId();
>>>>>             ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
>>>>>             server.request(csur);
>>>>>             ...
>>>>>     }
>>>>>
>>>>>     private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
>>>>>             throws IOException {
>>>>>         ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());
>>>>>         ModifiableSolrParams parameters = csur.getParams();
>>>>>         if (parameters == null) {
>>>>>             parameters = new ModifiableSolrParams();
>>>>>         }
>>>>>         parameters.set("literalsOverride", "false");
>>>>>         // map the Tika default content attribute to the attribute named 'fulltext'
>>>>>         parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
>>>>>         // create an empty content stream; this seems necessary for ContentStreamUpdateRequest
>>>>>         csur.addContentStream(new ImaContentStream());
>>>>>         for (Content content : contentSet.getContentList()) {
>>>>>             csur.addContentStream(new ImaContentStream(content));
>>>>>             // for each content stream, add additional attributes
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(),
>>>>>                     content.getBinaryObjectId().toString());
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(),
>>>>>                     content.getContentKey());
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(),
>>>>>                     content.getContentName());
>>>>>             parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(),
>>>>>                     content.getMimeType());
>>>>>         }
>>>>>         parameters.set(" ", indexId);
>>>>>         // adding some other attributes
>>>>>         ...
>>>>>         csur.setParams(parameters);
>>>>>         return csur;
>>>>>     }
>>>>> During debugging I can see that the method 'server.request(csur)' reads
>>>>> the buffer of each ImaContentStream.
>>>>> When I look at the Solr catalina log, I see that the attached files reach
>>>>> the Solr servlet.
>>>>> INFO: Releasing directory:/data/V-4-1/master0/data/index
>>>>> Apr 25, 2013 5:48:07 AM
>>>>> org.apache.solr.update.processor.LogUpdateProcessor finish
>>>>> INFO: [master0] webapp=/solr-4-1 path=/update/extract 
>>>>> params={literal.searchconnectortest15_c8150e41_cc49_4a ...... 
>>>>> & .....
>>>>> {add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216),
>>>>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58
>>>>> But only the last entry in the content list ends up indexed.
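The log excerpt above shows the same id being added six times, each with a different version. A quick way to spot this pattern in a larger log is to count add entries per id. The following is only a sketch: the class AddLogSketch and its regex are my own, and it assumes log lines keep the `add=[<id> (<version>), ...]` shape seen in the excerpt.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting an update log line of the shape shown
// in the excerpt: {add=[<id> (<version>), <id> (<version>), ...]} ...
final class AddLogSketch {

    // Matches one "<id> (<version>)" pair inside add=[...].
    private static final Pattern ENTRY =
            Pattern.compile("([^\\s,\\[\\]]+) \\((\\d+)\\)");

    // Counts how many add entries the log line contains per document id.
    static Map<String, Integer> addsPerId(String logLine) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int start = logLine.indexOf("add=[");
        if (start < 0) {
            return counts; // no add section in this line
        }
        int end = logLine.indexOf(']', start);
        String body = logLine.substring(start + 5, end < 0 ? logLine.length() : end);
        Matcher m = ENTRY.matcher(body);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "{add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), "
                + "26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424)]} 0 58";
        addsPerId(line).forEach((id, n) -> {
            if (n > 1) {
                System.out.println(id + " was added " + n + " times in one request");
            }
        });
    }
}
```

An id that shows up more than once in a single request is a hint that each content stream was submitted as its own document and that earlier ones were replaced.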
>>>>> My schema.xml has the following field definitions:
>>>>>     <field name="id" type="string" indexed="true" stored="true" required="true" />
>>>>>     <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
>>>>>     <field name="contentkey" type="string" indexed="true" stored="true" multiValued="true"/>
>>>>>     <field name="contentid" type="string" indexed="true" stored="true" multiValued="true"/>
>>>>>     <field name="contentfilename" type="string" indexed="true" stored="true" multiValued="true"/>
>>>>>     <field name="contentmimetype" type="string" indexed="true" stored="true" multiValued="true"/>
>>>>>     <field name="fulltext" type="text_general" indexed="true" stored="true" multiValued="true"/>
>>>>> I'm using the Tika ExtractingRequestHandler, which can extract text
>>>>> from binary files.
>>>>>   <requestHandler name="/update/extract"
>>>>>                   startup="lazy"
>>>>>                   class="solr.extraction.ExtractingRequestHandler" >
>>>>>     <lst name="defaults">
>>>>>       <str name="lowernames">true</str>
>>>>>       <str name="uprefix">ignored_</str>
>>>>>       <!-- capture link hrefs but ignore div attributes -->
>>>>>       <str name="captureAttr">true</str>
>>>>>       <str name="fmap.a">links</str>
>>>>>       <str name="fmap.div">ignored_</str>
>>>>>     </lst>
>>>>>   </requestHandler>
>>>>> Is it possible to index multiple files with the same id?
>>>>> Is it necessary to implement my own RequestHandler?
>>>>> With best regards Mark