Okay. I tried using the id that is formed by the ManifoldCF Documentum connector. While the job was running, I could see from the Solr admin screen that documents were getting indexed. But right after the job ends, all the indexes I created get deleted.
A snippet from Simple History is given below. Why does this document deletion activity get added and delete all my created indexes when I keep the unique id as "id" in the schema.xml file of Solr?

Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
03-29-2012 13:00:26.837 | document deletion (Solr_TEST_QA) | http://example.domain.com:8088/webtop/component/drl?versio... nLabel=CURRENT&objectId=09d905e78000676d | 200 | 0 | 110 |
03-29-2012 12:55:37.869 | fetch | 09d905e78000676d | REJECTED | 86823 | 4184 |
03-29-2012 12:55:34.934 | document ingest (Solr_TEST_QA) | http://example.domain.com:8088/webtop/component/drl?versio... nLabel=CURRENT&objectId=09d905e78000676d | 200 | 8158 | 235 |

On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright <daddy...@gmail.com> wrote:
> "So do you find this design appropriate and feasible ?" It sounds
> like you are still trying to merge records in Solr but this time using
> Solr Cell to somehow do this. Since SolrCell is a pipeline, I don't
> think you will find it easy to keep data from one job aligned with
> data from another. That's why I suggested just allowing both kinds of
> documents to be indexed as-is, and just making sure that you include a
> metadata reference to the main document in each.
>
> Karl
>
>
> On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
> <anupam...@gmail.com> wrote:
> > The second option seems to be more useful as it will allow me add to any
> > business logic.
> > So similar to SOLR Cell (/update/extract) my new RequestHandler will be
> > added in solrconfig.xml which will do all the manipulations.
> > Later, I need to get all field values into a temp variable by first
> > searching by id in the lucene indexes and then add these values into the
> > incoming new field values list.
> >
> > So do you find this design appropriate and feasible ?
> >
> > Anupam
> >
> > On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <daddy...@gmail.com> wrote:
> >>
> >> Thanks - now I understand what you are trying to do more clearly.
> >>
> >> The Documentum connector is going to pick up the XML document and the
> >> PDF document as separate entities. Thus, they'd also be indexed in
> >> Solr separately. So if we use that as a starting point, let's see
> >> where it might lead.
> >>
> >> First, you'd want each PDF document to have metadata that refers back
> >> to the XML parent document. I'm not sure how easy it is to set up
> >> such a metadata reference in Documentum, but I vaguely recall there
> >> was indeed some such field. So let's presume you can get that. Then,
> >> you'd want to make sure your Solr schema included an "XML document"
> >> field, which had the URL of the parent XML document (or, for XML
> >> documents, the document's own URL) as content. That would be the ID
> >> you'd use to present a result item to a user.
> >>
> >> Does this sound reasonable so far?
> >>
> >> The only other piece you might need is manipulation of either the
> >> PDF's metadata, or the XML document's metadata, or both. For that,
> >> I'd use Solr Cell to perform whatever mappings and manipulations made
> >> sense before the documents actually get indexed.
> >>
> >> Karl
> >>
> >> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
> >> <anupam...@gmail.com> wrote:
> >> > I would have been happy if I had to index PDF and XML separately.
> >> > But for my use-case. XML is the main document containing bibliographic
> >> > information (which needs to presented as search result) and consists a
> >> > reference to a child/supporting document which is a actual PDF file. I
> >> > need to index the PDF text and if any search matches with the PDF
> >> > content then the parent/XML bibliographic information needs to be
> >> > presented.
> >> >
> >> > I am trying to call the SOLR search engine with one single query to show
> >> > corresponding XML detail for a search term present in the PDF. I checked
> >> > that from SOLR 4.x version SOLR-Join Plugin is introduced.
> >> > (http://wiki.apache.org/solr/Join) but work like inner query.
> >> >
> >> > Again the main requirement is that the PDF should be searchable but it
> >> > master details from XML should only be presented to request the actual
> >> > PDF.
> >> >
> >> > -Anupam
> >> >
> >> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <daddy...@gmail.com>
> >> > wrote:
> >> >>
> >> >> This doesn't sound like a problem a connector can solve. The problem
> >> >> sounds like severe misuse of Solr/Lucene to me. You are using the
> >> >> wrong document key and Lucene does not let you modify a document index
> >> >> once it is created, and no matter what you do to ManifoldCF it can't
> >> >> get around that restriction. So it sounds like you need to
> >> >> fundamentally rethink your design.
> >> >>
> >> >> If all you want to do is index XML and PDF as separate documents, just
> >> >> change your Solr output connection specification to change the
> >> >> selected "id" field appropriately. Then, BOTH documents will be
> >> >> indexed by Solr, each with different metadata as you originally
> >> >> specified. I'm frankly having a really hard time seeing why this is
> >> >> so hard.
> >> >>
> >> >> Karl
> >> >>
> >> >>
> >> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
> >> >> <anupam...@gmail.com> wrote:
> >> >> > Should I write a new Documentum Connector with our specific use-case
> >> >> > to go forward ?
> >> >> > I guess your book will be helpful to understand connector framework
> >> >> > in manifoldcf.
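[Illustration] The design suggested in this thread (index the XML and the PDF as separate documents, each carrying a reference to the parent XML document, and present the parent as the result) can be sketched in a few lines of Python. This is only a toy in-memory model under stated assumptions, not Solr or ManifoldCF code: the field names `parent_id`, `title` and `text`, the ids, and the helper `search_parents` are all hypothetical.

```python
# Toy model only: plain dicts stand in for indexed documents.
# Field names (parent_id, title, text) and ids are hypothetical.
docs = [
    {"id": "xml-1", "parent_id": "xml-1", "title": "Record one", "text": ""},
    {"id": "pdf-1", "parent_id": "xml-1", "text": "Full text about crawler jobs"},
    {"id": "xml-2", "parent_id": "xml-2", "title": "Record two", "text": ""},
    {"id": "pdf-2", "parent_id": "xml-2", "text": "Full text about schema design"},
]


def search_parents(docs, term):
    """Match the term against any document's text, then return the
    parent (XML) documents of the matches, without duplicates."""
    by_id = {d["id"]: d for d in docs}
    parent_ids = []
    for d in docs:
        if term.lower() in d["text"].lower() and d["parent_id"] not in parent_ids:
            parent_ids.append(d["parent_id"])
    return [by_id[pid] for pid in parent_ids]


# A term that occurs only in a PDF still yields the XML record.
hits = search_parents(docs, "crawler")
```

In a real deployment the second step would more likely be a follow-up lookup by the stored parent URL, or a join if the Solr version in use supports one; that part is an assumption and is not confirmed anywhere in this thread.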
> >> >> >
> >> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <daddy...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Right, LUCENE never did allow you to modify a document's indexes,
> >> >> >> only replace them. What I'm trying to tell you is that there is no
> >> >> >> reason to have the same document ID for both documents. ManifoldCF
> >> >> >> will support treating the XML document and PDF document as different
> >> >> >> documents in Solr. But if you want them to in fact be the same
> >> >> >> document, just combined in some way, neither ManifoldCF nor Lucene
> >> >> >> will support that at this time.
> >> >> >>
> >> >> >> Karl
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
> >> >> >> <anupam...@gmail.com> wrote:
> >> >> >> > I saw that the index getting created by 1st PDF indexing job which
> >> >> >> > worked perfectly well for a particular id. Later when i ran the 2nd
> >> >> >> > XML indexing Job for the same id. I lost all field indexed by the
> >> >> >> > 1st job and i was left out with field indexes created my this 2nd
> >> >> >> > job.
> >> >> >> >
> >> >> >> > I thought that it would combine field values for a specified doc
> >> >> >> > id.
> >> >> >> >
> >> >> >> > As per Lucene developers they mention that by design Lucene
> >> >> >> > doesn't support this.
> >> >> >> >
> >> >> >> > Pls. see following url ::
> >> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
> >> >> >> >
> >> >> >> > -Anupam
> >> >> >> >
> >> >> >> >
> >> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddy...@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> The Solr handler that you are using should not matter here.
> >> >> >> >>
> >> >> >> >> Can you look at the Simple History report, and do the following:
> >> >> >> >>
> >> >> >> >> - Look for a document that is being indexed in both PDF and XML.
> >> >> >> >> - Find the "ingestion" activity for that document for both PDF
> >> >> >> >> and XML
> >> >> >> >> - Compare the ID's (which for the ingestion activity are the
> >> >> >> >> URL's of the documents in Webtop)
> >> >> >> >>
> >> >> >> >> If the URLs are in fact different, then you should be able to
> >> >> >> >> make this work. You need to look at how you configured your Solr
> >> >> >> >> instance, and which fields you are specifying in your Solr output
> >> >> >> >> connection. You want those Webtop urls to be indexed as the unique
> >> >> >> >> document identifier in Solr, not some other ID.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >> Karl
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
> >> >> >> >> <anupam...@gmail.com> wrote:
> >> >> >> >> > Today I ran 2 job one by one but it seems since we are using
> >> >> >> >> > /update/extract Request Handler the field values for common id
> >> >> >> >> > gets overridden by the latest job. I want to update certain field
> >> >> >> >> > in the lucene indexes for the doc rather than completely update
> >> >> >> >> > with new values and by loosing other field value entries.
> >> >> >> >> >
> >> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright
> >> >> >> >> > <daddy...@gmail.com> wrote:
> >> >> >> >> >> For Documentum, content length is in bytes, I believe. It
> >> >> >> >> >> does not set the length, it filters out all documents greater
> >> >> >> >> >> than the specified length. Leaving the field blank will perform
> >> >> >> >> >> no filtering.
> >> >> >> >> >>
> >> >> >> >> >> Document types in Documentum are specified by mime type, so
> >> >> >> >> >> you'd want to select all that apply.
> >> >> >> >> >> The actual one used will depend on how your
> >> >> >> >> >> particular instance of Documentum is configured, but if you
> >> >> >> >> >> pick them all you should have no problem.
> >> >> >> >> >>
> >> >> >> >> >> Karl
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
> >> >> >> >> >> <anupam...@gmail.com> wrote:
> >> >> >> >> >>> Thanks!! Seems from your explanation that i can update same
> >> >> >> >> >>> documents other field values. I inquired about this because I
> >> >> >> >> >>> have two different document with a parent-child relationship
> >> >> >> >> >>> which needs to be indexed as one document in lucene index.
> >> >> >> >> >>>
> >> >> >> >> >>> As you must have understood by now that i am trying to do
> >> >> >> >> >>> this for Documentum CMS. I have seen the configuration screen
> >> >> >> >> >>> for setting the Content length & second for filtering document
> >> >> >> >> >>> type. So my question is what unit the Content length accepts
> >> >> >> >> >>> values (bit,bytes,KB,MB etc) & whether this configuration set
> >> >> >> >> >>> the lengths for documents full text indexing ?.
> >> >> >> >> >>>
> >> >> >> >> >>> Additionally to scan only one kind of document e.g PDF what
> >> >> >> >> >>> should be added to filter those documents? is it
> >> >> >> >> >>> application/pdf OR PDF ?
> >> >> >> >> >>>
> >> >> >> >> >>> Regards
> >> >> >> >> >>> Anupam
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright
> >> >> >> >> >>> <daddy...@gmail.com> wrote:
> >> >> >> >> >>>>
> >> >> >> >> >>>> The document key in Solr is the url of the document, as
> >> >> >> >> >>>> constructed by the connector you are using. If you are using
> >> >> >> >> >>>> the same document to construct two different Solr documents,
> >> >> >> >> >>>> ManifoldCF by definition cannot be aware of this. But if these
> >> >> >> >> >>>> are different files from the point of view of ManifoldCF they
> >> >> >> >> >>>> will have different URLs and be treated differently. The jobs
> >> >> >> >> >>>> can overlap in this case with no difficulty.
> >> >> >> >> >>>>
> >> >> >> >> >>>> Karl
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
> >> >> >> >> >>>> <anupam...@gmail.com> wrote:
> >> >> >> >> >>>> > I want to configure two jobs to index in SOLR using
> >> >> >> >> >>>> > ManifoldCF using /extract/update requestHandler.
> >> >> >> >> >>>> > 1st to synchronize only XML files & 2nd to synchronize the
> >> >> >> >> >>>> > PDF file.
> >> >> >> >> >>>> > If both these document share a unique id. Can i combine
> >> >> >> >> >>>> > the indexes for both in 1 SOLR schema without overriding the
> >> >> >> >> >>>> > details added by previous job.
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > suppose,
> >> >> >> >> >>>> > xmldoc indexes field0(id), field1, field2, field3
> >> >> >> >> >>>> > & pdfdoc indexes field0(id), field4, field5, field6.
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1,
> >> >> >> >> >>>> > field2, field3, field4, field5, field6
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > Regards
> >> >> >> >> >>>> > Anupam

--
Thanks & Regards
Anupam Bhattacharya
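
[Illustration] The behaviour discussed throughout this thread (a second document added under the same unique key replaces the first instead of merging with it) can be shown with a toy model. This is a minimal sketch in plain Python, not the Solr or Lucene API: `ToyIndex`, its methods, and the ids are hypothetical names, and the field names follow the field0..field6 example in the original question.

```python
# Toy in-memory model of "add by unique key" semantics.
# This is NOT Solr or Lucene code; ToyIndex is a hypothetical stand-in.
class ToyIndex:
    def __init__(self):
        self.docs = {}  # unique key -> dict of field values

    def add(self, doc):
        # A document whose key already exists replaces the whole
        # earlier document; none of its fields are carried over.
        self.docs[doc["id"]] = dict(doc)

    def get(self, doc_id):
        return self.docs.get(doc_id)


index = ToyIndex()

# Two jobs writing under the same id: the second add wins.
index.add({"id": "doc-1", "field1": "a", "field2": "b", "field3": "c"})
index.add({"id": "doc-1", "field4": "d", "field5": "e", "field6": "f"})
same_key = index.get("doc-1")

# Two jobs writing under distinct ids: both documents are kept.
index.add({"id": "doc-1-xml", "field1": "a", "field2": "b", "field3": "c"})
index.add({"id": "doc-1-pdf", "field4": "d", "field5": "e", "field6": "f"})
```

After the first two calls, `same_key` holds only field4 to field6, which matches what was observed when the XML job ran after the PDF job. With distinct ids, both entries remain available.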