Re: How to index data from multiple data source
Thanks Alex. Indeed, the relative path to the PDF document is stored in the database. I will try your approach.

Regards,
Yusniel Hidalgo

- Original message -
From: "Alexandre Rafalovitch"
To: "solr-user"
Sent: Saturday, 24 January 2015 18:19:48
Subject: Re: How to index data from multiple data source

You could use nested entities in DIH. So, if you store - for example - path to the PDF in the database, you could do a nested entity with TikaEntityProcessor to load the content. Just make sure the field names do not conflict.

---
12th Anniversary of the founding of the Universidad de las Ciencias Informáticas. 12 years of history alongside Fidel. 12 December 2014.
Re: How to index data from multiple data source
You could use nested entities in DIH. For example, if you store the path to the PDF in the database, you could add a nested entity with TikaEntityProcessor to load the content. Just make sure the field names do not conflict.

Regards,
Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 24 January 2015 at 18:11, Yusniel Hidalgo wrote:
> Dear Solr community,
>
> I have recently started working with Solr and need help with the following usage scenario. I am working on a project to extract and search bibliographic metadata from PDF files. First, my PDF files are processed to extract bibliographic metadata such as title, authors, affiliations, keywords and abstract. This metadata is stored in a relational database and then indexed in Solr via DIH. However, I also need to index the full text of each PDF and keep the same ID between the metadata indexed from DIH and the full text of the PDF indexed in Solr. How can I do that? How should I configure solrconfig.xml and schema.xml to do it?
>
> Thanks in advance.
>
> Best regards.
>
> Yusniel Hidalgo
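Alex's nested-entity suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: the JDBC settings, the table `papers`, its columns, the base directory `/data/pdfs` and the Solr field `fulltext` are all hypothetical and not taken from the thread. The DIH configuration is held in a Python string so that it can be checked for well-formedness and for conflicting field names before it is copied into a real data-config.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical DIH data-config.xml. The JDBC settings, the table "papers",
# its columns, the base directory /data/pdfs and the Solr field "fulltext"
# are assumptions made for illustration only.
DATA_CONFIG = """\
<dataConfig>
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/biblio"
              user="solr" password="secret"/>
  <dataSource name="files" type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: one Solr document per database row. -->
    <entity name="paper" dataSource="db"
            query="SELECT id, title, abstract, pdf_path FROM papers">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="abstract" name="abstract"/>
      <!-- Nested entity: Tika reads the PDF named by the parent row. -->
      <entity name="pdf" processor="TikaEntityProcessor" dataSource="files"
              url="/data/pdfs/${paper.pdf_path}" format="text" onError="skip">
        <!-- "text" is the column Tika fills with the extracted body. -->
        <field column="text" name="fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
"""


def field_names(entity):
    """Return the Solr field names mapped directly under one entity."""
    return [field.get("name") for field in entity.findall("field")]


def conflicting_fields(config_xml):
    """Return field names mapped by both the outer and a nested entity."""
    root = ET.fromstring(config_xml)
    outer = root.find("document/entity")
    outer_names = set(field_names(outer))
    conflicts = set()
    for inner in outer.findall("entity"):
        conflicts |= outer_names & set(field_names(inner))
    return sorted(conflicts)


if __name__ == "__main__":
    print("conflicting field names:", conflicting_fields(DATA_CONFIG))
```

Because the PDF content is loaded inside the same outer entity, the metadata and the full text end up in one Solr document that shares the `id` taken from the database row.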
How to index data from multiple data source
Dear Solr community,

I have recently started working with Solr and need help with the following usage scenario. I am working on a project to extract and search bibliographic metadata from PDF files. First, my PDF files are processed to extract bibliographic metadata such as title, authors, affiliations, keywords and abstract. This metadata is stored in a relational database and then indexed in Solr via DIH. However, I also need to index the full text of each PDF and keep the same ID between the metadata indexed from DIH and the full text of the PDF indexed in Solr. How can I do that? How should I configure solrconfig.xml and schema.xml to do it?

Thanks in advance.

Best regards.

Yusniel Hidalgo
Re: How to index data from multiple data source
Hi Yusniel,

Solr manages documents as a whole, so updating an existing document means replacing it. You could therefore index the metadata and the full text in one step, as one Solr document under one unique ID. That would be the simplest case.

You could also use nested child documents with block joins (depending on which version of Solr you are using; more info here: http://blog.griddynamics.com/2013/09/solr-block-join-support.html), but in my opinion this would be overkill.

We also mimic a kind of "semantic linked data" using additional fields (named after real ontology predicate/property names) to join related documents; see https://wiki.apache.org/solr/Join. So you could add the full text as an additional document with its own ID and fill one of its fields with the ID of the parent metadata document. Then, at query time, you can join them. Joins in Solr always return only the joined (TO) documents, not both sides (it is not like an SQL join, more like an inner query), so we experimented with self joins (the field holding the parent document ID also holds the document's own ID), but as you can understand this is in no way optimal.

Related: we use a digital objects repository (Fedora Commons + Islandora) to achieve exactly what you want to do. Our PDF files, along with many other types of data and metadata, are ingested as objects inside the repository, including technical metadata, MODS, DC, binary stream and full text. The whole object (as FOXML) then goes through an XSLT transformation and into Solr.

If you are interested, you can browse Islandora's Google group (https://groups.google.com/forum/#!forum/islandora) and visit Islandora's wiki (https://wiki.duraspace.org/display/ISLANDORA714/Islandora). There is a lot of documentation under the fedoragsearch module, which does the actual indexing. You can see our schemas and Solr config there.

Feel free to write to me if you need or want more information.
Cheers,
Diego Pino Navarro
Krayon Media
Pedro de Valdivia 575
Pucón - Chile
F:+56-45-2442469
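The join variant Diego describes could look roughly like this at query time. It is a minimal sketch, assuming each full-text document carries a field `parent_id` holding the ID of its metadata document and a searchable field `fulltext`; both field names and the core URL are hypothetical. The `{!join from=... to=...}` syntax is the join query parser described on the Solr Join wiki page he links to.

```python
from urllib.parse import urlencode


def join_query(text_query, from_field="parent_id", to_field="id"):
    """Build a Solr join query.

    The inner query matches the full-text documents; Solr then returns
    the documents whose to_field equals the from_field of a match,
    which here are the metadata documents.
    """
    return "{!join from=%s to=%s}%s" % (from_field, to_field, text_query)


def select_url(core_url, text_query):
    """Build a /select URL for the join query (core_url is hypothetical)."""
    params = {"q": join_query(text_query), "wt": "json"}
    return core_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(join_query("fulltext:ontology"))
    print(select_url("http://localhost:8983/solr/papers", "fulltext:ontology"))
```

As noted in the message, only the TO side (the metadata documents) comes back from such a query, so the matching full text itself is not part of the result.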
Re: How to index data from multiple data source
On 1/20/2015 10:43 PM, Yusniel Hidalgo Delgado wrote:
> I have recently started working with Solr and need help with the following usage scenario. I am working on a project to extract and search bibliographic metadata from PDF files. First, my PDF files are processed to extract bibliographic metadata such as title, authors, affiliations, keywords and abstract. This metadata is stored in a relational database and then indexed in Solr via DIH. However, I also need to index the full text of each PDF and keep the same ID between the metadata indexed and the full text of the PDF indexed in Solr. How can I do that? How should I configure solrconfig.xml and schema.xml to do it?

How are you doing the indexing? If it's in a program you wrote yourself, simply extend that program to obtain the information you need and add it to the document that you index. The Apache Tika project is one way to parse rich text documents.

If you are using the dataimport handler, you are likely to need a nested entity to gather the additional information and include it in the document that is being indexed in the parent entity.

The reply from Alvaro shows one way to integrate Tika into DIH. It looks like those instructions are geared to an extremely old Solr version (3.6.2) and probably won't work as-is on a newer version. Solr 4.x was already available when that blog post was written two years ago, so I don't know why they went with 3.6.2.

Thanks,
Shawn
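For the first case Shawn mentions, a self-written indexing program, the idea of adding the full text to the same document could be sketched as below. This is a sketch under assumptions: the text is taken to be already extracted from the PDF (for instance with Apache Tika), and the field name `fulltext` and the sample row are hypothetical. The resulting JSON array is the kind of payload that could be posted to a core's update handler with the content type `application/json`; the exact handler path depends on the Solr version.

```python
import json


def build_solr_doc(metadata, fulltext, text_field="fulltext"):
    """Merge one database row and the extracted PDF text into one document.

    The document keeps the ID from the metadata row, so metadata and
    full text share a single unique key in the index.
    """
    if text_field in metadata:
        raise ValueError("field name conflict: %r" % text_field)
    doc = dict(metadata)  # copy, so the caller's row is left unchanged
    doc[text_field] = fulltext
    return doc


def update_payload(docs):
    """Serialize documents as a JSON array for a Solr JSON update request."""
    return json.dumps(list(docs))


if __name__ == "__main__":
    # Hypothetical row as it might come from the relational database.
    row = {
        "id": "paper-42",
        "title": "An example title",
        "authors": ["A. Author", "B. Author"],
    }
    doc = build_solr_doc(row, "Text extracted from the PDF would go here.")
    print(update_payload([doc]))
```

Since Solr replaces a document when one with the same ID is indexed again, sending metadata and full text together in one document avoids one update overwriting the other.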
Re: How to index data from multiple data source
Hi,

You can find several examples of configuring Tika + DIH to index PDF files on the internet (e.g. https://tuxdna.wordpress.com/2013/02/04/indexing-the-documents-stored-in-a-database-using-apache-solr-and-apache-tika/ ).

Regards.
How to index data from multiple data source
Dear Solr community,

I have recently started working with Solr and need help with the following usage scenario. I am working on a project to extract and search bibliographic metadata from PDF files. First, my PDF files are processed to extract bibliographic metadata such as title, authors, affiliations, keywords and abstract. This metadata is stored in a relational database and then indexed in Solr via DIH. However, I also need to index the full text of each PDF and keep the same ID between the metadata indexed and the full text of the PDF indexed in Solr. How can I do that? How should I configure solrconfig.xml and schema.xml to do it?

Thanks in advance.

Best regards,
Yusniel Hidalgo Delgado
Semantic Web Research Group
University of Informatics Sciences
http://gws-uci.blogspot.com/
Havana, Cuba