Re: How to index data from multiple data source

2015-01-24 Thread Yusniel Hidalgo
Thanks, Alex. Indeed, the relative path to each PDF document is stored in the
database. I will try your approach.

Regards,

Yusniel Hidalgo

---
12th anniversary of the founding of the Universidad de las Ciencias
Informáticas. 12 years of history alongside Fidel. December 12, 2014.



Re: How to index data from multiple data source

2015-01-24 Thread Alexandre Rafalovitch
You could use nested entities in DIH.

So, if you store - for example - path to the PDF in the database, you
could do a nested entity with TikaEntityProcessor to load the content.
Just make sure the field names do not conflict.
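
As a rough illustration of that idea, a data-config.xml along these lines
might work. This is only a sketch under assumptions: the table and column
names (papers, id, title, abstract, pdf_path), the JDBC driver and connection
settings, the base directory /data/pdfs and the Solr field name fulltext are
all made up for the example and are not taken from this thread.

```xml
<dataConfig>
  <!-- Database holding the extracted metadata (connection details are placeholders) -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/biblio"
              user="solr" password="secret"/>
  <!-- Binary file source that lets Tika read the PDF files from disk -->
  <dataSource name="pdf" type="BinFileDataSource"/>

  <document>
    <!-- Outer entity: one Solr document per database row -->
    <entity name="paper" dataSource="db"
            query="SELECT id, title, abstract, pdf_path FROM papers">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="abstract" name="abstract"/>

      <!-- Nested entity: parse the PDF referenced by the current row.
           The assumed base directory is prepended to the relative path. -->
      <entity name="pdfContent" processor="TikaEntityProcessor" dataSource="pdf"
              url="/data/pdfs/${paper.pdf_path}" format="text">
        <!-- Tika exposes the extracted body as the column "text" -->
        <field column="text" name="fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because the Tika entity is nested inside the database entity, the extracted
text lands in the same Solr document as the metadata, under the same id. The
target field (fulltext here) has to exist in schema.xml and must not clash
with a column name coming from the database.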

Regards,
   Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


How to index data from multiple data source

2015-01-24 Thread Yusniel Hidalgo
Dear Solr community, 

I have recently started working with Solr and I need help with the following
usage scenario. I am working on a project to extract and search bibliographic
metadata from PDF files. First, my PDF files are processed to extract
bibliographic metadata such as title, authors, affiliations, keywords and
abstract. This metadata is stored in a relational database and then indexed in
Solr via DIH. However, I also need to index the full text of each PDF and keep
the same ID between the metadata indexed from DIH and the full text of the PDF
indexed in Solr. How can I do that? How should I configure solrconfig.xml and
schema.xml to do it?

Thanks in advance. 

Best regards. 

Yusniel Hidalgo




Re: How to index data from multiple data source

2015-01-21 Thread Diego Pino
Hi Yusniel,

Solr manages documents as a whole, so updating an existing document means
replacing it. You should (or at least could) therefore index the metadata and
the full text in one step, as one Solr document under one unique ID. That would
be the simplest case. You could also use nested child documents with block
joins (depending on which version of Solr you are using; more info here:
http://blog.griddynamics.com/2013/09/solr-block-join-support.html), but in my
opinion that would be overkill. We also mimic a kind of "semantic linked data"
using additional fields (named after real ontology predicate/property names)
to join documents that are related; see https://wiki.apache.org/solr/Join. So
you could add the full text as an additional document with its own ID and fill
a field of that Solr document with the ID of the parent metadata document.
Then, at query time, you can join them. Joins in Solr always return the joined
(TO) document, not both (it is not like a SQL join, more like an inner query),
so we experimented with self joins (the field holding the parent document ID
also holds the document's own ID), but as you can understand this is in no way
optimal.
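
To make the two options concrete, here is a sketch of what the XML update
messages might look like. The field names (title, author, fulltext,
parent_id) and the IDs are hypothetical; whatever names you choose must be
defined in your schema.xml. The two <add> elements are alternative update
messages, not one file.

```xml
<!-- Option 1: metadata and full text in a single document under one ID -->
<add>
  <doc>
    <field name="id">paper-42</field>
    <field name="title">An example title</field>
    <field name="author">A. Author</field>
    <field name="fulltext">Text extracted from the PDF goes here.</field>
  </doc>
</add>

<!-- Option 2: a separate full-text document that points at its parent -->
<add>
  <doc>
    <field name="id">paper-42</field>
    <field name="title">An example title</field>
    <field name="author">A. Author</field>
  </doc>
  <doc>
    <field name="id">paper-42-fulltext</field>
    <field name="parent_id">paper-42</field>
    <field name="fulltext">Text extracted from the PDF goes here.</field>
  </doc>
</add>
```

With option 2, a query such as q={!join from=parent_id to=id}fulltext:solr
searches the full-text documents and returns the matching parent metadata
documents only, which is the "TO side only" behaviour described above.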

Related: We are using a digital object repository (Fedora Commons + Islandora)
to achieve exactly what you want to do. Our PDF files, and many other types of
data and metadata, are ingested as objects inside the repository, including
technical metadata, MODS, DC, the binary stream and the full text. Each whole
object (as FOXML) then goes through an XSLT transformation and into Solr. If
you are interested, you can browse Islandora's Google group
(https://groups.google.com/forum/#!forum/islandora) and visit Islandora's wiki
(https://wiki.duraspace.org/display/ISLANDORA714/Islandora). There is a lot of
documentation under the fedoragsearch module, which does the actual indexing.
You can see our schemas and Solr config there.

Feel free to write to me if you need or want more information.

Cheers

Diego Pino Navarro
Krayon Media
Pedro de Valdivia 575
Pucón - Chile
F:+56-45-2442469




Re: How to index data from multiple data source

2015-01-21 Thread Shawn Heisey
How are you doing the indexing?  If it's in a program you wrote
yourself, simply extend that program to obtain the information you need
and add it to the document that you index.  The Apache Tika project is
one way to parse rich text documents.

If you are using the dataimport handler, you are likely to need a nested
entity to gather the additional information and include it in the
document that is being indexed in the parent entity. The reply from
Alvaro shows one way to integrate Tika into DIH.  It looks like those
instructions are geared to an extremely old Solr version (3.6.2) and
probably won't work as-is on a newer version.  Solr 4.x was already
available when that blog post was written two years ago, so I don't know
why they went with 3.6.2.
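
For the DIH route on a 4.x install, the solrconfig.xml side might look roughly
like the fragment below. Treat it as a sketch: the <lib> directory paths are
assumptions that depend on where your Solr distribution lives relative to the
core, and data-config.xml is just a conventional file name for the DIH
configuration in the core's conf directory.

```xml
<config>
  <!-- ... the rest of solrconfig.xml is omitted here ... -->

  <!-- Load the DIH jars (including the "extras" jar with TikaEntityProcessor)
       and the Tika libraries shipped with the extraction contrib.
       The directories are assumed; adjust them to your installation. -->
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar"/>
  <lib dir="../../../contrib/extraction/lib" regex=".*\.jar"/>

  <!-- Register the DataImportHandler and point it at the DIH config file -->
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
</config>
```

The nested entity itself goes into the file named by the config parameter, not
into solrconfig.xml.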

Thanks,
Shawn



Re: How to index data from multiple data source

2015-01-20 Thread Alvaro Cabrerizo
Hi,

You can find several examples online of configuring Tika + DIH to index PDF
files, e.g.
https://tuxdna.wordpress.com/2013/02/04/indexing-the-documents-stored-in-a-database-using-apache-solr-and-apache-tika/

Regards.


How to index data from multiple data source

2015-01-20 Thread Yusniel Hidalgo Delgado


Dear Solr community, 




I have recently started working with Solr and I need help with the following
usage scenario. I am working on a project to extract and search bibliographic
metadata from PDF files. First, my PDF files are processed to extract
bibliographic metadata such as title, authors, affiliations, keywords and
abstract. This metadata is stored in a relational database and then indexed in
Solr via DIH. However, I also need to index the full text of each PDF and keep
the same ID between the indexed metadata and the full text of the PDF indexed
in Solr. How can I do that? How should I configure solrconfig.xml and
schema.xml to do it?




Thanks in advance. 




Best regards 

Yusniel Hidalgo Delgado 
Semantic Web Research Group 
University of Informatics Sciences 
http://gws-uci.blogspot.com/ 
Havana, Cuba 



