Re: [HippoCMS-dev] how to extract PDF text into property?

Enrico Cervato Tue, 06 Apr 2010 06:47:59 -0700

Hi Ard,

Thank you very much for your answer.


Just to be 100% clear about the situation: my understanding is
that there is currently no real PDF extractor for Hippo CMS 6, so
if I want to implement what I need, I will have to write the extractor
myself. Is that correct?

Thank you,
-- 
Enrico Cervato - 0031 (0)615293346
Open Source Software Engineer
Sourcesense - making sense of Open Source: http://www.sourcesense.com




On Thu, Apr 1, 2010 at 9:41 PM, Ard Schrijvers
<[email protected]> wrote:
> Hello Enrico,
>
> I think you should try to hook in the PDF extractor just like the
> property extractors. The problem with the normal PDF extractor (and
> also the XML extractor) is that they run *after* the save, whereas
> extractors that set properties run before the save. It is unfortunate
> that both are called extractors.
>
> Anyway, I think you should dive into the property extractors and see
> if you can do something similar for PDFs. Unfortunately, it has been
> too long since I worked on this for me to remember the details.
>
> Regards Ard
>
> On Thu, Apr 1, 2010 at 4:05 PM, Enrico Cervato
> <[email protected]> wrote:
>> Hi everybody,
>>
>> When performing a DASL query I also retrieve some PDFs from my
>> binaries folder. I would like to display an excerpt of the text in
>> those PDFs on screen as well. Is that possible?
>>
>> In the extractors.xml of my repository I have already configured the
>> PDFExtractor. From reading [1], my understanding is that it is not a
>> real extractor but more of an indexer: it will index the text contained
>> in the PDF but it will not extract it as a property.
>>
>> From [2] it seems that it is not possible to extract the text from PDFs.
>>
>> I think it should be possible to do it somehow ... can you give me
>> some suggestions?
>> Thank you very much for your attention!
>>
>> [1] http://old.nabble.com/Help-with-PDFExtractor-td26808675.html
>> [2] http://old.nabble.com/Show-content-of-a-pdf-document-td18647758.html
>>
>> --
>> Enrico Cervato - 0031 (0)615293346
>> Open Source Software Engineer
>> Sourcesense - making sense of Open Source: http://www.sourcesense.com
>> ********************************************
>> Hippocms-dev: Hippo CMS 6 development public mailinglist
>>
>> Searchable archives can be found at:
>> MarkMail: http://hippocms-dev.markmail.org
>> Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
>>
>>