Re: [HippoCMS-dev] how to extract PDF text into property?

Ard Schrijvers Tue, 06 Apr 2010 07:38:41 -0700

There is a PDF extractor, but it only runs during indexing; it does not
store the extracted text as a property, which is what you seem to want.
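
As a rough illustration of the property side of this (plain Java, not Hippo CMS API): assuming the PDF text has already been pulled out by some PDF library, a hypothetical helper like the one below could trim it into a short excerpt suitable for storing as a property. The class name PdfExcerpt and its method are made up for this sketch; the actual PDF parsing and the hook into the repository are not shown.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: turns raw extracted text into a short excerpt. */
final class PdfExcerpt {
    // Matches any run of whitespace (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private PdfExcerpt() {}

    /**
     * Collapses whitespace and cuts the text to at most maxChars characters,
     * preferring a word boundary. "..." is appended when the text was cut,
     * so the result can be up to three characters longer than maxChars.
     */
    static String of(String extractedText, int maxChars) {
        if (extractedText == null || maxChars <= 0) {
            return "";
        }
        String text = WHITESPACE.matcher(extractedText).replaceAll(" ").trim();
        if (text.length() <= maxChars) {
            return text;
        }
        // Prefer cutting at the last space that fits, so words are not split.
        int cut = text.lastIndexOf(' ', maxChars);
        if (cut <= 0) {
            cut = maxChars; // no usable space: fall back to a hard cut
        }
        return text.substring(0, cut) + "...";
    }
}
```

The returned string would then be set as the property value at whatever point the property extractors run, which is before the save according to this thread.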


Regards Ard

On Tue, Apr 6, 2010 at 3:47 PM, Enrico Cervato
<[email protected]> wrote:
> Hi Ard,
>
> Thank you very much for your answer.
>
> Just to be 100% clear about the situation: my understanding is that
> there is no (real) PDF extractor for Hippo CMS 6 at the moment. So if
> I want to implement what I need, I would have to write the extractor
> myself. Is that correct?
>
> Thank you,
> --
> Enrico Cervato - 0031 (0)615293346
> Open Source Software Engineer
> Sourcesense - making sense of Open Source: http://www.sourcesense.com
>
>
>
>
> On Thu, Apr 1, 2010 at 9:41 PM, Ard Schrijvers
> <[email protected]> wrote:
>> Hello Enrico,
>>
>> I think you should try to hook in the PDF extractor in the same way as
>> the property extractors. The problem with the normal PDF extractor (and
>> also the XML extractor) is that it runs *after* the save, whereas the
>> extractors that set properties run before the save. It is unfortunate
>> that both are called extractors.
>>
>> Anyway, I think you should dive into the property extractors and see
>> if you can do something similar for PDFs. Unfortunately, it has been
>> too long since I worked on this for me to remember the details.
>>
>> Regards Ard
>>
>> On Thu, Apr 1, 2010 at 4:05 PM, Enrico Cervato
>> <[email protected]> wrote:
>>> Hi everybody,
>>>
>>> When performing a DASL query, I also retrieve some PDFs from my
>>> binaries folder. I would also like to display an excerpt of the text
>>> in those PDFs. Is that possible?
>>>
>>> I have already configured the PDFExtractor in the extractors.xml of my
>>> repository. From reading [1], my understanding is that it is not a real
>>> extractor but more of an indexer: it will index the text contained in
>>> the PDF, but it will not extract it as a property.
>>>
>>> From [2] it seems that it is not possible to extract the text from PDFs.
>>>
>>> I think it should be possible to do it somehow ... can you give me
>>> some suggestions?
>>> Thank you very much for your attention!
>>>
>>> [1] http://old.nabble.com/Help-with-PDFExtractor-td26808675.html
>>> [2] http://old.nabble.com/Show-content-of-a-pdf-document-td18647758.html
>>>
>>> --
>>> Enrico Cervato - 0031 (0)615293346
>>> Open Source Software Engineer
>>> Sourcesense - making sense of Open Source: http://www.sourcesense.com
>>> ********************************************
>>> Hippocms-dev: Hippo CMS 6 development public mailinglist
>>>
>>> Searchable archives can be found at:
>>> MarkMail: http://hippocms-dev.markmail.org
>>> Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
>>>
>>>