Re: [HippoCMS-dev] Help with PDFExtractor

Maurizio Pillitu Wed, 16 Dec 2009 03:17:50 -0800

Got it to work!

There were some restrictions in the DASL query that were preventing the PDF
result from being returned.


Thanks a lot!

mau

On Wed, Dec 16, 2009 at 11:26 AM, Jasha Joachimsthal <
[email protected]> wrote:

> 2009/12/16 Maurizio Pillitu <[email protected]>:
> > Thanks guys,
> > I know I was missing some bits of the big picture :)
> >
> > So here's the next question: when I perform a DASL query, I normally
> > *select* some properties *from* some repository location (path) *where* a
> > certain property matches one or more conditions; if I don't have a
> > property to match, how can I define the *where* condition?
> >
> > sounds like a very stupid question .... sorry for that.
>
> There are no stupid questions!
> For fulltext search, you can do <d:contains>mySearchWord</d:contains>
> If you really need properties, you can let the user set them in the
> assets perspective. See [1]
>
> [1]
> http://wiki.onehippo.com/display/CMS/WebDAV+properties+used+by+Hippo+CMS
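>
> A minimal sketch of what such a fulltext query could look like as the
> body of a WebDAV SEARCH request, assuming the standard DAV: basicsearch
> grammar. The scope path is the binaries folder from your extractor
> configuration, and the selected properties are only illustrative;
> how the href is resolved depends on your setup, so treat both as
> assumptions:
>
> ```xml
> <?xml version="1.0" encoding="UTF-8"?>
> <d:searchrequest xmlns:d="DAV:">
>   <d:basicsearch>
>     <!-- select: properties to return for each hit (illustrative) -->
>     <d:select>
>       <d:prop>
>         <d:displayname/>
>         <d:getcontenttype/>
>       </d:prop>
>     </d:select>
>     <!-- from: where to search (assumed path, taken from this thread) -->
>     <d:from>
>       <d:scope>
>         <d:href>/files/default.preview/binaries</d:href>
>         <d:depth>infinity</d:depth>
>       </d:scope>
>     </d:from>
>     <!-- where: match on indexed content instead of on a property -->
>     <d:where>
>       <d:contains>mySearchWord</d:contains>
>     </d:where>
>   </d:basicsearch>
> </d:searchrequest>
> ```
>
> If the PDF still does not show up, check for extra conditions in the
> where clause (for example on a property the PDF does not have), since
> those would filter it out even when the content matches.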
>
> > Thx again
> >
> > mau
> >
> > On Wed, Dec 16, 2009 at 11:01 AM, Jeroen Reijn <[email protected]> wrote:
> >
> >> Hi Maurizio,
> >>
> >> as far as I know, the PDF extractor as you have configured it now
> >> extracts all content to the Lucene index only and makes sure that the
> >> text can be found and mapped to the PDF document. I don't think Slide
> >> has a repository extractor that can extract the information and store
> >> it as a property.
> >>
> >> Regards,
> >>
> >> Jeroen
> >>
> >> Maurizio Pillitu wrote:
> >>
> >>> Hi everyone,
> >>> I'm trying to use the PDFExtractor (using Hippo Repository 1.2.15);
> >>> I've added to my (default) extractors.xml the following:
> >>>
> >>> ....
> >>> <extractor classname="org.apache.slide.extractor.PDFExtractor"
> >>> uri="/files/default.preview/binaries" content-type="application/pdf"/>
> >>> .....
> >>>
> >>> then I dropped a Google Docs generated PDF file (attached) in
> >>> /files/default.preview/binaries (via WebDAV); I see the repository
> >>> logging some interesting bits (attached) as if the extraction process
> >>> went fine, but I can't see the extracted data; I'd have expected a
> >>> WebDAV property attached to the file, but nothing shows up; this is
> >>> the list of properties related to the PDF file (using DAVExplorer)
> >>>
> >>> getlastmodified DAV: Wed, 16 Dec 2009 09:38:35 GMT
> >>> displayname DAV: this_is_my_title.pdf
> >>> modificationdate DAV: 2009-12-16T09:38:35Z
> >>> UID DAV: 96da71317f000001004b0bbb796bcb32
> >>> supportedlock DAV:
> >>> getcontenttype DAV: application/pdf
> >>> getcontentlength DAV: 5078
> >>> resourcetype DAV:
> >>> getcontentlanguage DAV: en
> >>> getetag DAV: ada3fdca64b1fd70a3d7b2ed66b3e68b
> >>> lockdiscovery DAV:
> >>> source DAV:
> >>> creationdate DAV: 2009-12-16T09:38:35Z
> >>>
> >>>
> >>> I feel like I'm missing something on how the PDFExtractor works; I've
> >>> looked for some documentation or specific configurations, but I
> >>> couldn't find anything interesting.
> >>>
> >>> Any hints?
> >>> TIA
> >>>  mau
> >>>
> >>> Kind regards,
> >>>
> >>> ********************************************
> >>> Hippocms-dev: Hippo CMS development public mailinglist
> >>>
> >>> Searchable archives can be found at:
> >>> MarkMail: http://hippocms-dev.markmail.org
> >>> Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
> >>>
> >>>  ********************************************
> >
> >
> > --
> >
> > Kind regards,
> > --
> > Maurizio Pillitu - 0031 (0)615655668
> > Opensource Software Engineer
> > Scrum Certified Master - http://www.scrumalliance.org
> > Sourcesense - making sense of Open Source: http://www.sourcesense.com


-- 

Kind regards,
-- 
Maurizio Pillitu - 0031 (0)615655668
Opensource Software Engineer
Scrum Certified Master - http://www.scrumalliance.org
Sourcesense - making sense of Open Source: http://www.sourcesense.com