> On Jun 5, 2017, at 10:43am, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> Jim,
> Thank you, again, for reaching out to us. Now that we have a user who
> actually cares about macros, I have some follow-up questions: we aren’t
> treating js in html as a macro…should we try to do that? Are there other
> macro-like bits of code that we should be extracting?
Oddly enough, this just came up for me a few days ago. I was going to use a custom mapper and content handler to extract the <script> data, but having built-in support that treats them as macros would be better. So yes, please :)

How would you handle the src=xxx attribute? Ultimately I plan to treat these like an import statement in a regular source code file.

Regards,

-- Ken

> From: Jim Idle [mailto:ji...@proofpoint.com]
> Sent: Sunday, June 4, 2017 4:07 AM
> To: user@tika.apache.org
> Subject: RE: Extracting macros in 1.15
>
> Direct Java calls and "I am using the AutoDetectParser at the moment."
>
> I found an online example buried in a test for another package, so I have worked
> out how to do it now, but it seems that if I have many different document
> types to support I will have to configure each parser separately. So be it,
> but it seems like there is a case for a subset of options that may apply to
> all, such as "extract anything that qualifies as a 'macro'", that all parsers
> would obey if they have not been told anything specifically.
>
> It is my opinion (for what it's worth 😉) that all parsers should extract
> everything they can unless told otherwise, but it is what it is I guess and I
> am pleased to have TIKA as an aid in analyzing all the myriad document types.
>
> Jim
>
> pc = new ParseContext();
> parser = new AutoDetectParser();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setExtractMacros(true);
> pc.set(OfficeParserConfig.class, officeParserConfig);
>
> > -----Original Message-----
> > From: Nick Burch [mailto:apa...@gagravarr.org]
> > Sent: Saturday, June 3, 2017 16:36
> > To: user@tika.apache.org
> > Subject: Re: Extracting macros in 1.15
> >
> > On Sat, 3 Jun 2017, Jim Idle wrote:
> > > After being baffled why macros no longer show up in 1.15 I found:
> > > https://issues.apache.org/jira/browse/TIKA-2302
> > >
> > > Can anyone point me to an example of doing this? I am finding bits and
> > > pieces but no example of turning macros back on. I basically want all
> > > macros in all documents, office, pdf, anything really.
> >
> > How do you call Apache Tika? Tika App? Tika Server? Tika java class facade?
> > Direct Java calls to TikaConfig / AutoDetectParser etc?
> >
> > The solution will differ depending on which one you use
> >
> > Nick

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
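
On the <script> and src=xxx question above, here is a rough sketch of the "treat src like an import" idea. It is plain Java with no Tika calls, and it is not how Tika does or would handle this. The class name ScriptExtractor and its fields are made up for illustration, and the regex matching is a simplification that will miss cases a real HTML parser handles (for example a ">" inside an attribute value).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: splits <script> elements into
// inline code ("macros") and src references ("imports").
final class ScriptExtractor {

    // group 1 = attribute text of the opening tag, group 2 = element body
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b([^>]*)>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // src="x", src='x' or src=x; the lookbehind skips names like data-src
    private static final Pattern SRC = Pattern.compile(
            "(?<![\\w-])src\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    final List<String> inlineScripts = new ArrayList<>();
    final List<String> imports = new ArrayList<>();

    static ScriptExtractor parse(String html) {
        ScriptExtractor result = new ScriptExtractor();
        Matcher script = SCRIPT.matcher(html);
        while (script.find()) {
            Matcher src = SRC.matcher(script.group(1));
            if (src.find()) {
                // Exactly one of the three alternatives captured the value.
                String value = src.group(1) != null ? src.group(1)
                        : src.group(2) != null ? src.group(2)
                        : src.group(3);
                result.imports.add(value);
            } else {
                String body = script.group(2).trim();
                if (!body.isEmpty()) {
                    result.inlineScripts.add(body);
                }
            }
        }
        return result;
    }
}
```

Calling ScriptExtractor.parse(html) fills imports with the src values in document order and inlineScripts with the trimmed bodies of scripts that have no src. A script element with a src attribute is recorded only as an import, since browsers ignore the body in that case.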