On Thu, Jul 15, 2010 at 6:43 AM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> The way I recommend is to pass a custom Parser implementation through
> the ParseContext. This gives you detailed access to each component
> document.
>

I looked at the code a little further, and I don't see exactly how I can do this. I am using an AutoDetectParser, and in my ParseContext I've placed another AutoDetectParser. At the top level I might be parsing a tar.gz, and inside this tar.gz there are text, PDF, and zip files.

As far as I can tell, when I start to parse files embedded in one of the containers (tar.gz or zip), it is actually PackageExtractor that gets the parser from the ParseContext. It is also PackageExtractor that creates a new Metadata object, which it doesn't share, so I have no way to look at the metadata.

Does this mean that, to get access to the metadata for subdocuments, I would need to do the following?

* Create replacements for PackageParser and PackageExtractor that do what I want with the metadata.
* Use getParsers and setParsers on the AutoDetectParser, and replace the parser for each of the following MediaTypes:

    MediaType.application("x-archive")
    MediaType.application("x-bzip")
    MediaType.application("x-bzip2")
    MediaType.application("x-cpio")
    MediaType.application("x-gtar")
    MediaType.application("x-gzip")
    MediaType.application("x-tar")
    MediaType.application("zip")

I wonder if it would be easier to update PackageExtractor to check whether there is a metadata stack in the ParseContext and, if so, push the new Metadata object just before parsing a subdocument and pop it just after the parse (maybe just after writing the end of the <div> section).

Paul
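
To make the metadata stack idea concrete, here is a minimal, self-contained sketch in plain Java. It does not use any Tika classes: MetadataStack, ExtractorSketch, and MetadataStackDemo are hypothetical names, a Map<String, String> stands in for Tika's Metadata, and the loop only imitates what an extractor might do around each embedded entry. In a real change the stack would presumably be looked up from the ParseContext, which is an assumption here, not existing behaviour.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stack the caller would supply; not part of Tika. */
class MetadataStack {
    private final Deque<Map<String, String>> open = new ArrayDeque<>();
    private final List<Map<String, String>> finished = new ArrayList<>();

    /** Called just before a subdocument is parsed. */
    void push(Map<String, String> metadata) {
        open.push(metadata);
    }

    /** Called just after the subdocument is parsed; keeps it for the caller. */
    Map<String, String> pop() {
        Map<String, String> metadata = open.pop();
        finished.add(metadata);
        return metadata;
    }

    /** Number of subdocuments currently being parsed (nesting depth). */
    int depth() {
        return open.size();
    }

    /** Metadata of every subdocument parsed so far, in completion order. */
    List<Map<String, String>> finished() {
        return finished;
    }
}

/** Stand-in for the extractor loop over the entries of an archive. */
class ExtractorSketch {
    static void extract(List<String> entryNames, MetadataStack stack) {
        for (String name : entryNames) {
            // The extractor creates a fresh metadata object per entry.
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("resourceName", name);

            // Only share it when the caller asked for it by supplying a stack.
            if (stack != null) {
                stack.push(metadata);
            }
            try {
                // Placeholder for delegating to the parser of the embedded entry.
                metadata.put("parsed", "true");
            } finally {
                if (stack != null) {
                    stack.pop();
                }
            }
        }
    }
}

class MetadataStackDemo {
    public static void main(String[] args) {
        MetadataStack stack = new MetadataStack();
        ExtractorSketch.extract(Arrays.asList("notes.txt", "report.pdf", "inner.zip"), stack);
        for (Map<String, String> metadata : stack.finished()) {
            System.out.println(metadata);
        }
    }
}
```

The try/finally keeps the stack balanced even if parsing an entry throws, and callers that supply no stack see no change in behaviour.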