I'm facing a problem with Tika parsing.
In my use case I have to parse different file types, each with the right
parser, including .eml files.
My app can receive any kind of file as input. In particular I have a
MyEmail.eml file whose content type is detected as text/html. My goal is to
extract all of the file's available metadata.
With AutoDetectParser, MyEmail.eml is treated as text/html, which is not
good enough, so I have to use RFC822Parser explicitly in order to get the
Message-From and Message-To metadata.
For this purpose I have written these few lines of code:

       File f = new File("MyEmail.eml");
       InputStream is = new FileInputStream(f);

       Tika tika = new Tika();
       String mimeType = tika.detect(is);

       // Use RFC822Parser only for .eml files that were detected as text/html
       Parser parser;
       if (FileUtils.getExtension("MyEmail.eml").equalsIgnoreCase("eml")
               && mimeType.equalsIgnoreCase("text/html")) {
           parser = new RFC822Parser();
       } else {
           parser = new AutoDetectParser();
       }

       // ch and metadata are declared elsewhere (not shown)
       parser.parse(is, ch, metadata, new ParseContext());
       for (String item : metadata.names()) {
           System.out.println(item + " -- " + metadata.get(item));
       }

In this case the only metadata printed is Content-Type =
application/octet-stream.
If I comment out tika.detect(is), the output contains all the metadata
I need.
If I open a second input stream on the same file and use it only for
detection:

       InputStream is2 = new FileInputStream(f);
       Tika tika = new Tika();
       String mimeType = tika.detect(is2);

and still parse the original stream, the output also contains all the
metadata I need.
What does tika.detect(InputStream) do to the stream it is given?
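My current guess is that a plain FileInputStream does not support
mark/reset, so whatever detection reads from it is not given back and the
parser then starts in the middle of the file. Below is a minimal JDK-only
sketch of that effect. It is not Tika code: the peek helper is a hypothetical
stand-in for a detector, and the class name and sample content are made up
for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class MarkResetDemo {

    // Hypothetical stand-in for a detector (not Tika code): reads up to n
    // bytes and rewinds only if the stream supports mark/reset.
    static void peek(InputStream in, int n) throws IOException {
        boolean canReset = in.markSupported();
        if (canReset) {
            in.mark(n);
        }
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        if (canReset) {
            in.reset();
        }
    }

    // Returns what is still readable from a temp file after peek() has run.
    static String remainingAfterPeek(boolean buffered) {
        try {
            Path tmp = Files.createTempFile("demo", ".eml");
            try {
                Files.write(tmp,
                        "From: sender@example.com\n".getBytes(StandardCharsets.US_ASCII));
                InputStream in = new FileInputStream(tmp.toFile());
                if (buffered) {
                    // BufferedInputStream adds mark/reset support
                    in = new BufferedInputStream(in);
                }
                try (InputStream s = in) {
                    peek(s, 6);
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] chunk = new byte[256];
                    int r;
                    while ((r = s.read(chunk)) != -1) {
                        out.write(chunk, 0, r);
                    }
                    return new String(out.toByteArray(), StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("plain:    " + remainingAfterPeek(false));
        System.out.println("buffered: " + remainingAfterPeek(true));
    }
}
```

If this is what is going on, wrapping the FileInputStream in a
BufferedInputStream (or in a TikaInputStream via TikaInputStream.get(is))
before calling tika.detect might leave the stream intact for the parser, but
I have not confirmed that.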
Thanks a lot.



