Hi Ioan Eugen, please find attached a patch.
It uses the following From_ line pattern:

    static final String DEFAULT = "^From \\S+.*\\d{4}$";

so that it matches more lines:

1. From ieu...@apache.org Fri Sep 09 14:04:52 2011
2. From MAILER-DAEMON Wed Oct 05 21:54:09 2011
3. From - Wed Apr 02 06:51:08 2014

An "@" sign is therefore no longer required.

The patch also fixes a typo that occurs in many places in the source code:

    - private Matcher fromLineMathcer;
    + private Matcher fromLineMatcher;

It adds a reference to the original mbox File so that the error message can be improved:

    + if (mbox!=null)
    +   path=mbox.getPath();
    + throw new IllegalArgumentException("File "+path+" does not contain From_ lines that match the pattern '"+MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");

Who is going to review this patch, and what needs to be done to get it into the official repo?

I would also like to add more test cases, and in particular include some dummy mboxes. As mentioned, I'd like to run the iterator against all my Thunderbird mboxes to check whether it parses them all successfully.

I am also offering to write a few "tutorial lines". Where would I have to put these?

Cheers
  Wolfgang

On 22.07.14 22:23, Ioan Eugen Stan wrote:
> Hello Wolfgang,
>
> I developed MboxIterator. It's nice to see that it's helpful :)
>
> You get that error because MboxIterator does not know how to split the
> messages. Messages in an mbox file are separated by lines that start
> with 'From '. They are called (by me at least) 'From lines' :) .
> One problem with the mbox format is that it's a bit 'free-form' in the
> sense that developers abused it and we have some variants [1].
>
> One thing that you could try is to supply a different From line
> regular expression to MboxIterator via the regexpPattern argument. It will
> split messages based on this new value.
>
> [1] http://wiki2.dovecot.org/MailboxFormat/mbox
>
> Good luck and please post your results.
>
> Regards,
>
> On Fri, Jul 18, 2014 at 12:53 PM, Wolfgang Fahl <w...@bitplan.com> wrote:
>> Dear mime4j developers,
>>
>> for one of my projects I have been using mime4j successfully to import
>> e-mail into our CRM database for some two years now.
>> Currently I am trying to add a feature which would allow reading Mozilla
>> Thunderbird Mailbox content.
>> As of mime4j 0.8 there seems to be a MboxIterator which could do that.
>> Since I didn't find any publicly available source repository which I
>> could use to access the 0.8-Snapshot I have copied
>> the three source files:
>> * CharBufferWrapper.java
>> * FromLinePatterns.java
>> * MboxIterator.java
>>
>> into my source tree and I am using these together with the following
>> maven dependency:
>>
>>     <!-- EMail handling -->
>>     <dependency>
>>       <groupId>org.apache.james</groupId>
>>       <artifactId>apache-mime4j-core</artifactId>
>>       <version>0.7.2</version>
>>     </dependency>
>>     <dependency>
>>       <groupId>org.apache.james</groupId>
>>       <artifactId>apache-mime4j-dom</artifactId>
>>       <version>0.7.2</version>
>>     </dependency>
>>
>> The iterator works somewhat o.k. on some of the Thunderbird mailbox
>> files and loops through the mails in them correctly.
>> The mails can then not be parsed directly with mime4j - there is one
>> newline at the beginning which spoils the show. After
>> working around this it's working as expected in some cases. In other
>> cases there is an error:
>>
>> java.lang.IllegalArgumentException: File does not contain From_ lines!
>> Maybe not be a vaild Mbox.
>>     at org.apache.james.mime4j.mboxiterator.MboxIterator.initMboxIterator(MboxIterator.java:85)
>>     at org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:75)
>>     at org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:62)
>>     at org.apache.james.mime4j.mboxiterator.MboxIterator$Builder.build(MboxIterator.java:241)
>>     at com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:386)
>>     at com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:261)
>>     at com.bitplan.clientutils.rest.TestMailAccess.testMailById(TestMailAccess.java:77)
>>
>> By the way - there is a typo in the above error message: "vaild" should
>> be "valid".
>>
>> The error is something I'd like to fix or work around.
>>
>> I have two big user accounts with several hundred mailbox files and some
>> 300,000 mails from the last 15 years which I'd like
>> to use as a test case against which to run the mime4j implementation.
>>
>> Would you please supply me with some pointers on where I can get the necessary
>> source code and how I could supply patches and
>> test cases for the project?
>>
>> Also it would be good to know whether others would be interested in the
>> Thunderbird Mailbox reading capability.
>>
>>
>> Cheers
>>   Wolfgang
>>
>> --
>>
>> BITPlan - smart solutions
>> Wolfgang Fahl
>> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
>> Tel. +49 2154 811-480, Fax +49 2154 811-481
>> Web: http://www.bitplan.de
>> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
>> Geschäftsführer: Wolfgang Fahl
>>
>
>

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
Geschäftsführer: Wolfgang Fahl
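
PS: The effect of relaxing the From_ line pattern can be checked in isolation. The following is only a minimal sketch, not part of the patch: the class name FromLinePatternCheck, the helper matches() and the address user@example.org are made up for illustration, and compiling with Pattern.MULTILINE is an assumption about how the iterator is configured. It also shows one possible side effect: a body line that starts with "From " and ends in four digits matches the relaxed pattern too, so an mbox whose writer does not escape such lines might be split in the wrong place.

```java
import java.util.regex.Pattern;

public class FromLinePatternCheck {
    // Pattern before the patch: the sender token must contain an "@".
    static final String OLD_DEFAULT = "^From \\S+@\\S.*\\d{4}$";
    // Pattern proposed in the patch: any non-blank sender token is accepted.
    static final String NEW_DEFAULT = "^From \\S+.*\\d{4}$";

    // Hypothetical helper. Assumption: the pattern is compiled with MULTILINE,
    // so that ^ and $ anchor at line boundaries inside a larger buffer.
    static boolean matches(String regex, String line) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "From user@example.org Fri Sep 09 14:04:52 2011", // made-up address
            "From MAILER-DAEMON Wed Oct 05 21:54:09 2011",
            "From - Wed Apr 02 06:51:08 2014",                // Thunderbird style
            "From now on we will meet in 2014"                // ordinary body text
        };
        for (String line : lines) {
            System.out.printf("old=%-5b new=%-5b %s%n",
                    matches(OLD_DEFAULT, line), matches(NEW_DEFAULT, line), line);
        }
    }
}
```

If the last case turns out to matter for real mailboxes, a stricter variant could still be passed in as a custom From line pattern instead of changing the default.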
diff --git a/src/main/java/org/apache/james/mime4j/mboxiterator/FromLinePatterns.java b/src/main/java/org/apache/james/mime4j/mboxiterator/FromLinePatterns.java
index 724077c..4a8cc49 100644
--- a/src/main/java/org/apache/james/mime4j/mboxiterator/FromLinePatterns.java
+++ b/src/main/java/org/apache/james/mime4j/mboxiterator/FromLinePatterns.java
@@ -28,11 +28,12 @@ public interface FromLinePatterns {
     /**
      * Match a line like: From ieu...@apache.org Fri Sep 09 14:04:52 2011
      */
-    static final String DEFAULT = "^From \\S+@\\S.*\\d{4}$";
-
+    // static final String DEFAULT = "^From \\S+@\\S.*\\d{4}$";
     /**
      * Other type of From_ line: From MAILER-DAEMON Wed Oct 05 21:54:09 2011
+     * Thunderbird mbox content: From - Wed Apr 02 06:51:08 2014
      */
+    static final String DEFAULT = "^From \\S+.*\\d{4}$";
 }
diff --git a/src/main/java/org/apache/james/mime4j/mboxiterator/MboxIterator.java b/src/main/java/org/apache/james/mime4j/mboxiterator/MboxIterator.java
index 4b94c63..59c018a 100644
--- a/src/main/java/org/apache/james/mime4j/mboxiterator/MboxIterator.java
+++ b/src/main/java/org/apache/james/mime4j/mboxiterator/MboxIterator.java
@@ -46,7 +46,7 @@ public class MboxIterator implements Iterable<CharBufferWrapper>, Closeable {
     private final FileInputStream theFile;
     private final CharBuffer mboxCharBuffer;
-    private Matcher fromLineMathcer;
+    private Matcher fromLineMatcher;
     private boolean fromLineFound;
     private final MappedByteBuffer byteBuffer;
     private final CharsetDecoder DECODER;
@@ -58,6 +58,7 @@ public class MboxIterator implements Iterable<CharBufferWrapper>, Closeable {
     private final Pattern MESSAGE_START;
     private int findStart = -1;
     private int findEnd = -1;
+    private final File mbox;
     private MboxIterator(final File mbox,
             final Charset charset,
@@ -70,19 +71,28 @@ public class MboxIterator implements Iterable<CharBufferWrapper>, Closeable {
         this.MESSAGE_START = Pattern.compile(regexpPattern, regexpFlags);
         this.DECODER = charset.newDecoder();
         this.mboxCharBuffer = CharBuffer.allocate(MAX_MESSAGE_SIZE);
+        this.mbox=mbox;
         this.theFile = new FileInputStream(mbox);
         this.byteBuffer = theFile.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, theFile.getChannel().size());
         initMboxIterator();
     }
-
-    private void initMboxIterator() throws IOException, CharConversionException {
+
+    /**
+     * initialize the Mailbox iterator
+     * @throws IOException
+     * @throws CharConversionException
+     */
+    protected void initMboxIterator() throws IOException, CharConversionException {
         decodeNextCharBuffer();
-        fromLineMathcer = MESSAGE_START.matcher(mboxCharBuffer);
-        fromLineFound = fromLineMathcer.find();
+        fromLineMatcher = MESSAGE_START.matcher(mboxCharBuffer);
+        fromLineFound = fromLineMatcher.find();
         if (fromLineFound) {
-            saveFindPositions(fromLineMathcer);
-        } else if (fromLineMathcer.hitEnd()) {
-            throw new IllegalArgumentException("File does not contain From_ lines! Maybe not be a vaild Mbox.");
+            saveFindPositions(fromLineMatcher);
+        } else if (fromLineMatcher.hitEnd()) {
+            String path="";
+            if (mbox!=null)
+                path=mbox.getPath();
+            throw new IllegalArgumentException("File "+path+" does not contain From_ lines that match the pattern '"+MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");
         }
     }
@@ -139,12 +149,12 @@ public class MboxIterator implements Iterable<CharBufferWrapper>, Closeable {
      */
     public CharBufferWrapper next() {
         final CharBuffer message;
-        fromLineFound = fromLineMathcer.find();
+        fromLineFound = fromLineMatcher.find();
         if (fromLineFound) {
             message = mboxCharBuffer.slice();
             message.position(findEnd + 1);
-            saveFindPositions(fromLineMathcer);
-            message.limit(fromLineMathcer.start());
+            saveFindPositions(fromLineMatcher);
+            message.limit(fromLineMatcher.start());
         } else {
             /* We didn't find other From_ lines this means either:
              * - we reached end of mbox and no more messages
@@ -163,17 +173,17 @@ public class MboxIterator implements Iterable<CharBufferWrapper>, Closeable {
                 } catch (CharConversionException ex) {
                     throw new RuntimeException(ex);
                 }
-                fromLineMathcer = MESSAGE_START.matcher(mboxCharBuffer);
-                fromLineFound = fromLineMathcer.find();
+                fromLineMatcher = MESSAGE_START.matcher(mboxCharBuffer);
+                fromLineFound = fromLineMatcher.find();
                 if (fromLineFound) {
-                    saveFindPositions(fromLineMathcer);
+                    saveFindPositions(fromLineMatcher);
                 }
                 message = mboxCharBuffer.slice();
-                message.position(fromLineMathcer.end() + 1);
-                fromLineFound = fromLineMathcer.find();
+                message.position(fromLineMatcher.end() + 1);
+                fromLineFound = fromLineMatcher.find();
                 if (fromLineFound) {
-                    saveFindPositions(fromLineMathcer);
-                    message.limit(fromLineMathcer.start());
+                    saveFindPositions(fromLineMatcher);
+                    message.limit(fromLineMatcher.start());
                 }
             } else {
                 message = mboxCharBuffer.slice();