Problem with detection of .mbox file

2016-07-25 Thread Vjeran Marcinko
Hello,

I first noticed that my .mbox file doesn't get parsed by MBoxParser,
and later, after debugging the Tika source code, I found the problem:
the default detector doesn't recognize it as the "application/mbox"
MIME type. Although the file extension is .mbox, the detector ignores
this hint because its "magic" detection, which looks at some number of
initial bytes, decides the file is "text/html" and returns that. As a
consequence, parsing never reaches the correct parser.

Is there some way I could override this magic detection and enforce
that detection in this case is based solely on file extension for
these .mbox files?

-Vjeran

Anyway, here is the beginning of my MBOX file, which I got by
exporting my Gmail emails from Google:


>From 1540828415824941917@xxx Mon Jul 25 12:08:06 + 2016
X-GM-THRID: 1540828415824941917
X-Gmail-Labels: Inbox,Important,clojure
Delivered-To: vmarci...@gmail.com
Received: by 10.31.56.17 with SMTP id f17csp1614203vka;
Mon, 25 Jul 2016 05:08:06 -0700 (PDT)
X-Received: by 10.202.95.133 with SMTP id t127mr8226795oib.80.1469448485990;
Mon, 25 Jul 2016 05:08:05 -0700 (PDT)
Return-Path: 
Received: from o1678940x148.outbound-mail.sendgrid.net
(o1678940x148.outbound-mail.sendgrid.net. [167.89.40.148])
by mx.google.com with ESMTPS id k58si11358370otb.279.2016.07.25.05.08.05
for 
(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
Mon, 25 Jul 2016 05:08:05 -0700 (PDT)
Received-SPF: pass (google.com: domain of
bounces+2693180-18a0-vmarcinko=gmail@m.dripemail2.com designates
167.89.40.148 as permitted sender) client-ip=167.89.40.148;
Authentication-Results: mx.google.com;
   dkim=pass header.i=@dripemail2.com;
   dkim=pass header.i=@sendgrid.info;
   spf=pass (google.com: domain of
bounces+2693180-18a0-vmarcinko=gmail@m.dripemail2.com designates
167.89.40.148 as permitted sender)
smtp.mailfrom=bounces+2693180-18a0-vmarcinko=gmail@m.dripemail2.com
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=dripemail2.com;
h=content-type:from:mime-version:subject:to; s=s1;
bh=wbY8sP/TelOpmU6q09dgY8v3muI=; b=Vo/m0Lx7f8jNAHU2m0vLO6StuGms/
XeJeiLBV4CHyhwMNr4UuuBIJmDVGIuv6YGSJPN9REUYVuCqFyaPOAZiBtlie8Awq
7uB7KxZKnFPDh/7XQRz1Z1kKx0dGiENBOoymZFglCebm9my2i+trZ6EzN4YFOB/+
ZNpksoRirEVhws=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sendgrid.info;
h=content-type:from:mime-version:subject:to:x-feedback-id;
s=smtpapi; bh=wbY8sP/TelOpmU6q09dgY8v3muI=; b=vnSfe24bbcPSeungct
GphBd1h4S4i96PxeapkjmxCLyzeItTItNETiCtkLFbGnzFTVYVvzDOmcI47BYFHu
yOM0kILRdMzFt1d7HNVE1EJCB0DHVS83Yk7vaH/jc+IU34jJgZBlG0yR292QYtYk
7WA4ETOIQnQ+3K3pJ+wUYNGKs=
Received: by filter0448p1mdw1.sendgrid.net with SMTP id
filter0448p1mdw1.23984.5796012246
2016-07-25 12:08:02.669274519 + UTC
Received: from MjY5MzE4MA (ec2-54-210-139-199.compute-1.amazonaws.com
[54.210.139.199])
by ismtpd0002p1iad1.sendgrid.net (SG) with HTTP id zyxIxF_lRFKgFZxIoq9BKA
for ; Mon, 25 Jul 2016 12:08:02.739 + (UTC)
Content-Type: multipart/alternative;
boundary=0082ce9e57fb837e9dfa9ca77bc69f450567ae3138b24a5db1e7237fc121
Date: Mon, 25 Jul 2016 12:08:02 +
From: "Eric at PurelyFunctional.tv" 
Mime-Version: 1.0
Subject: Twitter Bot, Atom Editor, and Scraping HTML
To: vmarci...@gmail.com
Message-ID: 
X-SG-EID: 
pywWA7gL46oOK7j8609IHsuM8bBS72IBx+uWB+d8D/N9t0rE4+TMmdgXQpvC7JIN3ekubbU2qCgHqS
 7W8GJ+aKX8qAKYokC5jzRvyv4CX3KHlasoMaqSUGqYEuHYx1e9vMNhqBIB4+nZN4uZmnKvRrvnYMZy
 NtpRNDKB0S28xjv5CxGmqbRggtf8RLQ7d2s5RIuQwIMIZQ3nLl3OrnmbjtZAP91VtQFkbhRATrKx7i
 o=
X-SG-ID: 
6l1ICXxVk1U2NQBE+KPgx+uy7/oBj9jrT6lO2L7BaL4cap+kBh3uUy+RmDmEF7s+mSBwxVfvlgfHyu
 osKIvS9Q==
X-Feedback-ID: 
2693180:l1fkQA9YLlZ4PTqywTL3Zu+zLq2XYmkeuiZ1WV+xvFE=:l1fkQA9YLlZ4PTqywTL3Zu+zLq2XYmkeuiZ1WV+xvFE=:SG

--0082ce9e57fb837e9dfa9ca77bc69f450567ae3138b24a5db1e7237fc121
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Mime-Version: 1.0

Dear Clojurist,

Thanks again for being there. I am so lucky to have you here on
my PurelyFunctional.tv email list.

A lot of people ask me what it takes to be hirable in Clojure. Of
course, the answer is complicated, but the short version is "not
very much". I wrote about it.

Read What do I have to learn to be hirable in Clojure? ( http://t.dripemail=
2.com/c/eyJhY2NvdW50X2lkIjoiMzY1MTcxNyIsImRlbGl2ZXJ5X2lkIjoiMjE3NTQ4MzEyIiw=
idXJsIjoiaHR0cDovL3d3dy5saXNwY2FzdC5jb20vaGlyYWJsZS1pbi1jbG9qdXJlP19fcz15bj=
R6dm8xcnY5cGhkazR4cG11diJ9 )
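
Note: a minimal sketch of the override asked about in this message,
which trusts the .mbox extension and otherwise falls back to the
content-based result. The class and method names are hypothetical and
are not part of Tika; in Tika the same rule would presumably sit in a
custom Detector implementation, which is not shown here.

```java
import java.util.Locale;

// Hypothetical helper, not a Tika class.
class ExtensionFirstDetection {

    /**
     * Returns "application/mbox" whenever the file name ends with .mbox,
     * ignoring whatever the content-based ("magic") detection reported.
     * For any other file name the magic-detected type is returned unchanged.
     */
    static String resolveType(String fileName, String magicDetectedType) {
        if (fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".mbox")) {
            return "application/mbox";
        }
        return magicDetectedType;
    }

    public static void main(String[] args) {
        // The case from this thread: magic says text/html, the name says mbox.
        System.out.println(resolveType("gmail-export.mbox", "text/html"));
    }
}
```

The trade-off is that a mislabeled file with a .mbox extension would
also be routed to the mbox parser.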


Re: Problem with detection of .mbox file

2016-07-25 Thread Nick Burch

On Mon, 25 Jul 2016, Vjeran Marcinko wrote:

> I first noticed that my .mbox file doesn't get parsed by MBoxParser,
> and later, after debugging Tika source code, I found what the problem
> is - default detector doesn't even recognize it as "application/mbox"
> MIME type, and although file extension is .mbox, it ignores this hint
> because its "magic" way of detecting file type based on some amount of
> initial bytes detects it is "text/html"


Can you try with a recent Tika nightly build? There have been some
tweaks around that sort of thing recently.


If a nightly build / build from Git still shows the issue, please open a 
bug in Jira and attach a problematic file, then we can take a look!


Nick


RE: Problem with detection of .mbox file

2016-07-25 Thread Allison, Timothy B.
> Can you try with a recent Tika nightly build?
e.g. 
https://builds.apache.org/job/Tika-trunk/lastBuild/org.apache.tika$tika-app/



Re: Problem with detection of .mbox file

2016-07-25 Thread Vjeran Marcinko
Thanks guys, I can do it in some clumsy way, but before I try it, is
there a Maven repo for such nightly builds that I can include so that
I can specify these 1.14-SNAPSHOT deps?



RE: Problem with detection of .mbox file

2016-07-25 Thread Allison, Timothy B.


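<repositories>
  <repository>
    <id>apache.snapshots</id>
    <name>Apache Development Snapshot Repository</name>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>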



Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread c . leitinger
Hi,

I am working in a project where Tika is getting used in a heavily
multi-threaded environment. Lately, there have been some issues where
character set detection in isolation gives plausible results, while
running it in parallel gives results that are way off.

The root cause has not yet been found, but within the team, there was
quite some finger-pointing towards Tika's thread-safety and lots of
FUD especially around org.apache.tika.parser.txt.CharsetDetector.

But it seems no one in our team reached out or cared to either file a
bug report or ask on the mailing list.

So just to get rid of the FUD: Is
org.apache.tika.parser.txt.CharsetDetector considered to be
thread-safe?
(Some bugs suggest that Tika cares about thread-safety, but I could
not find anything in the javadoc for CharsetDetector)

Thanks and Best regards,
Christian


P.S.: We're building a fresh, new CharsetDetector for each byte array
that should have the character set encoding detected. And only the
thread that created the CharsetDetector is using it.


P.P.S.: We're still using Tika 1.9.
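
Note: a minimal sketch of how the symptom described in this message
(plausible results in isolation, different results in parallel) could
be checked for. It computes a single-threaded baseline and then counts
how many inputs produce a different result when run on a thread pool.
The harness and the BOM-only stand-in detector are hypothetical; a real
check would plug the actual detection call in as the function argument.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical harness, not part of Tika.
class ParallelConsistencyCheck {

    /** Counts inputs whose concurrent result differs from the single-threaded baseline. */
    static <I, O> int countMismatches(List<I> inputs, Function<I, O> detect, int threads) {
        // Baseline: run every input once on the calling thread.
        List<O> baseline = new ArrayList<>();
        for (I input : inputs) {
            baseline.add(detect.apply(input));
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit the same inputs again so that they run concurrently.
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> detect.apply(input);
                futures.add(pool.submit(task));
            }
            int mismatches = 0;
            for (int i = 0; i < inputs.size(); i++) {
                if (!Objects.equals(baseline.get(i), futures.get(i).get())) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Stand-in for the real detector: guesses a charset from a byte-order mark only. */
    static String guessCharsetByBom(byte[] data) {
        if (data.length >= 3 && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB && data[2] == (byte) 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && data[0] == (byte) 0xFE && data[1] == (byte) 0xFF) {
            return "UTF-16BE";
        }
        if (data.length >= 2 && data[0] == (byte) 0xFF && data[1] == (byte) 0xFE) {
            return "UTF-16LE";
        }
        return "unknown";
    }
}
```

A stateless function like the stand-in should never report a mismatch;
a non-zero count for the real detector would point at shared mutable
state somewhere in the detection path.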


Re: Problem with detection of .mbox file

2016-07-25 Thread Luís Filipe Nassif
Hi,

Based on https://en.wikipedia.org/wiki/Mbox, you can add the following
entry in org/apache/tika/mime/custom-mimetypes.xml:








The priority must be greater than that of message/rfc822. It sometimes
returns false positives, but it detects mbox files without an
extension, which are very common.

Luis
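
Note: a minimal sketch of the kind of content check such a magic entry
relies on. Per the mbox format described on the Wikipedia page linked
above, each message starts with a separator line beginning with "From "
(the word From followed by a space), so the file itself starts with
those five bytes. The class below is hypothetical and is not Tika code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Tika class.
class MboxSniffer {

    // The mbox message separator starts with "From " (note the trailing
    // space), which distinguishes it from the "From:" header of a message.
    private static final byte[] FROM_PREFIX =
            "From ".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the "From " separator. */
    static boolean looksLikeMbox(byte[] head) {
        if (head == null || head.length < FROM_PREFIX.length) {
            return false;
        }
        for (int i = 0; i < FROM_PREFIX.length; i++) {
            if (head[i] != FROM_PREFIX[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Any plain-text file that happens to begin with "From " also passes this
check, which is the false-positive risk mentioned in the message.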



RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
Charset detection _should_ be thread safe.  If you can help us track down the 
problem (unit test?), we need to fix this.

Thank you for raising this.

Best,

 Tim



RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
With 1.13 and this code, I'm not able to see any problems with our handful of 
test files in our unit tests.  

Exactly what code are you using?  How are you doing detection?


@Test
public void testMultiThreadedEncodingDetection() throws Exception {
    Path testDocs = Paths.get(this.getClass().getResource("/test-documents").toURI());
    List<Path> paths = new ArrayList<>();
    Map<Path, String> encodings = new ConcurrentHashMap<>();
    // Record the encoding detected for each test file as the expected value.
    for (File file : testDocs.toFile().listFiles()) {
        if (file.getName().endsWith(".txt") || file.getName().endsWith(".html")) {
            String encoding = getEncoding(file.toPath());
            paths.add(file.toPath());
            encodings.put(file.toPath(), encoding);
        }
    }
    for (int i = 0; i < 100; i++) {
        // Note: Thread.run() executes on the calling thread, so these
        // iterations run one after another rather than in parallel.
        new Thread(new EncodingDetector(paths, encodings)).run();
    }
    assertTrue("success!", true);
}

private class EncodingDetector implements Runnable {
    private final List<Path> paths;
    private final Map<Path, String> encodings;
    private final Random r = new Random();

    private EncodingDetector(List<Path> paths, Map<Path, String> encodings) {
        this.paths = paths;
        this.encodings = encodings;
    }

    @Override
    public void run() {
        // Re-detect randomly chosen files and compare with the recorded value.
        for (int i = 0; i < 100; i++) {
            int pInd = r.nextInt(paths.size());
            String detectedEncoding = null;
            try {
                detectedEncoding = getEncoding(paths.get(pInd));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            String trueEncoding = encodings.get(paths.get(pInd));
            if (! detectedEncoding.equals(trueEncoding)) {
                throw new RuntimeException("detected: " + detectedEncoding +
                        " but should have been: " + trueEncoding);
            }
        }
    }
}

public String getEncoding(Path p) throws Exception {
    try (InputStream is = TikaInputStream.get(p)) {
        AutoDetectReader reader = new AutoDetectReader(is);
        String val = reader.getCharset().toString();
        if (val == null) {
            return "NULL";
        } else {
            return val;
        }
    }
}




RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.


Still couldn't find any problems with actual multithreaded code. :(

@Test
public void testMultiThreadingEncodingDetection() throws Exception {
    Path testDocs = Paths.get(this.getClass().getResource("/test-documents").toURI());
    List<Path> paths = new ArrayList<>();
    Map<Path, String> encodings = new ConcurrentHashMap<>();
    for (File file : testDocs.toFile().listFiles()) {
        if (file.getName().endsWith(".txt") || file.getName().endsWith(".html")) {
            System.out.println(file);
            String encoding = getEncoding(file.toPath());
            paths.add(file.toPath());
            encodings.put(file.toPath(), encoding);
        }
    }
    int numThreads = 100;
    ExecutorService ex = Executors.newFixedThreadPool(numThreads);
    CompletionService<String> completionService =
            new ExecutorCompletionService<>(ex);

    for (int i = 0; i < numThreads; i++) {
        completionService.submit(new EncodingDetector(paths, encodings), "done");
    }
    int completed = 0;
    while (completed < numThreads) {
        Future<String> future = completionService.take();
        if (future.isDone() && "done".equals(future.get())) {
            completed++;
        }
    }
    assertTrue("success!", true);
}

private class EncodingDetector implements Runnable {
    private final List<Path> paths;
    private final Map<Path, String> encodings;
    private final Random r = new Random();

    private EncodingDetector(List<Path> paths, Map<Path, String> encodings) {
        this.paths = paths;
        this.encodings = encodings;
    }

    @Override
    public void run() {
        for (int i = 0; i < 1000; i++) {
            int pInd = r.nextInt(paths.size());

            String detectedEncoding = null;
            try {
                detectedEncoding = getEncoding(paths.get(pInd));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            String trueEncoding = encodings.get(paths.get(pInd));
            if (! detectedEncoding.equals(trueEncoding)) {
                throw new RuntimeException("detected: " + detectedEncoding +
                        " but should have been: " + trueEncoding);
            }
        }
    }
}

public String getEncoding(Path p) throws Exception {
    try (InputStream is = TikaInputStream.get(p)) {
        AutoDetectReader reader = new AutoDetectReader(is);
        String val = reader.getCharset().toString();
        if (val == null) {
            return "NULL";
        } else {
            return val;
        }
    }
}




RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
Yes, we have a problem.  Thank you for raising this.

https://issues.apache.org/jira/browse/TIKA-2041




RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
Christian,
  If you could open an account on JIRA, it would be helpful for discussion on 
this issue.  Thank you, again.

Best,

Tim
   



Re: Problem with detection of .mbox file

2016-07-25 Thread Vjeran Marcinko
Thanks a bunch for the suggested workaround.

Also, I have checked, and the bug still exists in the latest 1.14
nightly build.

-Vjeran
