[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1592.
-------------------------------------
Resolution: Invalid

Closing as Invalid. Feel free to create additional issues if you run into other 
problems with Tika!

Thank you for updating with the solution! I'm glad you found it. :) (I'm also 
glad this wasn't a Tika issue... Ha.)

> It seems dbus and x11 server are invoked, and fails for some reason too
> ---
>
> Key: TIKA-1592
> URL: https://issues.apache.org/jira/browse/TIKA-1592
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
> Environment: CentOs 6.6, Java 1.7
>Reporter: Michael Couck
>
> Exception running unit tests:
> GConf Error: Failed to contact configuration server; some possible causes are 
> that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
> locks due to a system crash. See http://projects.gnome.org/gconf/ for 
> information. (Details -  1: Not running within active session)
> Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
> tests, the logging is a gig for each run, and even a 64 core server is 100% 
> cpu during the failure. I am completely confounded. Any ideas?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Mattmann, Chris A (3980)
+1 this makes immense sense to me. Thanks Juls and Tim.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-----Original Message-----
From: "tallison314...@gmail.com" 
Reply-To: "dev@tika.apache.org" 
Date: Friday, April 3, 2015 at 5:35 AM
To: "d...@pdfbox.apache.org" , "dev@tika.apache.org"
, "d...@poi.apache.org" 
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
>  What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>CommonCrawl currently has the WET format that extracts plain text from
>web pages.  My guess is that this is text stripping from text-y formats.
>Let me know if I'm wrong!
>
>
>Would there be any interest in adding another format: WETT (WET-Tika) or
>supplementing the current WET by using Tika to extract contents from
>binary formats too: PDF, MSWord, etc.
>
>
>Julien Nioche kindly carved out 220 GB for us to experiment with on
>TIKA-1302  on a
>Rackspace vm.  But, I'm wondering now if it would make more sense to have
>CommonCrawl run Tika as part of its regular process and make the output
>available in one of your standard formats.
>
>
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies, PDFBox, POI, etc.) could get the stacktraces
>to help prioritize bug fixes.
>
>
>Cheers,
>
>
>  Tim 
>
>
>
>



Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Andreas Beeker
Hi,

similar to Dominik's approach of checking the file base for parsing errors,
I'd like to scan for certain file constellations, e.g. for the typical "left over 
bytes" error
or other record combinations which I can't reproduce with my MS Office/LibreOffice 
versions.

I haven't thought about how it's actually done, but I think logging the 
location in the
integration tests and later manually checking the corresponding files should be
sufficient.

Best wishes,
Andi



On 03.04.2015 17:51, Dominik Stadler wrote:
> Hi,
>
> I am very interested, as I have been following the Common Crawl activity
> for some time already. It sounds like a neat idea to do the check when
> the crawl is done. Are the binary documents already part of the
> crawl-data?
>
> ...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B.  
> wrote:
>> All,
>>
>> What do you think?
>>
>>
>>



Fwd: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread tallison314159
All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:

> CommonCrawl currently has the WET format that extracts plain text from web 
> pages.  My guess is that this is text stripping from text-y formats.  Let 
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or 
> supplementing the current WET by using Tika to extract contents from binary 
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on 
> TIKA-1302  on 
> a Rackspace vm.  But, I'm wondering now if it would make more sense to have 
> CommonCrawl run Tika as part of its regular process and make the output 
> available in one of your standard formats.  
>
> CommonCrawl consumers would get Tika output, and the Tika dev community 
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces 
> to help prioritize bug fixes.
>
> Cheers,
>
>   Tim 
>


Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Dominik Stadler
Hi,

I am very interested, as I have been following the Common Crawl activity
for some time already. It sounds like a neat idea to do the check when
the crawl is done. Are the binary documents already part of the
crawl-data?

Actually I am currently playing around with the Common Crawl URL Index
(http://blog.commoncrawl.org/2013/01/common-crawl-url-index/) which is
a much smaller sized download (230G) and only contains URLs without
all the additional information.

The index is a bit outdated and currently only covers half of the full
common crawl, however there are people working on refreshing it for
the latest crawls.

I wrote a small app which extracts interesting URLs out of these (i.e.
files that POI should be able to open), resulting in approx. 6.6
million links! Based on some tests, the full download would contain
around 3.3 million documents requiring approximately 3TB of
storage. Note that this is still an old crawl with only half of the
data included, so a current crawl will be considerably bigger!
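Dominik's extraction app isn't shown here; purely as an illustration (the extension list, input format, and function names below are assumptions, not his actual code), the URL-filtering step might look like this:

```python
# Hypothetical sketch: filter Common Crawl URL-index entries down to URLs
# that look like POI-supported Office documents. The extension list and
# one-URL-per-line input format are assumptions for illustration.
from urllib.parse import urlparse

POI_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}

def is_poi_candidate(url):
    """Return True if the URL path ends in an extension POI can likely open."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in POI_EXTENSIONS)

def filter_urls(lines):
    """Yield only the URLs worth downloading for POI integration tests."""
    for line in lines:
        url = line.strip()
        if url and is_poi_candidate(url):
            yield url

sample = [
    "http://example.com/report.docx",
    "http://example.com/index.html",
    "http://example.com/budget.XLS",
]
print(list(filter_urls(sample)))
# prints: ['http://example.com/report.docx', 'http://example.com/budget.XLS']
```

A real pass would of course also want content-type information from the index, since plenty of documents are served without a telltale extension.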

Running them through the integration testing that we added in POI
(which performs text and property extraction but also some other
POI-related actions) has already shown a few cases where slightly
off-spec documents can cause bugs to appear; some initial related
commits will follow shortly...

Dominik.

On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B.  wrote:
> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
> talliso...@gmail.com wrote:
> CommonCrawl currently has the WET format that extracts plain text from web 
> pages.  My guess is that this is text stripping from text-y formats.  Let me 
> know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or 
> supplementing the current WET by using Tika to extract contents from binary 
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on 
> TIKA-1302 on a Rackspace vm. 
>  But, I'm wondering now if it would make more sense to have CommonCrawl run 
> Tika as part of its regular process and make the output available in one of 
> your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community 
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
> help prioritize bug fixes.
>
> Cheers,
>
>   Tim
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> For additional commands, e-mail: dev-h...@poi.apache.org
>


Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Konstantin Gribov
Dominik,
I've downloaded one of the WARC files (from CC-MAIN-2015-01,
https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422115855094.38/warc/CC-MAIN-20150124161055-0-ip-10-180-212-252.ec2.internal.warc.gz,
1.2GB), and it contains at least PDFs and DOCs in the crawled data.
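As a hedged sketch of how one might verify this kind of observation: the snippet below does minimal WARC record framing over an uncompressed, synthetic byte stream and sniffs payloads for format magic bytes (PDF and the OLE2 container used by .doc/.xls/.ppt). It is illustrative only; real crawl files are gzipped, and a production scan would use a proper WARC library rather than this hand-rolled parser.

```python
# Minimal, illustrative WARC scan: count payloads that start with known
# magic bytes. Assumes an already-decompressed stream and does only the
# bare minimum of WARC record framing (version line, headers, payload).
import io

MAGIC = {
    b"%PDF-": "pdf",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # .doc/.xls/.ppt container
}

def sniff(payload):
    """Return a format name if a known magic appears near the start."""
    for magic, name in MAGIC.items():
        # HTTP headers precede the body in response records, so search
        # a generous prefix rather than only offset 0.
        if magic in payload[:2048]:
            return name
    return None

def iter_warc_records(stream):
    """Yield (headers, payload) per record; simplified, assumes CRLF
    line endings and an exact Content-Length as the WARC spec requires."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.strip():              # skip inter-record blank lines
            continue
        assert line.startswith(b"WARC/")  # version line, e.g. WARC/1.0
        headers = {}
        while True:
            h = stream.readline().rstrip(b"\r\n")
            if not h:
                break
            key, _, value = h.partition(b":")
            headers[key.strip().lower()] = value.strip()
        payload = stream.read(int(headers[b"content-length"]))
        yield headers, payload

# Tiny synthetic WARC stream with one PDF-like response record.
body = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4 fake"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)
found = [sniff(p) for _, p in iter_warc_records(io.BytesIO(record))]
print(found)
# prints: ['pdf']
```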

-- 
Best regards,
Konstantin Gribov

пт, 3 апр. 2015 г. в 18:52, Dominik Stadler :

Hi,
>
> I am very interested, as I have been following the Common Crawl activity
> for some time already. It sounds like a neat idea to do the check when
> the crawl is done. Are the binary documents already part of the
> crawl-data?
>
> Actually I am currently playing around with the Common Crawl URL Index
> (http://blog.commoncrawl.org/2013/01/common-crawl-url-index/) which is
> a much smaller sized download (230G) and only contains URLs without
> all the additional information.
>
> The index is a bit outdated and currently only covers half of the full
> common crawl, however there are people working on refreshing it for
> the latest crawls.
>
> I wrote a small app which extracts interesting URLs out of these (i.e.
> files that POI should be able to open), resulting in approx. 6.6
> million links! Based on some tests, the full download would contain
> around 3.3 million documents requiring approximately 3TB of
> storage. Note that this is still an old crawl with only half of the
> data included, so a current crawl will be considerably bigger!
>
> Running them through the integration testing that we added in POI
> (which performs text and property extraction but also some other
> POI-related actions) has already shown a few cases where slightly
> off-spec documents can cause bugs to appear; some initial related
> commits will follow shortly...
>
> Dominik.
>
> On Fri, Apr 3, 2015 at 4:28 PM, Allison, Timothy B. 
> wrote:
> > All,
> >
> > What do you think?
> >
> >
> > https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
> >
> >
> > On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com
>  wrote:
> > CommonCrawl currently has the WET format that extracts plain text from
> web pages.  My guess is that this is text stripping from text-y formats.
> Let me know if I'm wrong!
> >
> > Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
> >
> > Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
> >
> > CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
> >
> > Cheers,
> >
> >   Tim
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> > For additional commands, e-mail: dev-h...@poi.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> For additional commands, e-mail: dev-h...@poi.apache.org
>
>


[jira] [Commented] (TIKA-1593) Doco: Broken link to "Parser Quick Start Guide"

2015-04-03 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394555#comment-14394555
 ] 

Konstantin Gribov commented on TIKA-1593:
-------------------------------------

Thank you, Dan.

It seems it should be something like https://tika.apache.org/1.7/parser_guide.html

> Doco: Broken link to "Parser Quick Start Guide"
> ---
>
> Key: TIKA-1593
> URL: https://issues.apache.org/jira/browse/TIKA-1593
> Project: Tika
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.7
>Reporter: Dan Rollo
>Priority: Minor
>
> The Tika web page: https://tika.apache.org/contribute.html, under the 
> Section: "New Parsers, Detectors and Mime Types", there is a link with the 
> text: "Parser Quick Start Guide". The link URL is: 
> https://tika.apache.org/parser_guide.apt, and does not work. 
> The ".apt" extension seems odd. I don't know what the link should be.





Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Konstantin Gribov
Tim,
this seems interesting, because it provides a big test dataset.
As I see it, they store PDFs/DOCs in WARC files, so there's source data for
parsing.

-- 
Best regards,
Konstantin Gribov

пт, 3 апр. 2015 г. в 17:29, Allison, Timothy B. :

> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages.  My guess is that this is text stripping from text-y formats.  Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
>   Tim
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> For additional commands, e-mail: dev-h...@poi.apache.org
>
>


Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Oleg Tikhonov
Hi Tim,
Having looked at CC, a couple of ideas crossed my mind. I think it's cool.
+1.

BR,
Oleg
On 3 Apr 2015 17:29, "Allison, Timothy B."  wrote:

> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages.  My guess is that this is text stripping from text-y formats.  Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 on a Rackspace
> vm.  But, I'm wondering now if it would make more sense to have CommonCrawl
> run Tika as part of its regular process and make the output available in
> one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
>   Tim
>


[jira] [Created] (TIKA-1593) Doco: Broken link to "Parser Quick Start Guide"

2015-04-03 Thread Dan Rollo (JIRA)
Dan Rollo created TIKA-1593:
---

 Summary: Doco: Broken link to "Parser Quick Start Guide"
 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Priority: Minor


The Tika web page: https://tika.apache.org/contribute.html, under the Section: 
"New Parsers, Detectors and Mime Types", there is a link with the text: "Parser 
Quick Start Guide". The link URL is: https://tika.apache.org/parser_guide.apt, 
and does not work. 

The ".apt" extension seems odd. I don't know what the link should be.





FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Allison, Timothy B.
All,

What do you think?


https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0


On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web 
pages.  My guess is that this is text stripping from text-y formats.  Let me 
know if I'm wrong!

Would there be any interest in adding another format: WETT (WET-Tika) or 
supplementing the current WET by using Tika to extract contents from binary 
formats too: PDF, MSWord, etc.

Julien Nioche kindly carved out 220 GB for us to experiment with on 
TIKA-1302 on a Rackspace vm.  
But, I'm wondering now if it would make more sense to have CommonCrawl run Tika 
as part of its regular process and make the output available in one of your 
standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
help prioritize bug fixes.

Cheers,

  Tim
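The proposal boils down to mapping a parser over crawl documents and keeping both outcomes: extracted text for CommonCrawl consumers, stacktraces for the Tika/PDFBox/POI communities to mine. A minimal sketch of that split, using a stand-in parse function rather than the real Tika API (the record shape here is an assumption, not a CommonCrawl format):

```python
# Illustrative only: what a "WETT" pass over crawl documents might emit.
# `parse` is a stand-in for a real Tika invocation; the (url, text) /
# (url, stacktrace) record shapes are assumptions for this sketch.
import traceback

def run_extraction(docs, parse):
    """Map parse() over (url, payload) pairs, keeping failures as stacktraces.

    Successes would feed consumers; the stacktraces are what the parser
    communities would use to prioritize bug fixes.
    """
    extracted, failures = [], []
    for url, payload in docs:
        try:
            extracted.append({"url": url, "text": parse(payload)})
        except Exception:
            failures.append({"url": url, "stacktrace": traceback.format_exc()})
    return extracted, failures

def fake_parse(payload):
    """Toy parser: 'succeeds' on PDF-looking bytes, fails on anything else."""
    if payload.startswith(b"%PDF-"):
        return "text from pdf"
    raise ValueError("unsupported format")

docs = [("http://a.example/ok.pdf", b"%PDF-1.4 ..."),
        ("http://a.example/bad.bin", b"\x00\x01")]
ok, bad = run_extraction(docs, fake_parse)
print(len(ok), len(bad))
# prints: 1 1
```

The interesting design question the thread raises is who runs this loop: a one-off job over a carved-out slice, or CommonCrawl's own pipeline emitting it as a standard derived format.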


RE: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Allison, Timothy B.
Sorry, link wasn’t included:

https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

From: tallison314...@gmail.com [mailto:tallison314...@gmail.com]
Sent: Friday, April 03, 2015 8:35 AM
To: d...@pdfbox.apache.org; dev@tika.apache.org; d...@poi.apache.org
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web 
pages.  My guess is that this is text stripping from text-y formats.  Let me 
know if I'm wrong!

Would there be any interest in adding another format: WETT (WET-Tika) or 
supplementing the current WET by using Tika to extract contents from binary 
formats too: PDF, MSWord, etc.

Julien Nioche kindly carved out 220 GB for us to experiment with on 
TIKA-1302 on a Rackspace vm.  
But, I'm wondering now if it would make more sense to have CommonCrawl run Tika 
as part of its regular process and make the output available in one of your 
standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
help prioritize bug fixes.

Cheers,

  Tim


[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Michael Couck (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394356#comment-14394356
 ] 

Michael Couck commented on TIKA-1592:
-------------------------------------

Just for completeness: I upgraded/updated the OS, which was perhaps a confounding 
event. As it turns out, a small change in the dbus/display/GConf/X11 combination 
is required; put this in /etc/profile:

eval $(dbus-launch --sh-syntax)
export DBUS_SESSION_BUS_ADDRESS
export DBUS_SESSION_BUS_PID

A little cryptic perhaps? Well, there you have it: several days to get to that. 
I hope no one else falls into the same trap.

Cheers,
Michael

> It seems dbus and x11 server are invoked, and fails for some reason too
> ---
>
> Key: TIKA-1592
> URL: https://issues.apache.org/jira/browse/TIKA-1592
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
> Environment: CentOs 6.6, Java 1.7
>Reporter: Michael Couck
>
> Exception running unit tests:
> GConf Error: Failed to contact configuration server; some possible causes are 
> that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
> locks due to a system crash. See http://projects.gnome.org/gconf/ for 
> information. (Details -  1: Not running within active session)
> Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
> tests, the logging is a gig for each run, and even a 64 core server is 100% 
> cpu during the failure. I am completely confounded. Any ideas?


