Re: Using Tika that comes with Solr 5.2

2016-02-05 Thread Nick Burch

On Fri, 5 Feb 2016, Steven White wrote:

I went over to Tika's home page and tried to figure out which JARs I
need (so I don't have to use Tika's JARs that come with Solr).  I looked
around and couldn't find a "dist" of the JARs.


There isn't one - it's expected that you'll be using Maven or Ivy or 
Gradle, and add the dependency on Tika through that (full details given on 
the site), and the tool will take care of fetching them for you.


We provide single jars for the App and the Server, to make it easy to use 
those as standalone applications - they have the dependencies inlined.


You could probably grab the OSGi bundle and unpack that to get most of the 
jars, but using an automated build tool like Maven or Gradle would be 
simpler and less likely to go wrong!


Nick


Re: Using Tika that comes with Solr 5.2

2016-02-05 Thread Steven White
Good point Uwe.

I went over to Tika's home page and tried to figure out which JARs I
need (so I don't have to use Tika's JARs that come with Solr).  I looked
around and couldn't find a "dist" of the JARs.  From my reading under
Getting Started, it looks like I have to build Tika from source to get the
JARs.  Is this true or am I missing something?  I would like to skip the
build process and simply grab all the required JARs.

Steve


On Wed, Feb 3, 2016 at 12:28 PM, Uwe Schindler wrote:

> Hi,
>
>
>
> Morphlines stuff is not needed at all. This is a Mapreduce/Hadoop
> integration of Solr (see documentation) – mostly command line tools around
> Solr and Hadoop.
>
>
>
> FYI: In Solr we don’t show the warnings, because otherwise the user would
> get a lot of useless warnings. We may fix this when TIKA 2.0 comes with the
> tika-parsers module split into multiple parts. In Solr we currently removed
> all TIKA dependencies that conflict with the rest of Solr/Lucene or are not
> useful for fulltext indexing. We only left those that are useful for
> “fulltext” extraction (e.g. office document formats). But we have no
> parsers for CLASS files (breaks, because of ASM conflict) or Netcdf files
> (License issues in older versions of TIKA).
>
>
>
> I see no reason to show these warnings, because if you use Solr as
> documented, it should work correctly. We no longer support running Solr
> inside foreign Application Servers. So everything should work out of the box.
>
>
>
> Uwe
>
>
>
> -
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: u...@thetaphi.de
>
>
>
> *From:* Steven White [mailto:swhite4...@gmail.com]
> *Sent:* Wednesday, February 03, 2016 5:44 PM
> *To:* user@tika.apache.org
> *Subject:* Re: Using Tika that comes with Solr 5.2
>
>
>
> Nick, that would be a good thing to do: changing Ignore to Warn otherwise
> newcomers will have no clue why this isn't working.
>
>
>
> Another question to the team regarding this topic.
>
>
>
> I see JARs under solr\contrib\morphlines-cell\lib\ and
> solr\contrib\morphlines-core\lib\  The ones under "morphlines-cell" there
> are 2 files with "tika" as their name.  My question is, do I need those for
> general Tika usage?  The README.txt clearly states "*Experimental*" but
> doesn't say if I need them to use Tika.
>
>
>
> Steve
>
>
>
>
>
> On Wed, Feb 3, 2016 at 9:16 AM, Nick Burch wrote:
>
> On Wed, 3 Feb 2016, Uwe Schindler wrote:
>
> The reason for this behaviour is part of TIKA: If a parser cannot load
> because classes it refers to are missing, it is automatically disabled.
> Because you missed the actual PDF/Powerpoint/… classes, this is what
> happens for all those parsers.
>
>
> I wonder if it might be worth SOLR changing their default Tika config from
> Ignore to Warn, so that SOLR users (who probably aren't as clued up on how
> it all works as the average Tika user) will get to find out more quickly
> that they've missed something?
>
> Nick
>
>
>


RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Uwe Schindler
Hi,

 

Morphlines stuff is not needed at all. This is a Mapreduce/Hadoop integration 
of Solr (see documentation) – mostly command line tools around Solr and Hadoop.

 

FYI: In Solr we don’t show the warnings, because otherwise the user would get a 
lot of useless warnings. We may fix this when TIKA 2.0 comes with the 
tika-parsers module split into multiple parts. In Solr we currently removed all 
TIKA dependencies that conflict with the rest of Solr/Lucene or are not useful 
for fulltext indexing. We only left those that are useful for “fulltext” 
extraction (e.g. office document formats). But we have no parsers for CLASS 
files (breaks, because of ASM conflict) or Netcdf files (License issues in 
older versions of TIKA).

 

I see no reason to show these warnings, because if you use Solr as documented, 
it should work correctly. We no longer support running Solr inside foreign 
Application Servers. So everything should work out of the box.

 

Uwe

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Wednesday, February 03, 2016 5:44 PM
To: user@tika.apache.org
Subject: Re: Using Tika that comes with Solr 5.2

 

Nick, that would be a good thing to do: changing Ignore to Warn otherwise 
newcomers will have no clue why this isn't working.

 

Another question to the team regarding this topic.

 

I see JARs under solr\contrib\morphlines-cell\lib\ and 
solr\contrib\morphlines-core\lib\  The ones under "morphlines-cell" there are 2 
files with "tika" as their name.  My question is, do I need those for general 
Tika usage?  The README.txt clearly states "*Experimental*" but doesn't say if 
I need them to use Tika.

 

Steve

 

 

On Wed, Feb 3, 2016 at 9:16 AM, Nick Burch <apa...@gagravarr.org> wrote:

On Wed, 3 Feb 2016, Uwe Schindler wrote:

The reason for this behaviour is part of TIKA: If a parser cannot load because 
classes it refers to are missing, it is automatically disabled. Because you 
missed the actual PDF/Powerpoint/… classes, this is what happens for all those 
parsers.


I wonder if it might be worth SOLR changing their default Tika config from 
Ignore to Warn, so that SOLR users (who probably aren't as clued up on how it 
all works as the average Tika user) will get to find out more quickly that 
they've missed something?

Nick

 



Re: Using Tika that comes with Solr 5.2

2016-02-03 Thread Steven White
Nick, that would be a good thing to do: changing Ignore to Warn otherwise
newcomers will have no clue why this isn't working.

Another question to the team regarding this topic.

I see JARs under solr\contrib\morphlines-cell\lib\ and
solr\contrib\morphlines-core\lib\  The ones under "morphlines-cell" there
are 2 files with "tika" as their name.  My question is, do I need those for
general Tika usage?  The README.txt clearly states "*Experimental*" but
doesn't say if I need them to use Tika.

Steve


On Wed, Feb 3, 2016 at 9:16 AM, Nick Burch wrote:

> On Wed, 3 Feb 2016, Uwe Schindler wrote:
>
>> The reason for this behaviour is part of TIKA: If a parser cannot load
>> because classes it refers to are missing, it is automatically disabled.
>> Because you missed the actual PDF/Powerpoint/… classes, this is what
>> happens for all those parsers.
>>
>
> I wonder if it might be worth SOLR changing their default Tika config from
> Ignore to Warn, so that SOLR users (who probably aren't as clued up on how
> it all works as the average Tika user) will get to find out more quickly
> that they've missed something?
>
> Nick


RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Nick Burch

On Wed, 3 Feb 2016, Uwe Schindler wrote:
The reason for this behaviour is part of TIKA: If a parser cannot load 
because classes it refers to are missing, it is automatically 
disabled. Because you missed the actual PDF/Powerpoint/… classes, this 
is what happens for all those parsers.


I wonder if it might be worth SOLR changing their default Tika config from 
Ignore to Warn, so that SOLR users (who probably aren't as clued up on how 
it all works as the average Tika user) will get to find out more quickly 
that they've missed something?


Nick
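
A quick way to see whether the libraries behind a parser are really on the
classpath is to try loading one representative class from each. The sketch
below uses only the JDK. The ClasspathCheck helper is hypothetical (it is not
part of Tika or Solr), and the class names probed in main are examples chosen
for illustration: PDFBox's PDDocument, POI's POIFSFileSystem, and Tika's own
PDFParser wrapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reports which classes the current classpath can supply.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError for missing transitive classes
            return false;
        }
    }

    /** Checks each name in order and records whether it is loadable. */
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isLoadable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Example probes (assumptions): one class per library a parser depends on.
        Map<String, Boolean> report = check(
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.pdfbox.pdmodel.PDDocument",
                "org.apache.poi.poifs.filesystem.POIFSFileSystem");
        for (Map.Entry<String, Boolean> entry : report.entrySet()) {
            System.out.println((entry.getValue() ? "found    " : "MISSING  ") + entry.getKey());
        }
    }
}
```

If the Tika wrapper class is found but the library class behind it is
reported missing, that matches the silent-disable behaviour described above.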

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Uwe Schindler
Hi,

 

The reason for this behaviour is part of TIKA: If a parser cannot load because 
classes it refers to are missing, it is automatically disabled. Because you 
missed the actual PDF/Powerpoint/… classes, this is what happens for all those 
parsers.

 

Uwe

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Wednesday, February 03, 2016 2:48 PM
To: user@tika.apache.org
Subject: Re: Using Tika that comes with Solr 5.2

 

Thanks everyone.  After posting about this issue, I found my issue.  I was 
missing a whole set of Tika JARs that are found under Solr: 
\solr\contrib\extraction\lib\

 

Steve

 

On Wed, Feb 3, 2016 at 8:29 AM, Nick Burch <apa...@gagravarr.org> wrote:

On Tue, 2 Feb 2016, Steven White wrote:

What I'm finding is that Tika will not extract the raw text off PDF,
Powerpoint, etc. files but it will off raw text files.


I'd suggest you try some of the steps in the troubleshooting page:
  http://wiki.apache.org/tika/Troubleshooting%20Tika
Probably start at the "No Content Extracted" section, and follow the links to 
the possible problems + ways to check

Solr 5.2 comes with the following Tika JARs, all of which I have included:
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar,
tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar,
kite-morphlines-tika-core-0.12.1.jar and
kite-morphlines-tika-decompress-0.12.1.jar


You seem to be missing quite a few of the Tika dependencies, which may well be 
it, follow the troubleshooting guide to check!

Nick

 



Re: Using Tika that comes with Solr 5.2

2016-02-03 Thread Steven White
Thanks everyone.  After posting about this issue, I found my issue.  I was
missing a whole set of Tika JARs that are found under Solr:
\solr\contrib\extraction\lib\

Steve
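
To make sure no JAR from a lib directory is overlooked, the directory can be
listed programmatically and turned into a class loader. This is only a sketch
using JDK classes; the JarDirectory helper is hypothetical, and the default
directory in main is an assumption based on the path mentioned above.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: collects the JARs of one directory.
class JarDirectory {

    /** Lists every *.jar file directly inside dir, sorted by path. */
    static List<Path> listJars(Path dir) throws IOException {
        List<Path> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar);
            }
        }
        Collections.sort(jars);
        return jars;
    }

    /** Builds a class loader that can see all of the given JARs. */
    static URLClassLoader loaderFor(List<Path> jars, ClassLoader parent) throws IOException {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = jars.get(i).toUri().toURL();
        }
        return new URLClassLoader(urls, parent);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "solr/contrib/extraction/lib");
        for (Path jar : listJars(dir)) {
            System.out.println(jar.getFileName());
        }
    }
}
```

For a standalone application it is usually simpler to let the JVM do this
with a classpath wildcard, for example java -cp "solr/contrib/extraction/lib/*"
followed by the remaining classpath entries and the main class.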

On Wed, Feb 3, 2016 at 8:29 AM, Nick Burch wrote:

> On Tue, 2 Feb 2016, Steven White wrote:
>
>> What I'm finding is that Tika will not extract the raw text off PDF,
>> Powerpoint, etc. files but it will off raw text files.
>>
>
> I'd suggest you try some of the steps in the troubleshooting page:
>   http://wiki.apache.org/tika/Troubleshooting%20Tika
> Probably start at the "No Content Extracted" section, and follow the links
> to the possible problems + ways to check
>
>> Solr 5.2 comes with the following Tika JARs, all of which I have included:
>> tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar,
>> tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar,
>> kite-morphlines-tika-core-0.12.1.jar and
>> kite-morphlines-tika-decompress-0.12.1.jar
>>
>
> You seem to be missing quite a few of the Tika dependencies, which may
> well be it, follow the troubleshooting guide to check!
>
> Nick
>


Re: Using Tika that comes with Solr 5.2

2016-02-03 Thread Nick Burch

On Tue, 2 Feb 2016, Steven White wrote:

What I'm finding is that Tika will not extract the raw text off PDF,
Powerpoint, etc. files but it will off raw text files.


I'd suggest you try some of the steps in the troubleshooting page:
  http://wiki.apache.org/tika/Troubleshooting%20Tika
Probably start at the "No Content Extracted" section, and follow the links 
to the possible problems + ways to check



Solr 5.2 comes with the following Tika JARs, all of which I have included:
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar,
tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar,
kite-morphlines-tika-core-0.12.1.jar and
kite-morphlines-tika-decompress-0.12.1.jar


You seem to be missing quite a few of the Tika dependencies, which may 
well be it, follow the troubleshooting guide to check!


Nick


RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
The problem (I think) is that tika-parsers.jar includes just the Tika parsers, 
which are wrappers around a boatload of actual parsers/dependencies (POI, 
PDFBox, etc.). If you are using jars, I’d recommend the tika-app.jar, which 
includes all dependencies.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 02, 2016 7:01 PM
To: user@tika.apache.org
Subject: Using Tika that comes with Solr 5.2

Hi everyone,

I have written a standalone application that works with Solr 5.2.  I'm using 
the existing JARs that come with Solr to index data off a file system.  My 
application scans the file system, looking for files and then uses Tika to 
extract the raw text and then sends the raw text to Solr, using SolrJ, for 
indexing.

What I'm finding is that Tika will not extract the raw text off PDF, 
Powerpoint, etc. files but it will off raw text files.

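Here is the code:

import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public static void parseWithTika() throws Exception {
  File file = new File("C:\\temp\\test.pdf");

  FileInputStream in = new FileInputStream(file);
  // AutoDetectParser detects the file type and delegates to a matching parser
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  // BodyContentHandler collects the plain text of the document body
  BodyContentHandler contentHandler = new BodyContentHandler();

  parser.parse(in, contentHandler, metadata);

  String content = contentHandler.toString();  // <=== 'content' is always an empty string

  in.close();
}

In the above code, 'content' is always empty (the code above is taken from 
https://tika.apache.org/1.8/examples.html)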

Solr 5.2 comes with the following Tika JARs, all of which I have included: 
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, 
vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and 
kite-morphlines-tika-decompress-0.12.1.jar

Any idea why this isn't working?

Thanks!

Steve