Re: tika-core, tika-parser

2012-02-09 Thread Markus Jelsma


On Wednesday 08 February 2012 18:27:32 Ken Krugler wrote:
 On Feb 8, 2012, at 5:28am, Markus Jelsma wrote:
  On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
  sorry don't understand what your issue is. We have a dependency on
  tika-parsers and the actual parser implementations (listed in tika
  parsers' POM) are pulled transitively just like any other dependency
  managed by Ivy. They end up being copied in 
  runtime/local/plugins/parse-tika/ or put in the job in runtime/deploy/
  
  My problem is that I am working on some code for tika-parsers
  1.1-SNAPSHOT that I need to use in Nutch. However, when I build
  tika-parsers and put it in Nutch's lib directory I still seem to be
  missing dependencies. Then trouble begins:
 
 I don't know anything about how Nutch handles jars in its lib directory,
 but this sounds like you have a raw jar (tika-parsers) without its
 pom.xml.
 
 So then Ivy (or Maven) doesn't know about the transitive dependencies on
 other jars, which are needed to implement the actual parsing support.

You're right, that's exactly what happened, though I wasn't fully aware 
of it. Thanks.

 
 -- Ken
 
  Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.dwg.DWGParser
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:247)
      at sun.misc.Service$LazyIterator.next(Service.java:271)
      at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149)
      at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
      at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
      at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
      at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
      at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
      at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
  
  Nick told me to remove DWG from the org.apache.tika.parsers.Parsers
  config file, which I did. But then other dependency issues come and go.
  The more parsers I remove from the config file the better it goes, but
  then Tika won't build anymore because of failing tests.
  
  I asked this on the Nutch list because I wasn't sure anymore how Nutch
  deals with its own deps, which you explained well.
  
  I'll give up for now :)
  
 
 --
 Ken Krugler
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Mahout & Solr

-- 
Markus Jelsma - CTO - Openindex


tika-core, tika-parser

2012-02-08 Thread Markus Jelsma
Hi,

Can anyone shed light on this? We don't have any parsers in our libs dir and 
we don't have the tika-parsers jar, only the tika-core jar. Where are the parsers 
and how does this all work? 

I've posted a question (same subject) on the Tika list and Nick tells me there 
must be parsers somewhere. Well, I have no idea how we do it in Nutch, do you?

Thanks


Re: tika-core, tika-parser

2012-02-08 Thread Lewis John Mcgibbney
Hi Markus,

For starters

http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup

Can we pick our way through this?

Thanks

On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi,

 Can anyone shed light on this? We don't have any parsers in our libs dir
 and we don't have tika-parsers jar, only the tika-core jar. Where are the
 parsers and how does this all work?

 I've posted a question (same subject) on the Tika list and Nick tells me
 there must be parsers somewhere. Well, I have no idea how we do it in
 Nutch, do you?

 Thanks




-- 
*Lewis*


Re: tika-core, tika-parser

2012-02-08 Thread Julien Nioche
The dependencies for the plugins are defined locally as shown in the URL
below, where you can see the ref to tika-parsers for parse-tika. Is that
more clear for you Markus?

On 8 February 2012 12:58, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Markus,

 For starters


 http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup

 Can we pick our way through this?

 Thanks






 --
 *Lewis*




-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: tika-core, tika-parser

2012-02-08 Thread Markus Jelsma
Yes, it's listed there indeed! But where are the parser impls then? I'll check 
this out. I must be going crazy or something!

On Wednesday 08 February 2012 13:58:46 Lewis John Mcgibbney wrote:
 Hi Markus,
 
 For starters
 
 http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
 
 Can we pick our way through this?
 
 Thanks
 

-- 
Markus Jelsma - CTO - Openindex


Re: tika-core, tika-parser

2012-02-08 Thread Markus Jelsma
Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's 
something else.

dependencies, dependencies, dependencies :(

On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
 The dependencies for the plugins are defined locally as shown in the URL
 below, where you can see the ref to tika-parsers for parse-tika. Is that
 more clear for you Markus?
 

-- 
Markus Jelsma - CTO - Openindex


Re: tika-core, tika-parser

2012-02-08 Thread Julien Nioche
Sorry, I don't understand what your issue is. We have a dependency on
tika-parsers, and the actual parser implementations (listed in the
tika-parsers POM) are pulled in transitively, just like any other dependency
managed by Ivy. They end up being copied into
runtime/local/plugins/parse-tika/ or put in the job in runtime/deploy/.
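
A quick way to see what actually landed in the plugin directory is to list its jars. The following is only a minimal sketch: the class name PluginJarLister is hypothetical, and the default path is simply the local runtime directory mentioned above, relative to wherever the program is started.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists the jar files found directly inside a plugin directory. */
final class PluginJarLister {

    private PluginJarLister() {}

    /** Returns the sorted file names of all *.jar entries directly under pluginDir. */
    static List<String> listJars(Path pluginDir) throws IOException {
        List<String> names = new ArrayList<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: default path taken from the thread; pass another directory as args[0].
        Path dir = Paths.get(args.length > 0 ? args[0] : "runtime/local/plugins/parse-tika");
        if (!Files.isDirectory(dir)) {
            System.out.println("not a directory: " + dir);
            return;
        }
        for (String name : listJars(dir)) {
            System.out.println(name);
        }
    }
}
```

If the transitive dependencies were resolved, the listing should contain more than just the tika-parsers jar itself.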


On 8 February 2012 13:03, Markus Jelsma markus.jel...@openindex.io wrote:

 Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's
 something else.

 dependencies, dependencies, dependencies :(





-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: tika-core, tika-parser

2012-02-08 Thread Markus Jelsma


On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
 sorry don't understand what your issue is. We have a dependency on
 tika-parsers and the actual parser implementations (listed in tika parsers'
 POM) are pulled transitively just like any other dependency managed by Ivy.
 They end up being copied in  runtime/local/plugins/parse-tika/ or put in
 the job in runtime/deploy/

My problem is that I am working on some code for tika-parsers 1.1-SNAPSHOT 
that I need to use in Nutch. However, when I build tika-parsers and put it in 
Nutch's lib directory I still seem to be missing dependencies. Then trouble 
begins:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.dwg.DWGParser
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at sun.misc.Service$LazyIterator.next(Service.java:271)
    at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149)
    at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
    at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)

Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config 
file, which I did. But then other dependency issues come and go. The more 
parsers I remove from the config file the better it goes, but then Tika won't 
build anymore because of failing tests.

I asked this on the Nutch list because I wasn't sure anymore how Nutch deals 
with its own deps, which you explained well.

I'll give up for now :)



 

-- 
Markus Jelsma - CTO - Openindex


Re: tika-core, tika-parser

2012-02-08 Thread Ken Krugler

On Feb 8, 2012, at 5:28am, Markus Jelsma wrote:

 
 
 On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
 sorry don't understand what your issue is. We have a dependency on
 tika-parsers and the actual parser implementations (listed in tika parsers'
 POM) are pulled transitively just like any other dependency managed by Ivy.
 They end up being copied in  runtime/local/plugins/parse-tika/ or put in
 the job in runtime/deploy/
 
 My problem is that i am working on some code for Tika-parsers 1.1-SNAPSHOT 
 that i need to use in Nutch. However, when i build tika-parsers and put it in 
 Nutch' lib directory i still seem to be missing dependencies. Then trouble 
 begins:

I don't know anything about how Nutch handles jars in its lib directory, but 
this sounds like you have a raw jar (tika-parsers) without its pom.xml.

So then Ivy (or Maven) doesn't know about the transitive dependencies on other 
jars, which are needed to implement the actual parsing support.
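
One way to find out which parser classes are affected, without removing them from the config file one by one, is to try loading every class named in the service registration and report the ones that fail. This is only a sketch under stated assumptions: the class name ServiceEntryCheck is hypothetical, and the default service name org.apache.tika.parser.Parser is my assumption about which META-INF/services file is meant; pass a different name as the first argument if yours differs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic: reports service entries whose classes cannot be loaded. */
final class ServiceEntryCheck {

    private ServiceEntryCheck() {}

    /** Reads all class names registered under META-INF/services/<serviceName>. */
    static List<String> readServiceEntries(ClassLoader loader, String serviceName)
            throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> files = loader.getResources("META-INF/services/" + serviceName);
        while (files.hasMoreElements()) {
            URL file = files.nextElement();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(file.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Everything after '#' is a comment in a service file.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }

    /** Returns the names that fail to load or initialize with the given class loader. */
    static List<String> findUnloadable(List<String> classNames, ClassLoader loader) {
        List<String> failed = new ArrayList<>();
        for (String name : classNames) {
            try {
                // 'true' runs static initializers, which is where a missing
                // dependency typically surfaces as a NoClassDefFoundError.
                Class.forName(name, true, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ServiceEntryCheck.class.getClassLoader();
        // Assumption: default service name; override with args[0].
        String service = args.length > 0 ? args[0] : "org.apache.tika.parser.Parser";
        for (String name : findUnloadable(readServiceEntries(loader, service), loader)) {
            System.out.println("cannot load: " + name);
        }
    }
}
```

Run it with the same classpath as the failing job; each reported class points at a parser whose own dependencies are missing from that classpath.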

-- Ken

 
 Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.dwg.DWGParser
     at java.lang.Class.forName0(Native Method)
     at java.lang.Class.forName(Class.java:247)
     at sun.misc.Service$LazyIterator.next(Service.java:271)
     at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149)
     at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
     at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
     at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
     at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
     at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
 
 Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config 
 file, which I did. But then other dependency issues come and go. The more 
 parsers I remove from the config file the better it goes, but then Tika won't 
 build anymore because of failing tests.
 
 I asked this on the Nutch list because I wasn't sure anymore how Nutch deals 
 with its own deps, which you explained well.
 
 I'll give up for now :)
 
 
 
 

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr