Re: Nutch dev. plans

2009-07-25 Thread Andrzej Bialecki

Kirby Bohling wrote:

On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki wrote:

Doğacan Güney wrote:


There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...


Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?

I found this:

http://neilbartlett.name/blog/osgi-articles


Hi Kirby,

Thanks for your insights - please see my comments below.


Plugins are called Bundles in OSGi parlance, but I'll use plugin as
that's the term used by Nutch.

I have done quite a bit of OSGi work (I used to develop RCP
applications for a living).  OSGi is great, as long as you plan on not
using reflection to retrieve classes directly, and you don't plan on
using a library that uses it directly.

Pretty much every usage like this:

Class clazz = Class.forName(stringFromConfig);
// Code to create an object using this class...

Will fail, unless the code is very classloader aware.  So if you're
going to switch over to using OSGi (which I think would be wonderful),
you'll want to ensure that you can deal with all of the third-party
libraries.  I haven't played much with any of the Declarative Services
stuff (I think that was slated for OSGi, but it might have just been
an Eclipse extension).


This is an important issue - so I think we need first to do some 
experiments, and continue development on a branch for a while ... Still 
the whole ecosystem that OSGI offers is worth the trouble IMHO.





OSGi uses classloader segmentation to allow multiple conflicting
versions of the same code inside the same project.  So having a
pattern like:

Plugin A: nutch.api (Which contains say the interface Parser { })
Plugin B: parser.word (which has class WordParser implements Parser)

Plugin B has to depend on Plugin A so it can see the parser.  In this
case, Plugin A can't have code that uses Class.forName("WordParser");

OSGi changes the default classloader delegation: you can only see
classes in plugins you depend upon, and cycles in the dependencies are
not allowed.
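
The visibility rule described above can be reproduced in plain Java without OSGi. This is a minimal sketch, not OSGi code: IsolationDemo, Parser and WordParser are hypothetical names, and a class loader with a null parent stands in for a bundle that does not depend on the plugin that owns WordParser.

```java
// Sketch: a class loader that cannot "see" a class fails on Class.forName.
// IsolationDemo, Parser and WordParser are hypothetical names for illustration.
class IsolationDemo {

    interface Parser {
        String parse(String input);
    }

    static class WordParser implements Parser {
        public String parse(String input) {
            return "word:" + input;
        }
    }

    /** Returns true if the given loader can resolve the class name. */
    static boolean canSee(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = WordParser.class.getName();
        ClassLoader owner = WordParser.class.getClassLoader();
        // A loader with a null parent delegates only to the bootstrap loader,
        // so application classes are invisible to it.
        ClassLoader isolated = new ClassLoader(null) { };

        System.out.println("owner sees it:    " + canSee(owner, name));
        System.out.println("isolated sees it: " + canSee(isolated, name));
    }
}
```

The owning loader resolves the class while the isolated loader does not, which is the situation Plugin A is in when it tries to load WordParser by name.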


If I understand it correctly, this is pretty much how it's supposed to 
work in our current plugin system ... only it's more primitive and it's 
got some warts ;)




If you want to do that, you end up having to do:

ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");
Class<?> clazz = Class.forName("WordParser", true, loader);

OSGi has an SPI-like way for a plugin to declare that it contributes an
implementation of the Parser interface.  Eclipse 3.x implemented its
Extension/ExtensionPoint system on top of that mechanism.  I believe they
are called services in "raw" OSGi.
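
A hand-rolled version of the lookupPlugin idea could look roughly like the sketch below. Everything here is an assumption for illustration: ParserRegistry, register, lookupPlugin and create are hypothetical names, not Nutch or OSGi APIs, and a real OSGi setup would more likely use the framework's service registry instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a hand-rolled registry; all names are hypothetical.
class ParserRegistryDemo {

    interface Parser {
        String parse(String input);
    }

    static class WordParser implements Parser {
        public WordParser() { }

        public String parse(String input) {
            return "word:" + input;
        }
    }

    /** Maps an implementation class name to the class loader of the plugin that owns it. */
    static final class ParserRegistry {
        private static final Map<String, ClassLoader> LOADERS = new ConcurrentHashMap<>();

        static void register(String className, ClassLoader loader) {
            LOADERS.put(className, loader);
        }

        static ClassLoader lookupPlugin(String className) throws ClassNotFoundException {
            ClassLoader loader = LOADERS.get(className);
            if (loader == null) {
                throw new ClassNotFoundException("No plugin registered for " + className);
            }
            return loader;
        }
    }

    /** Loads the named parser through its owning plugin's loader and instantiates it. */
    static Parser create(String className) {
        try {
            ClassLoader loader = ParserRegistry.lookupPlugin(className);
            Class<?> clazz = Class.forName(className, true, loader);
            return clazz.asSubclass(Parser.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create parser " + className, e);
        }
    }

    public static void main(String[] args) {
        String name = WordParser.class.getName();
        // Each plugin would register its own classes when it is activated.
        ParserRegistry.register(name, WordParser.class.getClassLoader());
        System.out.println(create(name).parse("hello"));
    }
}
```

The API plugin only ever refers to the Parser interface; the implementation class is resolved through the loader that the contributing plugin registered.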

It's not a huge deal to write that yourself for APIs you implement.
The problem is that it can be difficult to integrate really useful
third-party libraries that don't account for this change in
classloader behaviour.  At points it can make it very problematic to
use a specific XML parser that has the features you want (or that some
library you want to use really wants), because such libraries do this
sort of thing all the time.
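
One workaround that is sometimes used for such libraries is to set the thread context class loader around the library call. The sketch below is not something proposed in this thread, and ContextLoaderDemo and callWithContextLoader are hypothetical names. It only helps with libraries that consult the context class loader; code that calls the one-argument Class.forName still resolves against its own defining loader.

```java
import java.util.function.Supplier;

// Sketch only: ContextLoaderDemo and callWithContextLoader are hypothetical names.
class ContextLoaderDemo {

    /** Runs work with pluginLoader as the thread context class loader, then restores the old one. */
    static <T> T callWithContextLoader(ClassLoader pluginLoader, Supplier<T> work) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(pluginLoader);
        try {
            return work.get();
        } finally {
            // Always restore, even if the library call throws.
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a plugin's class loader.
        ClassLoader pluginLoader = new ClassLoader(ContextLoaderDemo.class.getClassLoader()) { };
        boolean seenInside = callWithContextLoader(pluginLoader,
                () -> Thread.currentThread().getContextClassLoader() == pluginLoader);
        System.out.println("library call saw plugin loader: " + seenInside);
    }
}
```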


This doesn't sound too much different from what we do already in Nutch 
plugins.




I'm guessing that Tika isn't ready for this.  Given that it's an
Apache and/or Lucene project, it can probably be addressed.  My guess
is that a number of the libraries they depend upon won't be.


I think we would like Tika to function as an OSGI plugin (or a group of 
plugins?) out of the box so that we could avoid having to wrap it ourselves.



You can use fragments to get away from that (a fragment requires a
host bundle, the fragment's classes are loaded using the same
classloader as the host), but doing that defeats a lot of the
reason for using OSGi (at least in terms of allowing you to use
multiple conflicting libraries in the same application).


Thank you again for the comments - I'm a newbie to OSGI, so I'll 
probably start with small experiments and see how it goes. If you think 
you could help us with this by providing some guidance or help with the 
design then that would be great.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Server suggestion

2009-07-25 Thread Dennis Kubes
My mistake, you're right.  The last processing clusters we built were 
using Xeon quad cores, not i7s.  The i7s were search servers which 
didn't need ECC memory.  AFAICT, Wikipedia is correct and the i7s don't 
yet support ECC.


So my suggestion would be to stick with Xeon procs or something that 
supports ECC for the processing clusters.  I would never build a 
processing cluster that doesn't have ECC memory.  We spent a few weeks 
when we first started trying to track down weird corruption checksum 
bugs ultimately related to using non-ECC memory on a cluster.


Dennis

Doğacan Güney wrote:

Hi Dennis,

On Fri, Jul 24, 2009 at 16:46, Dennis Kubes wrote:


fredericoagent wrote:

If I want to set up Nutch with, let's say, 400 million URLs in the database,
is it better to have 4-5 super fast and loaded servers or 12-15
smaller, cheaper servers?

More smaller servers.  Make sure they are energy efficient though and have a
decent amount of Ram.  If a server goes down, you aren't affected as much.
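
The "aren't affected as much" point can be put into rough numbers. This is a back-of-the-envelope sketch that assumes URLs and load are spread evenly across servers, which a real cluster will only approximate; ClusterSizing and its methods are hypothetical names.

```java
// Sketch only: assumes URLs and load are spread evenly across servers.
class ClusterSizing {

    /** Fraction of total capacity lost when one server fails. */
    static double shareLostOnFailure(int servers) {
        if (servers <= 0) {
            throw new IllegalArgumentException("servers must be positive");
        }
        return 1.0 / servers;
    }

    /** URLs each server must hold, rounded up. */
    static long urlsPerServer(long totalUrls, int servers) {
        if (servers <= 0) {
            throw new IllegalArgumentException("servers must be positive");
        }
        return (totalUrls + servers - 1) / servers;
    }

    public static void main(String[] args) {
        long totalUrls = 400_000_000L; // figure from the question
        for (int servers : new int[] {4, 5, 12, 15}) {
            System.out.printf("%2d servers: %,d URLs each, %.1f%% lost per failure%n",
                    servers, urlsPerServer(totalUrls, servers), 100 * shareLostOnFailure(servers));
        }
    }
}
```

Under that assumption, losing one of 4 servers takes out a quarter of the cluster, while losing one of 12 takes out roughly 8%.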


By superfast I mean the CPU is the latest quad-core or six-core processor
with 6 GB RAM and a 1 or 1.5 TB HD.

By cheap I mean something like a Xeon quad-core 2.26 CPU with 3 GB RAM and
a 500 SATA HD.


Or if anyone can suggest a better ideal spec.

Our first servers were 1GHz (yes, really) running Hadoop 0.04 way back when.
Our first production clusters were core2, 4G ECC, 1 750G hard drive.  These
days we've been building i7 8-core, 12G ECC, 4T raid-5 machines with up to 8
disks, 2U for around 2200.00 each.  If you are looking for a good server
builder check out swt.com. They are Supermicro resellers and build solid
machines.



It suggests here:

http://en.wikipedia.org/wiki/Core_i7#Drawbacks

that Core i7s do not support ECC RAM. Have you run into any issues, or is WP
wrong here?



Suggestions.  Don't skimp on the hard drive, do at least 750G or more. Price
difference is negligible.  Do at least 2G Ram, 4G is better, 8G is better
than that.  You can get up to 12G on regular motherboards these days.  After
that it gets much more expensive.  Do more recent processors, such as core2
or i7.  They are more power efficient per processing unit.  If you want a
really fast machine, do multiple disks in a raid-5 format.

Dennis







Re: Nutch dev. plans

2009-07-25 Thread Kirby Bohling
Comments inline below:

On Sat, Jul 25, 2009 at 2:23 PM, Andrzej Bialecki wrote:
> Kirby Bohling wrote:
>>
>> On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki wrote:
>>>
>>> Doğacan Güney wrote:
>>>
> There's no specific design yet except I can't stand the existing plugin
> framework anymore ... ;) I started reading on OSGI and it seems that it
> supports the functionality that we need, and much more - it certainly looks
> like a better alternative than maintaining our plugin system beyond 1.x ...
>
 Couldn't agree more with the "can't stand plugin framework" :D

 Any good links on OSGI stuff?
>>>
>>> I found this:
>>>
>>> http://neilbartlett.name/blog/osgi-articles
>
> Hi Kirby,
>
> Thanks for your insights - please see my comments below.
>
>> Plugins are called Bundles in OSGi parlance, but I'll use plugin as
>> that's the term used by Nutch.
>>
>> I have done quite a bit of OSGi work (I used to develop RCP
>> applications for a living).  OSGi is great, as long as you plan on not
>> using reflection to retrieve classes directly, and you don't plan on
>> using a library that uses it directly.
>>
>> Pretty much every usage like this:
>>
>> Class clazz = Class.forName(stringFromConfig);
>> // Code to create an object using this class...
>>
>> Will fail, unless the code is very classloader aware.  So if you're
>> going to switch over to using OSGi (which I think would be wonderful),
>> you'll want to ensure that you can deal with all of the third-party
>> libraries.  I haven't played much with any of the Declarative Services
>> stuff (I think that was slated for OSGi, but it might have just been
>> an Eclipse extension).
>
> This is an important issue - so I think we need first to do some
> experiments, and continue development on a branch for a while ... Still the
> whole ecosystem that OSGI offers is worth the trouble IMHO.
>

I think you're correct about it being worthwhile.  I've got a Git
repository that I use for my work; I'll see about setting up a GitHub
repository and start using that as a public place to put some of my stuff
so you can see it.  Unfortunately, I have some proprietary stuff that I can't
contribute back (most of which you don't want anyways).  I do have
bugfixes for core issues that I do have permission to contribute.
It'd be much easier for me to use Git to migrate the work back and
forth between work and there.  It's also much smoother for me to
develop a series of "easy to review" patches using it.


>
>
>> OSGi uses classloader segmentation to allow multiple conflicting
>> versions of the same code inside the same project.  So having a
>> pattern like:
>>
>> Plugin A: nutch.api (Which contains say the interface Parser { })
>> Plugin B: parser.word (which has class WordParser implements Parser)
>>
>> Plugin B has to depend on Plugin A so it can see the parser.  In this
>> case, Plugin A can't have code that uses Class.forName("WordParser");
>>
>> OSGi changes the default classloader delegation: you can only see
>> classes in plugins you depend upon, and cycles in the dependencies are
>> not allowed.
>
> If I understand it correctly, this is pretty much how it's supposed to work
> in our current plugin system ... only it's more primitive and it's got some
> warts ;)

That's a fair and accurate statement.

>
>>
>> If you want to do that, you end up having to do:
>>
>> ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");
>> Class<?> clazz = Class.forName("WordParser", true, loader);
>>
>> OSGi has an SPI-like way for a plugin to declare that it contributes an
>> implementation of the Parser interface.  Eclipse 3.x implemented its
>> Extension/ExtensionPoint system on top of that mechanism.  I believe they
>> are called services in "raw" OSGi.
>>
>> It's not a huge deal to write that yourself for APIs you implement.
>> The problem is that it can be difficult to integrate really useful
>> third-party libraries that don't account for this change in
>> classloader behaviour.  At points it can make it very problematic to
>> use a specific XML parser that has the features you want (or that some
>> library you want to use really wants), because such libraries do this
>> sort of thing all the time.
>
> This doesn't sound too much different from what we do already in Nutch
> plugins.

Yes.  I think that's accurate.

>
>>
>> I'm guessing that Tika isn't ready for this.  Given that it's an
>> Apache and/or Lucene project, it can probably be addressed.  My guess
>> is that a number of the libraries they depend upon won't be.
>
> I think we would like Tika to function as an OSGI plugin (or a group of
> plugins?) out of the box so that we could avoid having to wrap it ourselves.
>

I think Tika as one plugin would lead to a charge of "bloat", given
all the formats it currently supports that you now ship as plugins.
Long term do you see Nutch just supporting everything Tika does "out
of the box" and including all of the dependencie