Re: [Fwd: [PROPOSAL] index server project]

2006-10-19 Thread Stefan Groschupf

Hi Doug,

we discussed the need for such a tool several times internally and
developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
Having a separate project that depends on Hadoop would be the best
fit for our use cases.


Best,
Stefan



On 18.10.2006 at 23:35, Doug Cutting wrote:

FYI, I just pitched a new project you might be interested in on
[EMAIL PROTECTED].  Dunno if you subscribe to that list, so I'm
spamming you.  If it sounds interesting, please reply there.  My
management at Y! is interested in this, so I'm 'in'.


Doug

-------- Original Message --------
Subject: [PROPOSAL] index server project
Date: Wed, 18 Oct 2006 14:17:30 -0700
From: Doug Cutting <[EMAIL PROTECTED]>
Reply-To: general@lucene.apache.org
To: general@lucene.apache.org

It seems that Nutch and Solr would benefit from a shared index serving
infrastructure.  Other Lucene-based projects might also benefit from
this.  So perhaps we should start a new project to build such a thing.
This could start either in java/contrib, or as a separate sub-project,
depending on interest.

Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably
Hadoop's).  The system would be configured with a single master node
that keeps track of where indexes are located, and a number of slave
nodes that would maintain, search and replicate indexes.  Clients would
talk to the master to find out which indexes to search or update, then
they'll talk directly to slaves to perform searches and updates.

Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written
only once.  Index versions are sets of files, and a new version of an
index is likely to share most files with the prior version.  Versions
are numbered.  An index server should keep old versions of each index
for a while, not immediately removing old files.

public class IndexVersion {
  String id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}
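
For illustration, a sketch of the slave-side synchronization these two
methods imply; LocalStore and its methods are hypothetical stand-ins for
however a slave keeps index files on disk, not part of the proposal.

import java.io.IOException;

public class IndexReplicator {
  // Hypothetical local file store for index versions.
  interface LocalStore {
    boolean contains(IndexVersion v, String file);
    void write(IndexVersion v, String file, byte[] content) throws IOException;
  }

  private final LocalStore localStore;

  public IndexReplicator(LocalStore localStore) {
    this.localStore = localStore;
  }

  // Pull any files of a new index version that we don't yet have locally.
  // Since files are written only once, present files never need refetching.
  public void replicate(SlaveToSlaveProtocol peer, IndexVersion v)
      throws IOException {
    for (String file : peer.getFileSet(v)) {
      if (!localStore.contains(v, file)) {
        // as noted above, bulk data would really move over HTTP
        localStore.write(v, file, peer.getFileContent(v, file));
      }
    }
  }
}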

The master thus maintains the set of indexes that are available for
search, keeps track of which slave should handle changes to an index
and initiates index synchronization between slaves.  The master can be
configured to replicate indexes a specified number of times.

The client library can cache the current set of searchable indexes and
periodically refresh it.  Searches are broadcast to one index with each
id and return merged results.  The client will load-balance both
searches and updates.
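
For illustration, a minimal sketch of that client-side flow against the
interfaces above.  Query and Sort are Lucene's; connect() and merge() are
left abstract because they are hypothetical helpers, not part of the
proposed API.

import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public abstract class SearchClient {
  private final ClientToMasterProtocol master;
  private final Random random = new Random();
  private IndexLocation[] cached; // refreshed periodically from the master

  public SearchClient(ClientToMasterProtocol master) {
    this.master = master;
    this.cached = master.getSearchableIndexes();
  }

  public SearchResults search(Query query, Sort sort, int n) {
    // Group replica locations by index id, then pick one replica per id,
    // load-balancing searches across slaves.
    Map<String, List<IndexLocation>> byId = new HashMap<>();
    for (IndexLocation loc : cached) {
      byId.computeIfAbsent(loc.indexVersion.id, k -> new ArrayList<>()).add(loc);
    }
    List<SearchResults> partials = new ArrayList<>();
    for (List<IndexLocation> replicas : byId.values()) {
      IndexLocation picked = replicas.get(random.nextInt(replicas.size()));
      partials.add(connect(picked.location).search(picked.indexVersion, query, sort, n));
    }
    return merge(partials, n);
  }

  // Hypothetical: open an RPC proxy to a slave.
  protected abstract ClientToSlaveProtocol connect(InetSocketAddress addr);

  // Hypothetical: merge per-index results into one ranked list of n hits.
  protected abstract SearchResults merge(List<SearchResults> partials, int n);
}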

Deletions could be broadcast to all slaves.  That would probably be fast
enough.  Alternately, indexes could be partitioned by a hash of each
document's unique id, permitting deletions to be routed to the
appropriate slave.
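
For example, such routing could reuse the hash that assigned documents to
partitions at index time; a one-method sketch, where the unique-id string
and partition count are assumptions for the example:

// Route a document (or a deletion of it) by hashing its unique id.
// Masking the sign bit keeps the result non-negative.
static int partitionFor(String uniqueId, int numPartitions) {
  return (uniqueId.hashCode() & Integer.MAX_VALUE) % numPartitions;
}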

Does this make sense?  Does it sound like it would be useful to Solr?
To Nutch?  To others?  Who would be interested and able to work on it?

Doug



~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: [PROPOSAL] index server project

2006-10-20 Thread Stefan Groschupf

Hi,

The major goal is scale, right?  A distributed server provides more oomph
than a single-node server can.


Another important goal, from my point of view, would be index
management, such as index updates in production.


Stefan 


Re: [PROPOSAL] index server project

2006-11-06 Thread Stefan Groschupf

Hi,

do people think we are already at a stage where we can set up some
basic infrastructure, like a mailing list and wiki, and move the
discussion there?  Maybe set up an incubator project?


I would be happy to help with such basic tasks.

Stefan



On 31.10.2006 at 22:03, Yonik Seeley wrote:


On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> We assume that, within an index, a file with a given name is written
>> only once.
>
> Is this necessary, and will we need the lockless patch (that avoids
> renaming or rewriting *any* files), or is Lucene's current index
> behavior sufficient?

It's not strictly required, but it would make index synchronization a
lot simpler.  Yes, I was assuming the lockless patch would be committed
to Lucene before this project gets very far.  Something more than that
would be required in order to keep old versions, but this could be as
simple as a Directory subclass that refuses to remove files for a time.


Or a snapshot (hard links) mechanism.
Lucene would also need a way to open a specific index version (rather
than just the latest), but I guess that could also be hacked into
Directory by hiding later "segments" files (assumes lockless is
committed).
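
As a rough sketch of that deferred-delete idea - assuming a much later
Lucene that ships org.apache.lucene.store.FilterDirectory; the class
and retention scheme here are hypothetical, not anything committed:

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;

public class DeferredDeleteDirectory extends FilterDirectory {
  private final long retentionMillis;
  // file name -> time the delete was first requested
  private final Map<String, Long> pendingDeletes = new ConcurrentHashMap<>();

  public DeferredDeleteDirectory(Directory in, long retentionMillis) {
    super(in);
    this.retentionMillis = retentionMillis;
  }

  @Override
  public void deleteFile(String name) {
    // Record the request instead of deleting, so older index versions
    // stay readable by slaves that are still replicating them.
    pendingDeletes.putIfAbsent(name, System.currentTimeMillis());
  }

  // Actually remove files whose retention window has expired.
  public void purgeExpired() throws IOException {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : pendingDeletes.entrySet()) {
      if (now - e.getValue() >= retentionMillis) {
        in.deleteFile(e.getKey());
        pendingDeletes.remove(e.getKey());
      }
    }
  }
}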

> It's unfortunate the master needs to be involved on every document add.

That should not normally be the case.


Ahh... I had assumed that "id" in the following method was document id:

 IndexLocation getUpdateableIndex(String id);

I see now it's index id.

But what is index id exactly?  Looking at the example API you laid
down, it must be a single physical index (as opposed to a logical
index).  In which case, is it entirely up to the client to manage
multi-shard indices?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it
would be up to the client to know this, and to query across the
different "photo0", "photo1", "photo2" indices.  The master would
have no clue those indices were related.  Hmmm, that doesn't work
very well for deletes though.

It seems like there should be the concept of a logical index, that is
composed of multiple shards, and each shard has multiple copies.

Or were you thinking that a cluster would only contain a single
logical index, and hence all different index ids are simply different
shards of that single logical index?  That would seem to be consistent
with ClientToMasterProtocol.getSearchableIndexes() lacking an id
argument.


I was not imagining a real-time system, where the next query after a
document is added would always include that document.  Is that a
requirement?  That's harder.


Not real-time, but it would be nice if we kept it close to what Lucene
can currently provide.
Most people seem fine with a latency of minutes.

At this point I'm mostly trying to see if this functionality would meet
the needs of Solr, Nutch and others.



It depends on the project scope and how extensible things are.
It seems like the master would be a WAR, capable of running stand-alone.

What about index servers (slaves)?  Would this project include just
the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or
also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add
additional methods for Solr (for passing in extra parameters and
returning various extra data such as facets, highlighting, etc).

Must we include a notion of document identity and/or document version in
the mechanism?  Would that facilitate updates and coherency?


It doesn't need to be in the interfaces I don't think, so it depends
on the scope of the index server implementations.

-Yonik



~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: Lucene-based Distributed Index Leveraging Hadoop

2008-04-03 Thread Stefan Groschupf

Hi All,

we are also very much interested in such a system and actually have to
build such a system for a project within the next 3 months.
I would prefer to work on an open source solution instead of doing
another one behind closed doors, though we would need to start coding
pretty soon.  We have 3 full-time developers we could contribute to
such a project for this time.


I'm happy to do all the organisational work, like setting up the
complete infrastructure etc., to get it started.
I suggest we start with a SourceForge project, since this is fast to
set up, and migrate later if we qualify for Apache as a Lucene or
Hadoop subproject - or is it easy to start an Apache incubator project?


We might just need a nice name for the project. Doug, any idea? :-)

Should we start from scratch or with a code contribution?
Does someone still want to contribute their implementation?


Thanks.
Stefan







On Feb 6, 2008, at 10:57 AM, Ning Li wrote:

There have been several proposals for a Lucene-based distributed index
architecture.
1) Doug Cutting's "Index Server Project Proposal" at
   http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
2) Solr's "Distributed Search" at
   http://wiki.apache.org/solr/DistributedSearch
3) Mark Butler's "Distributed Lucene" at
   http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index  
architecture.
Our design differs from the above proposals in the way it leverages
Hadoop as much as possible.  In particular, HDFS is used to reliably
store Lucene instances, Map/Reduce is used to analyze documents and
update Lucene instances in parallel, and Hadoop's IPC framework is
used.  Our design is geared for applications that require a highly
scalable index and where batch updates to each Lucene instance are
acceptable (versus finer-grained document-at-a-time updates).

We have a working implementation of our design and are in the process
of evaluating its performance.  An overview of our design is provided
below.  We welcome feedback and would like to know if you are
interested in working on it.  If so, we would be happy to make the code
publicly available.  At the same time, we would like to collaborate
with people working on existing proposals and see if we can consolidate
our efforts.

TERMINOLOGY
A distributed "index" is partitioned into "shards".  Each shard
corresponds to a Lucene instance and contains a disjoint subset of the
documents in the index.  Each shard is stored in HDFS and served by one
or more "shard servers".  Here we only talk about a single distributed
index, but in practice multiple indexes can be supported.

A "master" keeps track of the shard servers and the shards being served
by them.  An "application" updates and queries the global index through
an "index client".  An index client communicates with the shard servers
to execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the
discussion, some of their parameters have been omitted.

 On the Shard Servers
   // Execute a query on this shard server's Lucene instance.
   // This method is called by an index client.
   SearchResults search(Query query);

 On the Master
   // Tell the master to update the shards, i.e., Lucene instances.
   // This method is called by an index client.
   boolean updateShards(Configuration conf);

   // Ask the master where the shards are located.
   // This method is called by an index client.
   LocatedShards getShardLocations();

   // Send a heartbeat to the master. This method is called by a
   // shard server. In the response, the master informs the
   // shard server when to switch to a newer version of the index.
   ShardServerCommand sendHeartbeat();

QUERYING THE INDEX
To query the index, an application sends a search request to an index
client.  The index client then calls the shard server search() method
for each shard of the index, merges the results and returns them to the
application.  The index client caches the mapping between shards and
shard servers by periodically calling the master's getShardLocations()
method.
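
A minimal sketch of that query path; the types below are simplified
stand-ins for the RPC methods listed above (Query, SearchResults and the
shard-server handle are hypothetical stubs, not the actual
implementation):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class IndexClientSketch {
  interface Query {}
  interface SearchResults {}
  // Stand-in for an RPC proxy to one shard server.
  interface ShardServer { SearchResults search(Query query) throws IOException; }

  // One reachable server per shard, refreshed periodically via the
  // master's getShardLocations().
  private final List<ShardServer> shardServers;

  public IndexClientSketch(List<ShardServer> shardServers) {
    this.shardServers = shardServers;
  }

  public List<SearchResults> search(Query query) throws IOException {
    List<SearchResults> perShard = new ArrayList<>();
    for (ShardServer server : shardServers) {
      perShard.add(server.search(query)); // the search() RPC above
    }
    return perShard; // a real client would merge these into one ranked list
  }
}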

UPDATING THE INDEX USING MAP/REDUCE
To update the index, an application sends an update request to an index
client.  The index client then calls the master's updateShards()
method, which schedules a Map/Reduce job to update the index.  The
Map/Reduce job updates the shards in parallel and copies the new index
files of each shard (i.e., Lucene instance) to HDFS.
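
As a hedged sketch of what the map step of such a job might look like
(not the actual implementation): route each input document to a shard by
hashing its unique id, so each reduce call sees all documents of one
shard and can feed them to a Lucene IndexWriter.  NUM_SHARDS and the
tab-separated id field are assumptions for the example.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ShardRoutingMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int NUM_SHARDS = 3; // hypothetical shard count
  private final IntWritable shard = new IntWritable();

  @Override
  protected void map(LongWritable offset, Text doc, Context ctx)
      throws IOException, InterruptedException {
    // Assume the first tab-separated field is the document's unique id.
    String id = doc.toString().split("\t", 2)[0];
    shard.set((id.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS);
    ctx.write(shard, doc); // all docs of one shard meet in one reducer
  }
}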

The updateShards() method includes a "configuration", which provides
information for updating the shards.  More specifically, the
configuration includes the following information:
 - Input path.  This provides the location of updated documents, e.g.,
   HDFS files or directories, or HBase tables.
 - Input formatter.  This specifies how to format the input documents.
 - Analysis.  This defines the analyzer to use on the input.  The
   analyzer det

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-04-03 Thread Stefan Groschupf

Should we start from scratch or with a code contribution?
Does someone still want to contribute their implementation?

I just noticed - too late though - that Ning already contributed the
code to Hadoop.  So I guess my question should be rephrased: what is
the idea of moving this into its own project?




Re: Lucene Performance and usage alternatives

2008-08-05 Thread Stefan Groschupf
An alternative is always to distribute the index across a set of
servers.  If you need to scale, I guess this is the only long-term
perspective.
You can do your own home-grown Lucene distribution or look into
existing ones.
I'm currently working on katta (http://katta.wiki.sourceforge.net/) -
there is no release yet, but we are in the QA and test cycles.
But there are others as well - Solr, for example, provides distribution
too.


Stefan


On Aug 5, 2008, at 7:21 AM, ezer wrote:



I just made a program using the Java API of Lucene.  It is working
fine for my actual index size, but I am worried about performance with
a bigger index and simultaneous user access.

1) I am worried about the fact of having to write the program in Java.
I searched for alternatives like the C port, but I saw that the version
used is a little old and not many people seem to use it.

2) I am also thinking of compiling the code with GCJ to generate native
code and not use the JVM.  Has anybody tried it?  Could it be an
advantage that approximates the performance of a C program?

3) I won't use an application server; I will call the program directly
from a PHP page.  Is there any architecture model suggested for doing
that?  I mean anticipating many users accessing the program.  Wouldn't
initiating one instance each time someone does a query, and opening the
index, degrade the performance?
--
View this message in context: 
http://www.nabble.com/Lucene-Performance-and-usage-alternatives-tp18832162p18832162.html
Sent from the Lucene - General mailing list archive at Nabble.com.




~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




Re: Lucene-based Distributed Index Leveraging Hadoop

2008-08-21 Thread Stefan Groschupf



Hi All, Hi Mark,


It was interesting to hear Mark Butler present his implementation of
Distributed Lucene at the Hadoop User Group meeting in London on
Tuesday. There's obviously been quite a bit of discussion on the
subject, and lots of interested parties. Mark, not sure if you're on
this list but thanks for sharing.


Is there any material published about this?  I would be very
interested to see Mark's slides and hear about the discussion.




Is this the forum to ask about open projects? I'm interested in
joining a project as long as its goals aren't too distant from what I'm
looking for.  Based mostly on gut feeling I'd rather go for a
stand-alone project that wasn't dependent on HDFS/Hadoop, but willing
to be convinced otherwise.


Rich, as you know there are a couple of projects in this area - Solr,
Compass, dlucene and katta - and since all are open source, I guess the
easiest way to get involved is to join the mailing lists.


I can only speak for katta - we are very interested in getting more
people involved to get other perspectives.  There is quite some
activity in our project, since it is part of an upcoming production
system, but low traffic on the mailing list (so far all developers
work in the same room).


You can find our mailing list on our SourceForge page:
http://katta.wiki.sourceforge.net/

Please keep in mind that katta is very young; Compass or Solr might be
more interesting if you need something working now, though they might
have different goals and focus than dlucene or katta.


Stefan Groschupf


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-08-22 Thread Stefan Groschupf

Hi,


In terms of which project best fits my needs my gut feeling is that
dlucene is pretty close. It supports incremental updates, and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems so would rather keep
things simple if possible).


Updates...
The way we solve this with katta is that we simply deploy a new small
index and use * in the client instead of a fixed index name.
Then, once a night, we merge all the small indexes (since they slow
things down) together into a big new index.
To solve the problem of duplicate documents, each document gets a
timestamp, and in the client we do a simple dedup based on a key and
always use the document with the latest timestamp.
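
A minimal sketch of that client-side dedup; Hit and its fields are
hypothetical stand-ins, not Katta's API:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
  // Hypothetical search hit with a dedup key and an indexing timestamp.
  static class Hit {
    final String key;
    final long timestamp;
    Hit(String key, long timestamp) { this.key = key; this.timestamp = timestamp; }
  }

  // Keep only the newest hit per key, so a document re-indexed into a
  // small nightly index wins over its older copy in the big merged index.
  static Map<String, Hit> dedup(List<Hit> hits) {
    Map<String, Hit> latest = new HashMap<>();
    for (Hit hit : hits) {
      Hit seen = latest.get(hit.key);
      if (seen == null || hit.timestamp > seen.timestamp) {
        latest.put(hit.key, hit);
      }
    }
    return latest;
  }
}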


Dependencies...
Katta is independent of those technologies; it is Lucene, ZooKeeper and
Hadoop RPC (instead of RMI, HTTP or Apache MINA).  We support loading
index shards from a Hadoop file system, but you can also load them from
a mounted remote NAS or whatever you like.



The obvious drawback being that dlucene
doesn't seem to be an active public project.

Mark needs to answer this, but dlucene is checked in to the katta svn,
and I saw Marko checking in changes to dlucene.  There was a discussion
between Mark and me about bringing dlucene and katta together, and I
would really love to see that happen, but unfortunately we had a lot of
pressure from our customer to deliver something, so we had to focus on
other things.  More developers getting involved would clearly help
here.. :-)





Thanks for the reply Stefan. I'll certainly be taking a look through
the code for Katta since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in
less than 4 weeks - so we are working hard to iron out issues.
However, katta has been running for 6 weeks in a 10-node test
environment under heavy load.


Stefan 


[ANN] katta-0.1.0 release - distribute lucene indexes in a grid

2008-09-17 Thread Stefan Groschupf
After 5 months of work we are happy to announce the first developer
preview release of katta.
This release contains all the functionality to serve a large, sharded
Lucene index on many servers.
Katta stands on the shoulders of the giants Lucene, Hadoop and
ZooKeeper.


Main features:
+ Plays well with Hadoop
+ Apache Version 2 License.
+ Node failure tolerance
+ Master failover
+ Shard replication
+ Pluggable network topologies (shard distribution and selection
  policies)
+ Node load balancing at the client



Please give katta a test drive and give us some feedback!

Download:
http://sourceforge.net/project/platformdownload.php?group_id=225750

website:
http://katta.sourceforge.net/

Getting started in less than 3 min:
http://katta.wiki.sourceforge.net/Getting+started

Installation on a grid:
http://katta.wiki.sourceforge.net/Installation

Katta presentation today (09/17/08) at the Hadoop user group, Yahoo
Mission College:

http://upcoming.yahoo.com/event/1075456/
* slides will be available online later


Many thanks for the hard work:
Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec)

I apologize for the cross-posting.


Yours, the Katta Team.

~~~
101tec Inc., Menlo Park, California
http://www.101tec.com






[ANNOUNCE] Katta 0.5 released

2009-04-09 Thread Stefan Groschupf

(...apologies for the cross posting...)

Release 0.5 of Katta is now available.
Katta - Lucene in the cloud.
http://katta.sourceforge.net


This release fixes bugs from 0.4, including one that sorted the
results incorrectly under load.
0.5 also upgrades ZooKeeper to version 3.1, Lucene to version 2.4.1 and
Hadoop to 0.19.0.


The new API supports Lucene Query objects instead of just Strings and
adds support for Amazon EC2; we also switched to Ant and Ivy as the
build system, along with some more minor improvements.
Also, we improved our online documentation and added sample code that
illustrates how to create a sharded Lucene index with Hadoop.


See changes at
http://oss.101tec.com/jira/browse/KATTA?report=com.atlassian.jira.plugin.system.project:changelog-panel

Binary distribution is available at
https://sourceforge.net/projects/katta/

Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com





Re: [ANNOUNCE] Katta 0.5 released

2009-04-09 Thread Stefan Groschupf

Hi Steve,

I don't like babysitting a build system, so I like convention over
configuration.  Maven sounds good for that, but after many years of
being a Maven fan I just could not understand why essential plugins
are still so buggy and why I have to spend so much energy when I want
to customize something.  Obviously Maven describes the project; it is
not a build script.  I also like dependency management.
Gradle will be a great build tool.  It has convention over
configuration, it uses Java-like syntax (Groovy) to write the build
script, and has dependency management etc.  It is actually really
cool, but we adopted it too early.
It had bugs that blocked our productivity.
Now we are back at Ant and use Ivy for dependency management.  Ivy
isn't greatly documented but works pretty solidly for us.  Ant is
solid, though I don't like writing scripts in a declarative language
- XML - and Ant's multi-project build capabilities aren't the
greatest.  Anyhow, we decided on Ant since it is a solid workhorse.


Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Apr 9, 2009, at 9:50 AM, Steven A Rowe wrote:

Oops, just saw on the wiki that "Gradle" (never heard of it before) is
the build system (former build system, I gather from the release
announcement) - I'm still interested in why the switch was made,
though. - Steve


On 4/9/2009 at 12:22 PM, Steven A Rowe wrote:

On 4/9/2009 at 3:16 AM, Stefan Groschupf wrote:

Release 0.5 of Katta is now available.


Congratulations on the release!


[...] switched to Ant and Ivy as a build system [...]


AFAICT, the build previously was performed with Maven 2 - is there a
public discussion available anywhere concerning this switch?  (I looked
and couldn't find anything.)  If there's no public discussion
available, can you say a few words about the rationale behind the
switch?







ScaleCamp: get together the night before Hadoop Summit

2009-05-13 Thread Stefan Groschupf

Hi All,

We are planning a community event the night before the Hadoop Summit.
This "BarCamp" (http://en.wikipedia.org/wiki/BarCamp) event will be  
held at the same venue as the Summit (Santa Clara Marriott).

Refreshments will be served to encourage socializing.

To initiate conversations for the social part of the evening, we are
offering people the opportunity to present an experience report on
their project (within a 15-min presentation).
We have 12 slots max, in 3 parallel tracks.  The focus should be on
projects leveraging technologies from the Hadoop ecosystem.


Please join us and mingle with the rest of the Hadoop community.

To find out more about this event and sign up, please visit:
http://www.scaleunlimited.com/events/scale_camp

Please submit your presentation here:
http://www.scaleunlimited.com/about-us/contact


Stefan
P.S. Please spread the word!
P.P.S. Apologies for the cross-posting.


Re: [APACHECON] Planning

2009-06-17 Thread Stefan Groschupf

Hi Grant,
sorry, I lost track here - is there a list of accepted presentations
somewhere?

Stefan


~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Jun 17, 2009, at 8:42 AM, Grant Ingersoll wrote:

Note, you may not have permission to view that page.  Sorry.  Not my
call.

Also note that it is _MY_ understanding that airfare is no longer
covered as part of the speaker package.  Maybe others can confirm
this.  I'm not sure how this affects people's willingness to speak,
but it is a downer in my mind.  However, the ASF does have a Travel
Assistance Committee that people can apply to for assistance.  I
don't know the details of that.



On Jun 17, 2009, at 10:42 AM, Grant Ingersoll wrote:

OK, we've been allotted 2 days for Lucene:
http://wiki.apache.org/concom-planning/ParticipatingPmcs
More later on info about the Calls for Presentations (CFPs).


Now we need to figure out what we are going to do.

Also, we need, asap, a description that satisfies:

In order to get registration open ASAP, we need a promo-text for each
track. If you're copied on this email, I'll be nagging you for a text
for your project (listed below). If there's someone else I should nag
instead, please let me know. If you know who I should be nagging for
the last three tracks below, please let me know that too.

What should the promo text look like?  We need 150-200 words,
explaining

- what the track will cover (outline is fine, you don't need to have
  abstracts and bios if you're not ready for that),
- who the intended audience is, and
- why people will want to attend/what they'll get out of it.

If you're planning something amazing, cool, new or exciting, we want
some information about that.  Is there going to be a panel discussion
with some of the central project members telling people what to expect
in the widely-anticipated next release?  How about a hands-on
masterclass with that really tricky part of the project that everyone
has trouble with?  Or everything you need to know to decide which
technologies to use in which situations, and how to get the most out
of your limited resources?









Re: [ACUS09] IMPORTANT SPEAKER CONFIRMATION MESSAGE

2009-07-19 Thread Stefan Groschupf

Sorry I'm a day late, but I can confirm I can do a 20 min Katta Intro.

On Jul 16, 2009, at 12:37 AM, Michael Busch wrote:


I confirm I'm coming and that I'd like to give the talk below.
Alternatively, we could also split the talk up into two separate
talks: "Lucene Basics" and "New Features in Lucene and Advanced
Indexing Techniques".



- Intro to Katta (Stefan?) (20 mins)




Re: Organizing the Lucene meetup (Was: ApacheCon US)

2009-10-19 Thread Stefan Groschupf

There is an initial schedule online at:
http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009
Isabel


I still plan to do the Katta introduction.  Is someone officially
maintaining the page, or should I just go ahead and remove the
question mark myself?

Stefan