Re: Is Solr ready for graduation?

2007-01-04 Thread Bertrand Delacretaz

On 1/3/07, Yoav Shapira wrote:

...I'd say definitely ask Lucene first.  (And in general ask the
accepting TLP first, before asking the Incubator).

+1 from me to starting the discussion, and +1 for graduating


Same opinion on all points here.

-Bertrand


Re: Handling disparate data sources in Solr

2007-01-04 Thread Alan Burlison

Chris Hostetter wrote:


For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture.  Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams into a unified Document
module -- it just tries to expose all of the Lucene goodness through an
easy to use, easy to configure, HTTP interface.  Besides the
configuration, Solr's other means of adding value are its
IndexReader management, its caching, and its plugin support for mixing
and matching request handlers, output writers, and field types as easily
as you can mix and match Analyzers.

There has been some discussion about adding plugin support for the
update side of things as well -- at a very simple level this could allow
for messages to be sent via JSON or CSV instead of just XML -- but
there's no reason a more complex update plugin couldn't read in a binary PDF
file and parse it into its appropriate fields ... but we aren't
quite there yet.  Feel free to bring this up on solr-dev if you'd be
interested in working on it.


I'm interested in discussing this further.  I've moved the discussion 
onto solr-dev, as suggested.


--
Alan Burlison
--


Re: Is Solr ready for graduation?

2007-01-04 Thread Erik Hatcher
And, of course, likewise.  Solr is more than ready to get voted on  
for graduation.


Erik


On Jan 4, 2007, at 4:21 AM, Bertrand Delacretaz wrote:


On 1/3/07, Yoav Shapira wrote:

...I'd say definitely ask Lucene first.  (And in general ask the
accepting TLP first, before asking the Incubator).

+1 from me to starting the discussion, and +1 for graduating


Same opinion on all points here.

-Bertrand




Re: Use whiteboard for experimental stuff? (was: duplication in client/ruby/solrb/solr)

2007-01-04 Thread Erik Hatcher


On Jan 4, 2007, at 4:26 AM, Bertrand Delacretaz wrote:

On 1/4/07, Mike Klaas wrote:


...Might labs.apache.org make sense for this project?...


As Flare is closely related to Solr, I think it belongs in our  
repository.


OTOH, it might be good to put such experimental stuff in a
whiteboard directory instead of trunk, to make it clear that it's
not (yet) part of what we're releasing.


Would making this clear in the README files in both the solrb and  
flare directories be sufficient?


I'm happy to move things wherever folks would like.  It's certainly a  
playground of sorts for me right now, though I expect to have flare  
functional very soon.  solrb is already functional, for what it's
worth, though it's not particularly fancy yet.


Erik



Re: Is Solr ready for graduation?

2007-01-04 Thread Bill Au

+1 on going for graduation.

Bill


Re: Use whiteboard for experimental stuff? (was: duplication in client/ruby/solrb/solr)

2007-01-04 Thread Yonik Seeley

On 1/4/07, Erik Hatcher wrote:

Would making this clear in the README files in both the solrb and
flare directories be sufficient?

I'm happy to move things wherever folks would like.  It's certainly a
playground of sorts for me right now, though I expect to have flare
functional very soon.  solrb is already functional, for what it's
worth, though it's not particularly fancy yet.


I don't think the package ant task currently includes
anything in clients, so I don't think there's an issue w.r.t.
releasing.  A README note would be fine.

-Yonik


Re: Is Solr ready for graduation?

2007-01-04 Thread Yoav Shapira

Hi,
For the curious, here's what votes will be needed and what's binding
in them.  It may seem like a long road, but don't be discouraged: for
a project like Solr, there's largely consensus so these votes are
quick and painless.

First, the Solr PPMC must approve the graduation request.  In this
vote, Solr PPMC members' votes are binding.  Unless I'm mistaken,
right now all Solr committers are also PPMC members, or close to it.

Next, the adopting PMC (in this case Lucene) must vote to accept Solr.
In that vote, only Lucene PMC members' votes are binding; you can see
those people at http://lucene.apache.org/who.html#Lucene+PMC .

Finally, after the Lucene PMC approves, we ask the Incubator PMC.  In
that vote, only Incubator PMC members' votes are binding.  You can see
that list of people at http://incubator.apache.org/whoweare.html .

You will note at least several people (such as Yonik and Erik Hatcher)
will have binding votes in more than one of the above votes.  That's
fine, it's even expected, e.g. from mentors.  They can wear multiple
hats without (hopefully) acquiring some clinical disease.

Yoav


Re: Is Solr ready for graduation?

2007-01-04 Thread Yonik Seeley

Thanks for the summary Yoav,
This thread looks like it's the first vote (unless anyone objects), so
here's my +1 for graduation.

-Yonik


Re: Use whiteboard for experimental stuff? (was: duplication in client/ruby/solrb/solr)

2007-01-04 Thread Mike Klaas

On 1/4/07, Bertrand Delacretaz wrote:

On 1/4/07, Mike Klaas wrote:

 ...Might labs.apache.org make sense for this project?...

As Flare is closely related to Solr, I think it belongs in our repository.

OTOH, it might be good to put such experimental stuff in a
whiteboard directory instead of trunk, to make it clear that it's
not (yet) part of what we're releasing.


I'm cool with Flare being part of the Solr repository, as long as it's
done in a way that makes sense.  But Erik has indicated that the current
situation is temporary, so I think we can worry about that later.
There is definitely no point in throwing up administrative hurdles to
cool nascent subprojects.

-Mike


Re: Is Solr ready for graduation?

2007-01-04 Thread Bertrand Delacretaz

On 1/4/07, Yonik Seeley wrote:


...This thread looks like it's the first vote (unless anyone objects), so
here's my +1 for graduation...


I hate to be formal, but I'd much prefer voting to happen in clearly
identified [VOTE] threads.

As the community grows, or when people get busy, this helps in not
missing these all-important threads.

-Bertrand


[VOTE] graduate Solr to Lucene subproject

2007-01-04 Thread Yonik Seeley

It's time that Solr graduate from the incubator and become an official
Lucene subproject.

So, please cast your votes:

[ ] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the
Incubator to become a Lucene subproject.
[ ]  0 Don't care
[ ] -1 Not at this time, stay in the Incubator for now.


-Yonik


Re: [VOTE] graduate Solr to Lucene subproject

2007-01-04 Thread Mike Klaas

On 1/4/07, Yonik Seeley wrote:

It's time that Solr graduate from the incubator and become an official
Lucene subproject.

So, please cast your votes:


+1


Re: [VOTE] graduate Solr to Lucene subproject

2007-01-04 Thread Bertrand Delacretaz

[X ] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the
Incubator to become a Lucene subproject.


-Bertrand


Re: [VOTE] graduate Solr to Lucene subproject

2007-01-04 Thread Ryan McKinley


[x] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the
Incubator to become a Lucene subproject.


I'm new to Solr, but it's been great so far.  The community is great,
and I will do whatever I can to make it better.


solr-42

2007-01-04 Thread mirko
Hi,

I was wondering if the highlighting problems with
HTMLStripWhitespaceTokenizerFactory (see
http://issues.apache.org/jira/browse/SOLR-42) could be resolved in
the following simple way.

The HTMLStripWhitespaceTokenizerFactory basically passes the
input through an HTMLStripReader, which removes the HTML, and then passes
the result to the WhitespaceTokenizer.  If the HTMLStripReader simply replaced
the HTML with spaces (the same length as the removed HTML), then the positions
for the highlighter would be correct.  And most of the Tokenizers would
be happy with this solution (except maybe the KeywordTokenizer).
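
To make the idea concrete, here is a minimal sketch in Python (not Solr's
Java code).  The regex tag matcher and the names strip_html_padded and
whitespace_tokens are made up for illustration; a real HTMLStripReader also
has to deal with entities, comments and the like, which this ignores.

```python
import re

# Naive tag pattern, an assumption for this sketch only.
TAG = re.compile(r"<[^>]+>")


def strip_html_padded(text):
    """Replace each tag with the same number of spaces, so length is unchanged."""
    return TAG.sub(lambda m: " " * len(m.group(0)), text)


def whitespace_tokens(text):
    """Return (token, start, end) triples, roughly what a whitespace tokenizer emits."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


original = "<b>mylonitic</b> fabrics in <i>NW</i> Iberia"
padded = strip_html_padded(original)
tokens = whitespace_tokens(padded)

# Every token offset computed on the padded text also points at the
# same characters in the original, unstripped text.
for token, start, end in tokens:
    print(token, start, end, original[start:end])
```

One side effect worth checking before adopting this: markup in the middle of
a word (e.g. <SUP>40</SUP>Ar) would now produce two tokens, where plain
stripping produces one.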

mirko


Re: Handling disparate data sources in Solr

2007-01-04 Thread Alan Burlison

Original problem statement:

--
I'm considering using Solr to replace an existing bare-metal Lucene 
deployment - the current Lucene setup is embedded inside an existing 
monolithic webapp, and I want to factor out the search functionality 
into a separate webapp so it can be reused more easily.


At present the content of the Lucene index comes from many different 
sources (web pages, documents, blog posts etc) and can be different 
formats (plaintext, HTML, PDF etc).  All the various content types are 
rendered to plaintext before being inserted into the Lucene index.


The net result is that the data in one field in the index (say
content) may have come from one of a number of source document types.
I'm having difficulty understanding how I might map this functionality
onto Solr.  I understand how (for example) I could use
HTMLStripStandardTokenizer to insert the contents of an HTML document
into a field called content, but (assuming I'd written a PDF analyser)
how would I insert the content of a PDF document into the same content
field?


I know I could do this by preprocessing the various document types to 
plaintext in the various Solr clients before inserting the data into the 
index, but that means that each client would need to know how to do the 
document transformation.  As well as centralising the index, I also want 
to centralise the handling of the different document types.

--

My initial suggestion, to get the discussion started, is to extend the
<doc> and <field> elements with the following attributes:


mime-type
Mime type of the document, e.g. application/pdf, text/html and so on.

encoding
Encoding of the document, with base64 being the standard implementation.

href
The URL of any documents that can be accessed over HTTP, instead of 
embedding them in the indexing request.  The indexer would fetch the 
document using the specified URL.


There would then be entries in the configuration file that map each MIME 
type to a handler that is capable of dealing with that document type.
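
A rough sketch of the dispatch side, in Python rather than Java to keep it
short.  The HANDLERS dictionary standing in for the configuration file, the
handler functions, and the field layout (a dict with a "value" key) are all
assumptions for illustration, not an existing Solr API.  Fetching via href
is left out.

```python
import base64
import re


def handle_plain(data):
    """Plain text needs no conversion beyond decoding."""
    return data.decode("utf-8")


def handle_html(data):
    """Very naive HTML-to-text conversion, good enough for a sketch."""
    return re.sub(r"<[^>]+>", " ", data.decode("utf-8"))


# Stand-in for the proposed config entries mapping MIME types to handlers.
HANDLERS = {
    "text/plain": handle_plain,
    "text/html": handle_html,
}


def extract_text(field):
    """Turn one field of an update request into plaintext for indexing."""
    raw = field["value"]
    if field.get("encoding") == "base64":
        data = base64.b64decode(raw)
    else:
        data = raw.encode("utf-8")
    mime = field.get("mime-type", "text/plain")
    if mime not in HANDLERS:
        raise ValueError("no handler configured for MIME type %r" % mime)
    return HANDLERS[mime](data)


# Hypothetical field as a client might send it.
field = {
    "mime-type": "text/html",
    "encoding": "base64",
    "value": base64.b64encode(b"<p>hello</p>").decode("ascii"),
}
print(extract_text(field))
```

A PDF handler would be one more entry in the mapping, which is the point:
clients only label what they send, and the conversion lives in one place.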


Thoughts?

--
Alan Burlison
--


Re: [VOTE] graduate Solr to Lucene subproject

2007-01-04 Thread Chris Hostetter
: [ ] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the
: Incubator to become a Lucene subproject.

+1



-Hoss



[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

2007-01-04 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462338
 ] 

Hoss Man commented on SOLR-42:
--

Suggestion from Mirko on solr-dev: change HTMLStripReader to replace stripped 
HTML with equal-length whitespace.

(this could possibly be made a constructor option)

 Highlighting problems with HTMLStripWhitespaceTokenizerFactory
 --

 Key: SOLR-42
 URL: https://issues.apache.org/jira/browse/SOLR-42
 Project: Solr
  Issue Type: Bug
  Components: update
Reporter: Andrew May

 Indexing content that contains HTML markup causes problems with highlighting 
 if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names 
 from being searchable).
 Example title field:
 <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a 
 polyorogenic terrane of NW Iberia
 Searching for title:fabrics with highlighting on, the highlighted version has 
 the <em> tags in the wrong place - 22 characters to the left of where they 
 should be (i.e. the sum of the lengths of the tags).
 Response from Yonik on the solr-user mailing-list:
 HTMLStripWhitespaceTokenizerFactory works in two phases...
 HTMLStripReader removes the HTML and passes the result to
 WhitespaceTokenizer... at that point, Tokens are generated, but the
 offsets will correspond to the text after HTML removal, not before.
 I did it this way so that HTMLStripReader  could go before any
 tokenizer (like StandardTokenizer).
 Can you open a JIRA bug for this?  The fix would be a special version
 of HTMLStripReader integrated with a WhitespaceTokenizer to keep
 offsets correct. 
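
 The 22-character drift can be reproduced with a few lines of Python.  This
 is only a sketch: it assumes the original title carried <SUP>...</SUP>
 markup (the archived copy appears to have lost the angle brackets) and uses
 a naive regex in place of HTMLStripReader.

```python
import re

# Assumed original field value, with the markup restored.
title = ("<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
         "fabrics in a polyorogenic terrane of NW Iberia")

# What the tokenizer sees once the tags have been removed.
stripped = re.sub(r"<[^>]+>", "", title)

# Offsets of the matching term, measured in the stripped text.
start = stripped.index("fabrics")
end = start + len("fabrics")

# Applying those offsets to the stored, unstripped value lands too early.
wrong_span = title[start:end]
drift = title.index("fabrics") - start

print(repr(wrong_span), drift)
```

 The drift equals the combined length of the removed tags: two <SUP> tags
 (5 characters each) plus two </SUP> tags (6 each).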
