Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

2011-04-08 Thread Earwin Burrfoot
On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : -1. These files should be readable, for maintaining, debugging and
 : knowing what's going on.

 Readability is my main concern ... I don't know (and frequently can't
 tell) the difference between a lot of non-ASCII characters -- and I'm
 guessing I'm not alone.  When it's spelled out explicitly using the
 character name or escape code, there is no ambiguity about what character
 was intended, or whether it got screwed up by some tool along the way (i.e.:
 the svn server, an svn client, the patch command, a text editor, an IDE,
 ant's fixcrlf task, etc...)

 Please take the time, just 5 or 10 minutes, to look through some of this
 source code and these tests.

 Imagine if you couldn't just look at the code to see what it does, but
 had to decode from some crazy numeric encoding scheme.
 Imagine if it were this way for things like stopword lists too.

 It would be basically impossible for you to look at the code and
 figure out what it does!
 For example, try looking at the Thai analyzer tests: if these were all
 numbers, how would you know wtf is going on?

 Although this comes up from time to time, I stand firm on my -1
 because it's important to me for the source code to be readable.
 I'm not willing to give this up just because some people cannot read
 writing system XYZ.

 I have said before that I'm willing to change my -1 vote on this if *ALL*
 string constants (including English ones) are changed to character
 escapes.
 If you imagine what the code would look like if English string
 constants were instead codes, then I think you will understand my
 point of view!

 It's really, really important for source-code readability to be able to
 open a file and understand what it does, not to have to use some
 decoder because it uses characters other people don't understand.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



I think having both raw characters /and/ the encoded representation is
best (one of them in comments).
I'm all for Unicode sources, but at least two things hit me repeatedly:
1. Tools do screw up, and you have to recover somehow.
E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding
(MacRoman in my case) and I've lost some text on things I shelved but never
committed anywhere.
2. There are characters that look all the same.
E.g. different whitespace/dashes. Or, (if you have Cyrillic in your
fonts) I dare you to discern between a/а, c/с, e/е, o/о.
These are different characters from the Latin and Cyrillic scripts (left
Latin/right Cyrillic), but in 99% of fonts they are visually identical.
I had a filter that folded similar-looking characters, and it
was documented in exactly this way: raw char + code.
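E.g. something like this (a made-up Java snippet just to illustrate the
convention, not actual Lucene code):

    // U+0430 CYRILLIC SMALL LETTER A - looks just like U+0061 LATIN 'a'
    static final char CYRILLIC_A = 'а';

That way a tool-mangled character is obvious from the comment alone.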

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene Search with MoreLikeThis

2011-04-08 Thread bsargurunathan
Hi,
I need some ideas about Lucene MoreLikeThis. I want to search records
matching on more than one field using MoreLikeThis. Right now I have code
like:

IndexReader indexreader =
    IndexReader.open(PropertyManager.getStringProperty("FAW.Lucene.index_path"));
IndexSearcher searcher = new IndexSearcher(indexreader);
MoreLikeThis mlt = new MoreLikeThis(indexreader);
mlt.setMinDocFreq(0);
mlt.setMinTermFreq(0);
mlt.setFieldNames(new String[]{"serviceNumber", "env"});
// configure mlt before building the query; hits is from an earlier search
Query query = mlt.like(hits.id(0));
Hits list = searcher.search(query);
Iterator itr1 = list.iterator();
while (itr1.hasNext()) { ... }

But it's fetching all the records matching serviceNumber and env
separately. I need the records common to both fields...

Like..
ServiceNumber  env   Value
1              env1  2
1              env2  3
2              env1  4
1              env1  5

If the data are like this, I want to fetch:
ServiceNumber  env   Value
1              env1  2
1              env1  5

This is my requirement. Can anyone help me or share any ideas? Thanks in
advance.
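
One idea I was wondering about (an untested sketch, assuming the Lucene
2.x BooleanQuery/TermQuery APIs) is to wrap the MoreLikeThis query so both
fields are required:

    BooleanQuery combined = new BooleanQuery();
    combined.add(query, BooleanClause.Occur.MUST);
    combined.add(new TermQuery(new Term("serviceNumber", "1")),
        BooleanClause.Occur.MUST);
    combined.add(new TermQuery(new Term("env", "env1")),
        BooleanClause.Occur.MUST);
    Hits common = searcher.search(combined);

Is that the right direction?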

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lucene-Search-with-MoreLiktThis-tp2794419p2794419.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure

2011-04-08 Thread Michael McCandless
OOME on this one... we really need the dump-heap-on-OOME JRE command-line
option set...
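(That's -XX:+HeapDumpOnOutOfMemoryError on the Sun/Oracle JRE, plus
-XX:HeapDumpPath=/some/path to control where the dump lands.)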

Mike

http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 10:34 PM, Apache Hudson Server
hud...@hudson.apache.org wrote:
 Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1523/

 1 tests failed.
 REGRESSION:  org.apache.lucene.index.TestNRTThreads.testNRTThreads

 Error Message:
 Some threads threw uncaught exceptions!

 Stack Trace:
 junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
        at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
        at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
        at 
 org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:521)




 Build Log (for compile errors):
 [...truncated 11839 lines...]



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Google Summer of Code 2011 participation

2011-04-08 Thread Michael McCandless
Anyone can participate in Lucene/Solr!  You don't need to be GSoC
student to do so...

Browse the issues in Jira (maybe focusing on the ones marked for GSoC
and not already taken), or open your own issues, discuss, post
patches, iterate, etc.

Find your itch and scratch it ;)

And there are a great many itches out there that need scratching...

Mike

http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 9:34 PM, Minh Doan daywed...@gmail.com wrote:
 Hi folks,
 Receiving a bunch of emails recently about GSoC, I really want to join, but
 it seems I'm not eligible to, even though I used to be a PhD student
 and am currently on leave (I will probably be back soon). I really want to
 contribute to Lucene to implement some of my ideas. Can I have a Lucene
 mentor, like the expert mentors who are excited about GSoC?

 Best,
 Minh
 On Tue, Apr 5, 2011 at 7:06 AM, Steven A Rowe sar...@syr.edu wrote:

 Hi Jayendra,

 From
 http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#who:

 "In order to participate in the program, you must be a student. Google
 defines a student as an individual enrolled in or accepted into an
 accredited institution including (but not necessarily limited to) colleges,
 universities, masters programs, PhD programs and undergraduate programs. You
 are eligible to apply if you are enrolled in an accredited university
 educational program provided you meet all of the other eligibility
 requirements. You should be prepared, upon request, to provide Google with
 transcripts or other documentation from your accredited institution as proof
 of enrollment or admission status. Computer Science does not need to be your
 field of study in order to participate in the program.

 You may be enrolled as a full-time or part-time student. You must also be
 eligible to work in the country in which you'll reside throughout the
 duration of the program, e.g. if you are in the United States on an F-1
 visa, you are welcome to apply to Google Summer of Code as long as you have
 U.S. work authorization. For F-1 students applying for CPT, Google will
 furnish you with a letter you can provide to your university to get CPT
 established once your application to the program has been accepted."

  -Original Message-
  From: Jayendra Patil [mailto:jayendra.patil@gmail.com]
  Sent: Tuesday, April 05, 2011 9:56 AM
  To: dev@lucene.apache.org
  Subject: Google Summer of Code 2011 participation
 
  Hi,
 
  Does Google Summer of Code 2011 apply only to students?
  I have been working on Solr for quite some time now and would like to
  start contributing back.
  I have been using it to index structured and unstructured data and have
  a fair bit of knowledge of the internals as well. (I have a few JIRAs
  and patches submitted.)
  I don't have a specific proposal in mind yet, but would like to start
  with any specific area or issues.

  Let me know if and how I can participate.
 
  Regards,
  Jayendra
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org




 --
 ---
 Minh



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

2011-04-08 Thread Robert Muir
On Fri, Apr 8, 2011 at 2:49 AM, Earwin Burrfoot ear...@gmail.com wrote:
 On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : -1. These files should be readable, for maintaining, debugging and
 : knowing what's going on.

 Readability is my main concern ... I don't know (and frequently can't
 tell) the difference between a lot of non-ASCII characters -- and I'm
 guessing I'm not alone.  When it's spelled out explicitly using the
 character name or escape code, there is no ambiguity about what character
 was intended, or whether it got screwed up by some tool along the way (i.e.:
 the svn server, an svn client, the patch command, a text editor, an IDE,
 ant's fixcrlf task, etc...)

 Please take the time, just 5 or 10 minutes, to look through some of this
 source code and these tests.

 Imagine if you couldn't just look at the code to see what it does, but
 had to decode from some crazy numeric encoding scheme.
 Imagine if it were this way for things like stopword lists too.

 It would be basically impossible for you to look at the code and
 figure out what it does!
 For example, try looking at the Thai analyzer tests: if these were all
 numbers, how would you know wtf is going on?

 Although this comes up from time to time, I stand firm on my -1
 because it's important to me for the source code to be readable.
 I'm not willing to give this up just because some people cannot read
 writing system XYZ.

 I have said before that I'm willing to change my -1 vote on this if *ALL*
 string constants (including English ones) are changed to character
 escapes.
 If you imagine what the code would look like if English string
 constants were instead codes, then I think you will understand my
 point of view!

 It's really, really important for source-code readability to be able to
 open a file and understand what it does, not to have to use some
 decoder because it uses characters other people don't understand.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 I think having both raw characters /and/ the encoded representation is
 best (one of them in comments).
 I'm all for Unicode sources, but at least two things hit me repeatedly:
 1. Tools do screw up, and you have to recover somehow.
 E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding
 (MacRoman in my case) and I've lost some text on things I shelved but never
 committed anywhere.
 2. There are characters that look all the same.
 E.g. different whitespace/dashes. Or, (if you have Cyrillic in your
 fonts) I dare you to discern between a/а, c/с, e/е, o/о.
 These are different characters from the Latin and Cyrillic scripts (left
 Latin/right Cyrillic), but in 99% of fonts they are visually identical.
 I had a filter that folded similar-looking characters, and it
 was documented in exactly this way: raw char + code.


I've worked with a lot of characters in Eclipse, and the ones that
confuse my eyes the most are l/1 and O/0.

So again, if we do this, then we must do it for all English text, too.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: My GSOC proposal

2011-04-08 Thread Varun Thacker
I have refined my proposal here: http://goo.gl/uYXrV

Are there any suggestions for updating my proposal before
today's deadline?

On Thu, Apr 7, 2011 at 9:28 AM, Varun Thacker varunthacker1...@gmail.comwrote:

 I have updated my proposal online to mention the time I would be able to
 dedicate to the project.


 On Thu, Apr 7, 2011 at 7:05 AM, Adriano Crestani 
 adrianocrest...@gmail.com wrote:

 Hi Varun,

 Nice proposal, very complete. Only one thing is missing: you should mention
 somewhere how many hours a week you are willing to spend working on the
 project, and whether there are any holidays you won't be able to work.

 Good luck ;)


 On Wed, Apr 6, 2011 at 5:57 PM, Varun Thacker varunthacker1...@gmail.com
  wrote:

 I have drafted the proposal on the official GSoC website . This is the
 link to my proposal http://goo.gl/uYXrV . Please do let me know if
 anything needs to be changed ,added or removed.

 I will keep on working on it till the deadline on the 8th.

 On Wed, Apr 6, 2011 at 11:41 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 That test code looks good -- you really should have seen awful
 performance had you used O_DIRECT since you read byte by byte.

 A more realistic test is to read a whole buffer (e.g. 4 KB is what
 Lucene now uses during merging, but we'd probably up this to ~1 MB
 when using O_DIRECT).
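
 In Java terms the whole-buffer loop is trivial; a minimal sketch (not
 Lucene code):

     import java.io.FileInputStream;
     import java.io.IOException;

     public class BufferedReadTest {
       public static void main(String[] args) throws IOException {
         byte[] buf = new byte[4096];  // whole-buffer reads, not byte by byte
         FileInputStream in = new FileInputStream(args[0]);
         long total = 0;
         int n;
         while ((n = in.read(buf, 0, buf.length)) != -1) {
           total += n;  // consume the n bytes just read
         }
         in.close();
         System.out.println("read " + total + " bytes");
       }
     }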

 Linus does hate O_DIRECT (see http://kerneltrap.org/node/7563), and
 for good reason: its existence means projects like ours can use it to
 work around limitations in the Linux IO apis that control the buffer
 cache when, otherwise, we might conceivably make patches to fix Linux
 correctly.  It's an escape hatch, and we all use the escape hatch
 instead of trying to fix Linux for real...

 For example the NOREUSE flag is a no-op now in Linux, which is a
 shame, because that's precisely the flag we'd want to use for merging
 (along with SEQUENTIAL).  Had that flag been implemented well, it'd
 give better results than our workaround using O_DIRECT.

 Anyway, given how things are, until we can get more control (way
 up in Javaland) over the buffer cache, O_DIRECT (via a native directory
 impl through JNI) is our only real option today.

 More details here:
 http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html

 Note that other OSs likely do a better job and actually implement
 NOREUSE, and similar APIs, so the generic Unix/WindowsNativeDirectory
 would simply use NOREUSE on these platforms for I/O during segment
 merging.

 Mike

 http://blog.mikemccandless.com

 On Wed, Apr 6, 2011 at 11:56 AM, Varun Thacker
  varunthacker1...@gmail.com wrote:
  Hi. I wrote some sample code to test the speed difference between
  SEQUENTIAL and O_DIRECT reads (I used the madvise flag MADV_DONTNEED).

  This is the link to the code: http://pastebin.com/8QywKGyS

  There was a speed difference when I switched between the two flags. I
  did not use the O_DIRECT flag itself because Linus had criticized it.

  Is this what the flags are intended to be used for? This is just sample
  code with a test file.
 
  On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer
  simon.willna...@googlemail.com wrote:
  Hey Varun,
  On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
  Hi Varun,
 
  Those two issues would make a great GSoC!  Comments below...
  +1
 
  On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
  varunthacker1...@gmail.com wrote:
 
  I would like to combine two tasks as part of my project, namely
  "Directory createOutput and openInput should take an IOContext"
  (LUCENE-2793), and complement it with "Generalize DirectIOLinuxDir to
  UnixDir" (LUCENE-2795).
 
  The first part of the project is aimed at significantly reducing the
  time taken to search during indexing, by adding an IOContext which
  would store buffer size and have options to bypass the OS's buffer
  cache (this is what causes the slowdown in search) and other hints.
  Once completed I would move on to LUCENE-2795 and generalize the
  Directory implementation to make a UnixDirectory.
 
  So, the first part (LUCENE-2793) should cause no change at all to
  performance, functionality, etc., because it's merely installing
 the
  plumbing (IOContext threaded throughout the low-level store APIs in
  Lucene) so that higher levels can send important details down to the
  Directory.  We'd fix IndexWriter/IndexReader to fill out this
  IOContext with the details (merging, flushing, new reader, etc.).
 
  There's some fun/freedom here in figuring out just what details should
  be included in IOContext... (e.g.: is it low level, "set buffer size
  to 4 KB", or is it high level, "I am opening a new near-real-time
  reader"?)
 
  This first step is a rote cutover, just changing APIs but in no way
  taking advantage of the new APIs.
 
  The 2nd step (LUCENE-2795) would then take advantage of this
 plumbing,
  by creating a UnixDir impl that, using JNI (C code), passes advanced
  

[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6867 - Failure

2011-04-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6867/

1 tests failed.
REGRESSION:  org.apache.solr.cloud.ZkControllerTest.testUploadToCloud

Error Message:
KeeperErrorCode = ConnectionLoss for /configs/config1/schema-reversed.xml

Stack Trace:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /configs/config1/schema-reversed.xml
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
at 
org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:224)
at 
org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:388)
at 
org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:410)
at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:520)
at 
org.apache.solr.cloud.ZkControllerTest.testUploadToCloud(ZkControllerTest.java:191)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)




Build Log (for compile errors):
[...truncated 9073 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Google Summer of Code 2011 participation

2011-04-08 Thread Simon Willnauer
On Fri, Apr 8, 2011 at 12:11 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Anyone can participate in Lucene/Solr!  You don't need to be GSoC
 student to do so...

 Browse the issues in Jira (maybe focusing on the ones marked for GSoC
 and not already taken), or open your own issues, discuss, post
 patches, iterate, etc.

 Find your itch and scratch it ;)

+1, we are all around and will jump on the issue to guide you. Find one, ask
questions if you have any, and start discussions / coding!

simon

 And there are a great many itches out there that need scratching...

 Mike

 http://blog.mikemccandless.com

 On Thu, Apr 7, 2011 at 9:34 PM, Minh Doan daywed...@gmail.com wrote:
 Hi folks,
 Receiving a bunch of emails recently about GSoC, I really want to join, but
 it seems I'm not eligible to, even though I used to be a PhD student
 and am currently on leave (I will probably be back soon). I really want to
 contribute to Lucene to implement some of my ideas. Can I have a Lucene
 mentor, like the expert mentors who are excited about GSoC?

 Best,
 Minh
 On Tue, Apr 5, 2011 at 7:06 AM, Steven A Rowe sar...@syr.edu wrote:

 Hi Jayendra,

 From
 http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#who:

 "In order to participate in the program, you must be a student. Google
 defines a student as an individual enrolled in or accepted into an
 accredited institution including (but not necessarily limited to) colleges,
 universities, masters programs, PhD programs and undergraduate programs. You
 are eligible to apply if you are enrolled in an accredited university
 educational program provided you meet all of the other eligibility
 requirements. You should be prepared, upon request, to provide Google with
 transcripts or other documentation from your accredited institution as proof
 of enrollment or admission status. Computer Science does not need to be your
 field of study in order to participate in the program.

 You may be enrolled as a full-time or part-time student. You must also be
 eligible to work in the country in which you'll reside throughout the
 duration of the program, e.g. if you are in the United States on an F-1
 visa, you are welcome to apply to Google Summer of Code as long as you have
 U.S. work authorization. For F-1 students applying for CPT, Google will
 furnish you with a letter you can provide to your university to get CPT
 established once your application to the program has been accepted."

  -Original Message-
  From: Jayendra Patil [mailto:jayendra.patil@gmail.com]
  Sent: Tuesday, April 05, 2011 9:56 AM
  To: dev@lucene.apache.org
  Subject: Google Summer of Code 2011 participation
 
  Hi,
 
  Does Google Summer of Code 2011 apply only to students?
  I have been working on Solr for quite some time now and would like to
  start contributing back.
  I have been using it to index structured and unstructured data and have
  a fair bit of knowledge of the internals as well. (I have a few JIRAs
  and patches submitted.)
  I don't have a specific proposal in mind yet, but would like to start
  with any specific area or issues.

  Let me know if and how I can participate.
 
  Regards,
  Jayendra
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org




 --
 ---
 Minh



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2459) LogLevelSelection Servlet outputs plain HTML

2011-04-08 Thread Stefan Matheis (steffkes) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Matheis (steffkes) updated SOLR-2459:


Description: 
The currently available output of the LogLevelSelection servlet is plain HTML, 
which makes it impossible to integrate the logging information into the new 
Admin UI. Format-agnostic output (like every [?] other servlet offers) would be 
really nice!

Just as an idea for a future structure, the new admin UI is [actually based on 
that 
json-structure|https://github.com/steffkes/solr-admin/blob/master/logging.json] 
:)

  was:
The currently available output of the LogLevelSelection servlet is plain HTML, 
which makes it impossible to integrate the logging information into the new 
Admin UI. Format-agnostic output (like every [?] other servlet offers) would be 
really nice!

Just as an idea for a future structure, the new admin UI is 
[https://github.com/steffkes/solr-admin/blob/master/logging.json|actually based 
on that json-structure] :)


 LogLevelSelection Servlet outputs plain HTML
 

 Key: SOLR-2459
 URL: https://issues.apache.org/jira/browse/SOLR-2459
 Project: Solr
  Issue Type: Wish
  Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Trivial

 The currently available output of the LogLevelSelection servlet is plain HTML, 
 which makes it impossible to integrate the logging information into the new 
 Admin UI. Format-agnostic output (like every [?] other servlet offers) would 
 be really nice!
 Just as an idea for a future structure, the new admin UI is [actually based 
 on that 
 json-structure|https://github.com/steffkes/solr-admin/blob/master/logging.json]
  :)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2193) Re-architect Update Handler

2011-04-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017461#comment-13017461
 ] 

Mark Miller commented on SOLR-2193:
---

{quote} I wonder how this should work with autocommit?
Someone may want a soft/reopen autocommit once every X seconds, but still may 
want a hard "flush to stable storage in case I crash" commit at some other 
duration. {quote}

Right - I agree. How about another simple start? Simply add another 
commitTracker that does soft commits - then you can schedule a mix of soft and 
hard commits.

{quote}
The other thing that might be cool is a client-specified freshness per request. 
For example, when they pass in a query, they specify that they need data that's 
no more than 1 second old... and if it's too old that will trigger a reopen 
(and block that specific request until the new searcher can be used). The 
benefit here is that big bulk uploads won't be interrupted if there is no time 
sensitive query traffic. The downside is that a high latency may be exposed to 
those requests if they depend on stuff that can take a lot of time the first 
time (like faceting).
{quote}

Yeah, I remember you mentioning this before. I definitely think this would be 
cool, perhaps as a follow-on issue, though hopefully the effect on bulk updates 
will be minimized when Lucene takes care of the 'flush blocks the world' issue.

 Re-architect Update Handler
 ---

 Key: SOLR-2193
 URL: https://issues.apache.org/jira/browse/SOLR-2193
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.0

 Attachments: SOLR-2193.patch


 The update handler needs an overhaul.
 A few goals I think we might want to look at:
 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like 
 UpdateHandler, DefaultUpdateHandler
 2. Expose the SolrIndexWriter in the api or add the proper abstractions to 
 get done what we now do with special casing:
 if (directupdatehandler2)
   success
  else
   failish
 3. Stop closing the IndexWriter and start using commit (still lazy IW init 
 though).
 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level.
 5. Keep NRT support in mind.
 6. Keep microsharding in mind (maintain logical index as multiple physical 
 indexes)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Want to learn solr development

2011-04-08 Thread Deepak Singh
Hi,

I want to learn Solr development.

I have used Solr 1.4.1 with full-text search of PDF and DOC files, and
database search using the multicore feature of Solr.
I want to do development on Solr. How do I start? Please help.


[jira] [Updated] (SOLR-1922) DocBuilder onImportError/Abort EventListener

2011-04-08 Thread Robert Zotter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Zotter updated SOLR-1922:


Affects Version/s: 4.0
   3.1

 DocBuilder onImportError/Abort EventListener
 

 Key: SOLR-1922
 URL: https://issues.apache.org/jira/browse/SOLR-1922
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4, 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Trivial
  Labels: DIH, DataImportHandler, DocBuilder, EventListener
 Attachments: SOLR-1922.patch


 The onImportEnd EventListener only fires off after a successful import.
 It would be useful to know when an import fails via an onImportError/Abort 
 EventListener.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-1922) DocBuilder onImportError/Abort EventListener

2011-04-08 Thread Robert Zotter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Zotter updated SOLR-1922:


Affects Version/s: 3.1.1

 DocBuilder onImportError/Abort EventListener
 

 Key: SOLR-1922
 URL: https://issues.apache.org/jira/browse/SOLR-1922
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Affects Versions: 1.4, 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Trivial
  Labels: DIH, DataImportHandler, DocBuilder, EventListener
 Attachments: SOLR-1922.patch


 The onImportEnd EventListener only fires off after a successful import.
 It would be useful to know when an import fails via an onImportError/Abort 
 EventListener.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1888) Provide Option to Store Payloads on the Term Vector

2011-04-08 Thread Peter Wilkins (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017483#comment-13017483
 ] 

Peter Wilkins commented on LUCENE-1888:
---

As someone new to Lucene, with a specific problem to solve, it is difficult to 
identify the appropriate Lucene feature to use.  Reading various online posts, 
I see I'm not alone.  I have a use case that I think this JIRA issue addresses; 
perhaps it will help refine what the issue resolution would do.

I'm indexing a lecture video transcript.  I want to store the text of the 
transcript and the timecode at which each word occurs.  I want to search the 
text of the transcript and return the timecode so I can display the lecture 
video from that spot.
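
To make the use case concrete, here is roughly what I imagine doing (an 
untested sketch against the Lucene 3.x analysis API; the class and field 
names are mine):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.index.Payload;

    // Attaches a 4-byte timecode to each token as a payload.
    final class TimecodePayloadFilter extends TokenFilter {
      private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
      private final int[] timecodes;  // one entry per token, e.g. msec offsets
      private int pos = 0;

      TimecodePayloadFilter(TokenStream in, int[] timecodes) {
        super(in);
        this.timecodes = timecodes;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        int t = timecodes[pos++];
        payAtt.setPayload(new Payload(new byte[] {
            (byte) (t >>> 24), (byte) (t >>> 16), (byte) (t >>> 8), (byte) t }));
        return true;
      }
    }

What this issue would add, as I read it, is a way to get those payloads back 
in a document-centric way via the term vector instead of walking postings.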

 Provide Option to Store Payloads on the Term Vector
 ---

 Key: LUCENE-1888
 URL: https://issues.apache.org/jira/browse/LUCENE-1888
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 4.0


 Would be nice to have the option to access the payloads in a document-centric 
 way by adding them to the Term Vectors.  Naturally, this makes the Term 
 Vectors bigger, but it may be just what one needs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage

2011-04-08 Thread James Dyer (JIRA)
Using spellcheck.collate can result in extremely high memory usage
--

 Key: SOLR-2462
 URL: https://issues.apache.org/jira/browse/SOLR-2462
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 3.1, 4.0
Reporter: James Dyer
Priority: Critical


When using spellcheck.collate, class SpellPossibilityIterator creates a 
ranked list of *every* possible correction combination.  But if returning 
several corrections per term, and if several words are misspelled, the existing 
algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010.  However, it is triggered anytime 
spellcheck.collate is used.  It is not necessary to use any features that 
were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our 
Solr servers down with infinite GC loops.  It was pretty easy for this to 
happen, as occasionally a user will accidentally paste the URL into the Search 
box on our app.  This URL results in a search with ~12 misspelled words.  We 
have spellcheck.count set to 15. 
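(With spellcheck.count at 15 and ~12 misspelled words, that is up to 15^12, 
roughly 1.3 * 10^14, candidate combinations.)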

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage

2011-04-08 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2462:
-

Attachment: SOLR-2462.patch

This sets the maximum limit to 1000 possibilities.  When this limit is reached, 
the list is sorted by rank and reduced to the top 100.  From then on, only 
collations with a rank equal to or better than the 100th are added.  This 
process repeats until finished, or until it has taken 50ms, at which time it 
quits.

I also added a maxTimeAllowed setting of 50ms to the collation test queries 
as an additional performance safeguard.
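
In rough pseudo-Java the pruning looks like this (an illustrative sketch, 
not the actual patch code; the names are made up):

    // Hypothetical stand-in; the real class lives in the spellcheck code.
    class Collation implements Comparable<Collation> {
      int rank;  // lower is better
      public int compareTo(Collation o) { return rank - o.rank; }
    }

    List<Collation> candidates = new ArrayList<Collation>();
    int rankCutoff = Integer.MAX_VALUE;
    long deadline = System.currentTimeMillis() + 50;     // 50ms budget
    while (possibilities.hasNext()) {
      if (System.currentTimeMillis() > deadline) break;  // time safeguard
      Collation c = possibilities.next();
      if (c.rank > rankCutoff) continue;                 // not competitive
      candidates.add(c);
      if (candidates.size() >= 1000) {                   // limit reached:
        Collections.sort(candidates);                    // sort by rank,
        candidates = new ArrayList<Collation>(candidates.subList(0, 100));
        rankCutoff = candidates.get(99).rank;            // tighten the cutoff
      }
    }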

 Using spellcheck.collate can result in extremely high memory usage
 --

 Key: SOLR-2462
 URL: https://issues.apache.org/jira/browse/SOLR-2462
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 3.1, 4.0
Reporter: James Dyer
Priority: Critical
 Attachments: SOLR-2462.patch


 When using spellcheck.collate, class SpellPossibilityIterator creates a 
 ranked list of *every* possible correction combination.  But if returning 
 several corrections per term, and if several words are misspelled, the 
 existing algorithm uses a huge amount of memory.
 This bug was introduced with SOLR-2010.  However, it is triggered anytime 
 spellcheck.collate is used.  It is not necessary to use any features that 
 were added with SOLR-2010.
 We were in production with Solr for 1 1/2 days when this bug started taking 
 our Solr servers down with infinite GC loops.  It was pretty easy for this 
 to happen, as occasionally a user will accidentally paste the URL into the 
 Search box on our app.  This URL results in a search with ~12 misspelled 
 words.  We have spellcheck.count set to 15. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage

2011-04-08 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-2462:
--

Affects Version/s: (was: 4.0)
Fix Version/s: 4.0
   3.1.1

 Using spellcheck.collate can result in extremely high memory usage
 --

 Key: SOLR-2462
 URL: https://issues.apache.org/jira/browse/SOLR-2462
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 3.1
Reporter: James Dyer
Priority: Critical
 Fix For: 3.1.1, 4.0

 Attachments: SOLR-2462.patch


 When using spellcheck.collate, class SpellPossibilityIterator creates a 
 ranked list of *every* possible correction combination.  But if returning 
 several corrections per term, and if several words are misspelled, the 
 existing algorithm uses a huge amount of memory.
 This bug was introduced with SOLR-2010.  However, it is triggered anytime 
 spellcheck.collate is used.  It is not necessary to use any features that 
 were added with SOLR-2010.
 We were in production with Solr for 1 1/2 days when this bug started taking 
 our Solr servers down with infinite GC loops.  It was pretty easy for this 
 to happen, as occasionally a user will accidentally paste the URL into the 
 Search box on our app.  This URL results in a search with ~12 misspelled 
 words.  We have spellcheck.count set to 15. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [Lucene.Net] escaping single quotes when using query parser

2011-04-08 Thread Christopher Currens
I'm not sure your issue is related to single quotes.  The only characters
that need to be escaped for the QueryParser are + - && || ! ( ) { } [ ] ^ " ~
* ? : \  You can do that using QueryParser.Escape(string).  It's possible it
might be related to the analyzer that you're using.  In my experience,
using a different analyzer to index than you use to search can
*sometimes* cause unexpected behavior like this.  Since, to the best of my
knowledge, I haven't run into this exact problem myself, it's tough for me to
give a more specific answer without your code/test data.
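
For example (Java shown; Lucene.Net's QueryParser.Escape is the same idea):

    String raw = "O'Brien (beta)";                    // raw user input
    Query q = parser.parse(QueryParser.escape(raw));  // escape, then parse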

Thanks,
Christopher

On Thu, Apr 7, 2011 at 2:01 AM, Ben Foster b...@planetcloud.co.uk wrote:

 Hi,



 How should we escape single quotes when working with the query parser?



 Currently we have a description field that may contain single quotes.



 Whilst this field is correctly indexed, when we search the description no
 results are returned. I'm assuming it's because we need to replace the
 single quote in the search term with an escaped version.



 Many thanks,



 Ben Foster



 planetcloud
 The Elms, Hawton
 Newark-on-Trent
 Nottinghamshire
 NG24 3RL



  http://www.planetcloud.co.uk/ www.planetcloud.co.uk






[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs

2011-04-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017514#comment-13017514
 ] 

Simon Willnauer commented on LUCENE-2956:
-

FYI I have a working patch for this. It needs some cleanup, so I will hopefully 
upload it at the beginning of next week.

 Support updateDocument() with DWPTs
 ---

 Key: LUCENE-2956
 URL: https://issues.apache.org/jira/browse/LUCENE-2956
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch


 With separate DocumentsWriterPerThreads (DWPT) it can currently happen that 
 the delete part of an updateDocument() is flushed and committed separately 
 from the corresponding new document.
 We need to make sure that updateDocument() is always an atomic operation from 
 a IW.commit() and IW.getReader() perspective.  See LUCENE-2324 for more 
 details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2355) simple distrib update processor

2011-04-08 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-2355:
--

Fix Version/s: 4.0
   3.2

One thing we should probably address is the brittle cmd cloning. I don't like 
clone methods in general - but if we are going to do it in core code, better to 
put the clone in the cmd and be a bit less brittle.


 simple distrib update processor
 ---

 Key: SOLR-2355
 URL: https://issues.apache.org/jira/browse/SOLR-2355
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: DistributedUpdateProcessorFactory.java, 
 TestDistributedUpdate.java


 Here's a simple update processor for distributed indexing that I implemented 
 years ago.
 It implements a simple hash(id) MOD nservers and just fails if any servers 
 are down.
 Given the recent activity in distributed indexing, I thought this might be at 
 least a good source for ideas.
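
 The routing amounts to something like this (sketch):

     int shard = (id.hashCode() & 0x7fffffff) % nservers;  // mask keeps it non-negative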

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-04-08 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

This version fixes a bug involving the DIHCacheProcessor in the case of a 
many-to-[one|many] join between the parent entity and a child entity.  If the 
child entity used a DIHCacheProcessor and the same child joined to consecutive 
parents, only the first parent would join to the child.

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
  1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separate from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache, & two implementations:  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to Generic Usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
  2. Allow Entity Processors to take a cacheImpl parameter to cause the 
 entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface, DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of 
 DocBuilder.buildDocument().
   - Now it does one-time cleanup tasks (like closing or deleting a 
 disk-backed cache) once the entity processor is completed.
   - The only out-of-the-box entity processor that previously implemented 
 destroy() was LineEntityProcessor, so this is not a very invasive change.
 General Notes:
 We are near completion in converting our search functionality from a legacy 
 search engine to Solr.  However, I found that DIH did not support caching to 
 the level of our prior product's data import utility.  In order to get our 
 data into Solr, I created these caching enhancements.  Because I believe this 
 has 

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-04-08 Thread James Dyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017567#comment-13017567
 ] 

James Dyer commented on SOLR-2382:
--

In light of the recent discussion about the Spatial Contrib, I do wonder if 
seeking to get this committed is a non-starter because of its dependency on 
bdb-je.  I thought this wouldn't be an issue because we have an existing Lucene 
contrib (db) with this same dependency, but then I noticed that some of the 
committers regret the existence of the db contrib for this reason (and others).

In any case, even if the BerkleyBackedCache part of this patch could not be 
committed, having this framework in place so that developers can write their 
own persistent cache impls would be a major improvement in my opinion.  (I had 
originally started with a Lucene-backed cache but switched to bdb-je because I 
couldn't figure out how to achieve acceptable performance for gets from the 
cache).

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
 Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
  1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
  2. We needed the ability to gather data from long-running entities by a 
 process that runs separate from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache, & two implementations:  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to Generic Usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
  2. Allow Entity Processors to take a cacheImpl parameter to cause the 
 entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface, DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of 
 DocBuilder.buildDocument().
   - Now it does one-time cleanup tasks (like closing or deleting a 
 disk-backed cache) 

[HUDSON] Lucene-Solr-tests-only-3.x - Build # 6870 - Failure

2011-04-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/6870/

1 tests failed.
REGRESSION:  org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message:
Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at 
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
at java.lang.StringBuffer.append(StringBuffer.java:337)
at 
java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
at 
org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
at 
org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
at 
org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1082)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1010)




Build Log (for compile errors):
[...truncated 5265 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6880 - Failure

2011-04-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6880/

1 tests failed.
REGRESSION:  org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch

Error Message:
Severe errors in solr configuration.  Check your log files for more detailed 
information on what may be wrong.  
- 
org.apache.solr.common.cloud.ZooKeeperException:   at 
org.apache.solr.core.CoreContainer.register(CoreContainer.java:517)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:406)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:290)  at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:239)
  at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)  at 
org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)  at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)  at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)  
at 
org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
  at 
org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
  at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
  at 
org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)  
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:123)
  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:118)
  at 
org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:245)
  at 
org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:236)
  at 
org.apache.solr.cloud.AbstractDistributedZkTestCase.createServers(AbstractDistributedZkTestCase.java:64)
  at org.apache.solr.BaseDistributedSearch  Severe errors in solr 
configuration.  Check your log files for more detailed information on what may 
be wrong.  - 
org.apache.solr.common.cloud.ZooKeeperException:   at 
org.apache.solr.core.CoreContainer.register(CoreContainer.java:517)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:406)  at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:290)  at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:239)
  at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)  at 
org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)  at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)  at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)  
at 
org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
  at 
org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
  at 
org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
  at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
  at 
org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)  
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:123)
  at 
org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:118)
  at 
org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:245)
  at 
org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:236)
  at 
org.apache.solr.cloud.AbstractDistributedZkTestCase.createServers(AbstractDistributedZkTestCase.java:64)
  at org.apache.solr.BaseDistributedSearch  request: 
http://localhost:55435/solr/update?wt=javabin&version=2

Stack Trace:


request: http://localhost:55435/solr/update?wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at 
org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:111)
at 

(LUCENE-2793: Directory createOutput and openInput should take an IOContext, and LUCENE-2795: Genericize DirectIOLinuxDir to UnixDir) as GSoC

2011-04-08 Thread Varun Thacker
I'm moving this discussion to the thread as suggested by the Lucene mentors.

This is my final proposal link:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/varunthacker1989/1

So far I have implemented the following to get a better understanding of
my project:

I wrote sample code to test the speed difference between SEQUENTIAL and
O_DIRECT reads (I used the madvise flag MADV_DONTNEED). The code is here:
http://pastebin.com/8QywKGyS. There was a noticeable speed difference when
I switched between the two flags. I did not use the O_DIRECT flag itself
because Linus Torvalds had criticized it.

This blog post by Michael McCandless
(http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html)
helped me understand the problem that LUCENE-2793 addresses.
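
To make the LUCENE-2793 half concrete, here is a rough sketch of the kind
of hint I understand the issue to be asking for; every name below is an
assumption for illustration only, not the actual patch:

public enum IOContextHint { READ, MERGE, FLUSH, DEFAULT }

public abstract class HintingDirectory {
  // With a MERGE hint, an implementation could madvise()/fadvise() the
  // pages it streams (as in the blog post above) so a large merge does
  // not evict the hot search-time pages from the OS cache.
  public abstract java.io.InputStream openInput(String name, IOContextHint hint)
      throws java.io.IOException;
}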

I am currently reading the Lucene documentation, taking into account a
few pointers provided by Simon Willnauer.

I want to keep everyone updated and also take into consideration the
comments made by members of this community, which will help me
understand and implement these tasks.

-- 


Regards,
Varun Thacker
http://varunthacker.wordpress.com


Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure

2011-04-08 Thread Michael McCandless
On Fri, Apr 8, 2011 at 2:36 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : OOME on this one... we really need the dump heap on OOME JRE command
 : line set...

 Ant runs the tests in forked JVMs right? so that should just be a
 build.xml change.

OK I tried that, with the patch below (applies to 3.x), and then
provoked an OOME and it works great, though I think this is Sun
(Oracle!) JRE specific... which is OK for now (we use Oracle JRE on
Jenkins right?), but if we want to rotate JREs in the future this
won't work...

The problem is... the resulting dump is large (mine was ~400 MB).  We
can specify a location for the dump (-XX:HeapDumpPath=/some/path)... I
think we should somehow remove them after a few days?  How much disk
space can we use up?

Patch:

Index: solr/build.xml
===================================================================
--- solr/build.xml  (revision 1089906)
+++ solr/build.xml  (working copy)
@@ -464,6 +464,7 @@
   <jvmarg line="${dir.prop}"/>
   -->
   <jvmarg line="${args}"/>
+  <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>

   <formatter classname="${junit.details.formatter}"
              usefile="false" if="junit.details"/>
   <classpath refid="test.run.classpath"/>
Index: lucene/common-build.xml
===================================================================
--- lucene/common-build.xml (revision 1089906)
+++ lucene/common-build.xml (working copy)
@@ -488,6 +488,7 @@
 </assertions>

 <jvmarg line="${args}"/>
+ <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError"/>

 <!-- allow tests to control debug prints -->
 <sysproperty key="tests.verbose" value="${tests.verbose}"/>
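
For the disk-space question, the dump location can be pinned the same way;
a sketch only (${tests.heapdump.dir} is a made-up property, not something
that exists in the build):

  <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${tests.heapdump.dir}"/>

A periodic cleanup of that directory (cron, or an ant delete task) would
then cap the disk usage.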

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context

2011-04-08 Thread Robert Zotter (JIRA)
Using an evaluator outside the scope of an entity results in a null context
---

 Key: SOLR-2463
 URL: https://issues.apache.org/jira/browse/SOLR-2463
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor


When using an Evaluator outside an entity element the Context argument is null.

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}

<dataSource name="..."
            type="..."
            driver="..."
            url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."
            batchSize="..."/>

<entity name="..."
        dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
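
Until this is fixed, a minimal defensive sketch (the null guard is only an
illustration added here, not part of the report; it assumes the same imports
as the evaluator above, and it only makes the failure legible rather than
making the use case work):

public class SafeLowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    // SOLR-2463: outside an entity, DIH hands us no Context; fail fast
    // with a clear message instead of an opaque NullPointerException.
    if (context == null) {
      throw new RuntimeException("'toLowerCase' used outside an entity: no Context available");
    }
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}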

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context

2011-04-08 Thread Robert Zotter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Zotter updated SOLR-2463:


Description: 
When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..."
            type="..."
            driver="..."
            url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..."
        dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

  was:
When using an Evaluator outside an entity element the Context argument is null.

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}

<dataSource name="..."
            type="..."
            driver="..."
            url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."
            batchSize="..."/>

<entity name="..."
        dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>


 Using an evaluator outside the scope of an entity results in a null context
 ---

 Key: SOLR-2463
 URL: https://issues.apache.org/jira/browse/SOLR-2463
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

 When using an Evaluator outside an entity element the Context argument is 
 null.
 {code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
 public class LowerCaseFunctionEvaluator extends Evaluator {
   public String evaluate(String expression, Context context) {
     List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
     if (l.size() != 1) {
       throw new RuntimeException("'toLowerCase' must have only one parameter");
     }
     return l.get(0).toString().toLowerCase();
   }
 }
 {code}
 {code:title=data-config.xml|borderStyle=solid}
 <dataSource name="..."
             type="..."
             driver="..."
             url="..."
             user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
             password="..."/>
 {code}
 {code:title=data-config.xml|borderStyle=solid}
 <entity name="..."
         dataSource="..."
         query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
 {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context

2011-04-08 Thread Robert Zotter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Zotter updated SOLR-2463:


Description: 
When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..."
            type="..."
            driver="..."
            url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..."
        dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

  was:
When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..."
            type="..."
            driver="..."
            url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..."
        dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}


 Using an evaluator outside the scope of an entity results in a null context
 ---

 Key: SOLR-2463
 URL: https://issues.apache.org/jira/browse/SOLR-2463
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

 When using an Evaluator outside an entity element the Context argument is 
 null.
 {code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
 public class LowerCaseFunctionEvaluator extends Evaluator {
   public String evaluate(String expression, Context context) {
     List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
     if (l.size() != 1) {
       throw new RuntimeException("'toLowerCase' must have only one parameter");
     }
     return l.get(0).toString().toLowerCase();
   }
 }
 {code}
 {code:title=data-config.xml|borderStyle=solid}
 <dataSource name="..."
             type="..."
             driver="..."
             url="..."
             user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
             password="..."/>
 {code}
 {code:title=data-config.xml|borderStyle=solid}
 <entity name="..."
         dataSource="..."
         query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
 {code}
 This use case worked in 1.4

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context

2011-04-08 Thread Robert Zotter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Zotter updated SOLR-2463:


Fix Version/s: 3.1.1

 Using an evaluator outside the scope of an entity results in a null context
 ---

 Key: SOLR-2463
 URL: https://issues.apache.org/jira/browse/SOLR-2463
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor
 Fix For: 3.1.1


 When using an Evaluator outside an entity element the Context argument is 
 null.
 {code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
 public class LowerCaseFunctionEvaluator extends Evaluator {
   public String evaluate(String expression, Context context) {
     List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
     if (l.size() != 1) {
       throw new RuntimeException("'toLowerCase' must have only one parameter");
     }
     return l.get(0).toString().toLowerCase();
   }
 }
 {code}
 {code:title=data-config.xml|borderStyle=solid}
 <dataSource name="..."
             type="..."
             driver="..."
             url="..."
             user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
             password="..."/>
 {code}
 {code:title=data-config.xml|borderStyle=solid}
 <entity name="..."
         dataSource="..."
         query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
 {code}
 This use case worked in 1.4

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [HUDSON] Lucene-trunk - Build # 1523 - Failure

2011-04-08 Thread Steven A Rowe
AFAICT, these (Oracle/Sun-only) JVM parameters were introduced in 1.6 (that's
what this parameter list says:
http://blogs.sun.com/watt/resource/jvm-options-list.html). We tell Jenkins to
use 1.6 for Lucene/Solr testing, so this isn't an issue in practice, I guess.

Hopefully 1.5 JVMs, and non-Oracle/Sun JVMs, won't choke on these unknown 
parameters.
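
If a JVM ever does choke on them, one (untested) sketch of a guard in ant
would be a condition on the VM vendor; tests.heapdump.args is a made-up
property name, not something in the build:

  <condition property="tests.heapdump.args"
             value="-XX:+HeapDumpOnOutOfMemoryError" else="">
    <contains string="${java.vm.vendor}" substring="Sun"/>
  </condition>
  ...
  <jvmarg line="${tests.heapdump.args}"/>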

Steve

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, April 08, 2011 3:55 PM
 To: dev@lucene.apache.org
 Cc: Chris Hostetter; Apache Hudson Server
 Subject: Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure
 
 On Fri, Apr 8, 2011 at 2:36 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 
  : OOME on this one... we really need the dump heap on OOME JRE command
  : line set...
 
  Ant runs the tests in forked JVMs right? so that should just be a
  build.xml change.
 
 OK I tried that, with the patch below (applies to 3.x), and then
 provoked an OOME and it works great, though I think this is Sun
 (Oracle!) JRE specific... which is OK for now (we use Oracle JRE on
 Jenkins right?), but if we want to rotate JREs in the future this
 won't work...
 
 The problem is... the resulting dump is large (mine was ~400 MB).  We
 can specify a location for the dump (-XX:HeapDumpPath=/some/path)... I
 think we should somehow remove them after a few days?  How much disk
 space can we use up?
 
 Patch:
 
 Index: solr/build.xml
 ===================================================================
 --- solr/build.xml (revision 1089906)
 +++ solr/build.xml (working copy)
 @@ -464,6 +464,7 @@
    <jvmarg line="${dir.prop}"/>
    -->
    <jvmarg line="${args}"/>
 +  <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>
 
    <formatter classname="${junit.details.formatter}"
               usefile="false" if="junit.details"/>
    <classpath refid="test.run.classpath"/>
 Index: lucene/common-build.xml
 ===================================================================
 --- lucene/common-build.xml (revision 1089906)
 +++ lucene/common-build.xml (working copy)
 @@ -488,6 +488,7 @@
  </assertions>
 
  <jvmarg line="${args}"/>
 + <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError"/>
 
  <!-- allow tests to control debug prints -->
  <sysproperty key="tests.verbose" value="${tests.verbose}"/>
 
 Mike
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-04-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017679#comment-13017679
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

I'm wondering whether there is a limitation that prevents us from randomly
accessing the doc values through the underlying Directory implementation,
rather than needing to load all the values into the main heap space. This
seems doable; if so, let me know and I can provide a patch.
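
Concretely, the kind of split I have in mind (hypothetical names, invented
just for illustration, not the actual LUCENE-2186 classes); the
get()-per-docID reading style described in the issue below is agnostic to
where the bytes actually live:

interface LongDocValues {
  long get(int docID); // dense: defined for every docID in the segment
}

final class RamLongDocValues implements LongDocValues {
  private final long[] values;           // fully heap-resident, like FieldCache
  RamLongDocValues(long[] values) { this.values = values; }
  public long get(int docID) { return values[docID]; }
}

final class MMapLongDocValues implements LongDocValues {
  private final java.nio.LongBuffer buf; // off-heap pages, cached by the OS
  MMapLongDocValues(java.nio.LongBuffer buf) { this.buf = buf; }
  public long get(int docID) { return buf.get(docID); }
}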

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: CSF branch, 4.0

 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, 
 LUCENE-2186.patch, LUCENE-2186.patch, mem.py


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage & GC load of
 this index values API should be much better than field cache, since
 it does not create an object per document (instead it shares big long[] and
 byte[] arrays across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



some questions about how Lucene works internally

2011-04-08 Thread Yang
(Sorry for cross-posting here, but it seems this question about the
internal mechanisms of Lucene cannot be answered on the user@ list,
so I'm asking for more expert knowledge here. Thanks a lot.)

===


I'm new to Lucene and search engines, and have been struggling with these
questions recently.
I'd appreciate it a lot if you could shed some light on them.


Let's say I do a query on

dog greyhound

Note that I did not quote them, i.e. this is not a phrase search.

What happens under the hood?

Which term does Lucene use to look up the inverted index?
I read somewhere that Lucene uses the term with the higher IDF (i.e.
the more distinguishing term), in this case greyhound. But what about
dog? Does Lucene traverse the doclist of dog at all? If I provide
multiple terms in my query, how does Lucene generally decide how many
doclists to walk down?


I read that Lucene uses a combination of the Boolean model and the VSM.
It seems that in the above case it finds the full doclist of dog and
that of greyhound (the Boolean-model part), then finds the common docs
from the two doclists, and then orders them by score (the VSM part).
Is it true that the FULL doclists are fetched first, or is some pruning
done on the individual doclists? The talk at
http://www.slideshare.net/abial/eurocon2010 mentions pruning and tiered
search, but is that the default behavior of Lucene? And how are the
doclists sorted? By IDF? (Sorry, I'm just beginning to sift through a
lot of docs online; I somehow got this impression but can't form a
precise conclusion.)
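
For concreteness, this is roughly what such a two-term query looks like
through the Lucene 3.x API (the field name "body" is just an example);
both clauses get their own postings iterators, and the scorer advances
them together rather than keeping only the rarer term:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class TwoTermQueryDemo {
  // An unquoted "dog greyhound" query is a BooleanQuery with two optional
  // (SHOULD) TermQuery clauses; docs matching either term can score.
  public static TopDocs run(IndexSearcher searcher) throws java.io.IOException {
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("body", "dog")), BooleanClause.Occur.SHOULD);
    bq.add(new TermQuery(new Term("body", "greyhound")), BooleanClause.Occur.SHOULD);
    return searcher.search(bq, 10); // only the top 10 by score are kept
  }
}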



Also, more generally, could you please suggest some good articles on how
Lucene and search engines work? I've read "The Anatomy of a Search Engine"
(the Google paper by Sergey Brin & Larry Page), "Introduction to
Information Retrieval" (Manning et al.), and "Lucene in Action".


Thanks
Yang

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6894 - Failure

2011-04-08 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6894/

1 tests failed.
REGRESSION:  org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration

Error Message:
expected:<2> but was:<3>

Stack Trace:
junit.framework.AssertionFailedError: expected:<2> but was:<3>
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
at 
org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:208)




Build Log (for compile errors):
[...truncated 8828 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org