Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant
On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:

On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: -1. These files should be readable, for maintaining, debugging and
: knowing what's going on.

Readability is my main concern ... I don't know (and frequently can't tell) the difference between a lot of non-ASCII characters -- and I'm guessing I'm not alone. When it's spelled out explicitly using the character name or escape code, there is no ambiguity about what character was intended, or whether it got screwed up by some tool along the way (i.e. the svn server, an svn client, the patch command, a text editor, an IDE, ant's fixcrlf task, etc...)

Please take the time, just 5 or 10 minutes, to look through some of this source code and tests. Imagine if you couldn't just look at the code to see what it does, but had to decode it from some crazy numeric encoding scheme. Imagine if it were this way for things like stopword lists too. It would be basically impossible for you to look at the code and figure out what it does! For example, try looking at the Thai analyzer tests; if these were all numbers, how would you know wtf is going on?

Although this comes up from time to time, I stand firm on my -1 because it's important to me for the source code to be readable. I'm not willing to give this up just because some people cannot read writing system XYZ. I have said before, I'm willing to change my -1 vote on this if *ALL* string constants (including English ones) are changed to character escapes. If you imagine what the code would look like if English string constants were instead codes, then I think you will understand my point of view! It's really, really important to source code readability to be able to open a file and understand what it does, not to have to use some decoder because it uses characters other people don't understand.

I think having both raw characters /and/ the encoded representation is best? (one of them in comments)

I'm all for Unicode sources, but at least two things hit me repeatedly:

1. Tools do screw up, and you have to recover somehow. E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding (MacRoman in my case) and I've lost some text on things I shelved but never committed anywhere.

2. There are characters that look all the same, e.g. different whitespace/dashes. Or (if you have Cyrillic in your fonts), I dare you to discern between a/а, c/с, e/е, o/о. These are different characters from the Latin and Cyrillic charsets (left Latin/right Cyrillic), but in 99% of fonts they are visually identical. I had a filter that folded up similar-looking characters, and it was documented in exactly this way - raw char + code.

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
Lucene Search with MoreLikeThis
Hi, I need some ideas about Lucene MoreLikeThis. I want to search records conditioned on more than one field name using MoreLikeThis. Right now I have code like this:

IndexReader indexreader = IndexReader.open(PropertyManager.getStringProperty("FAW.Lucene.index_path"));
IndexSearcher searcher = new IndexSearcher(indexreader);
MoreLikeThis mlt = new MoreLikeThis(indexreader);
mlt.setMinDocFreq(0);
mlt.setMinTermFreq(0);
mlt.setFieldNames(new String[]{"serviceNumber", "env"});
Query query = mlt.like(hits.id(0)); // configure mlt before building the query
Hits list = searcher.search(query);
Iterator itr1 = list.iterator();
while (itr1.hasNext()) { ... }

But it fetches all records that match on either serviceNumber or env. I need the records where both fields match. For example, with data like:

serviceNumber | env  | id
1             | env1 | 2
1             | env2 | 3
2             | env1 | 4
1             | env1 | 5

I want to fetch:

serviceNumber | env  | id
1             | env1 | 2
1             | env1 | 5

This is my requirement. Can anyone help me, or share any ideas? Thanks in advance.

-- View this message in context: http://lucene.472066.n3.nabble.com/Lucene-Search-with-MoreLiktThis-tp2794419p2794419.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
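One possible direction (an untested sketch against the same deprecated Hits API used above; field names follow the example data): MoreLikeThis ORs the interesting terms from all configured fields together, so to get only records where *both* fields match the source document, you can AND one required clause per field:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.TermQuery;

// Take the field values from the source document...
int docId = hits.id(0);
Document src = searcher.doc(docId);
// ...and require BOTH of them to match:
BooleanQuery both = new BooleanQuery();
both.add(new TermQuery(new Term("serviceNumber", src.get("serviceNumber"))), BooleanClause.Occur.MUST);
both.add(new TermQuery(new Term("env", src.get("env"))), BooleanClause.Occur.MUST);
Hits common = searcher.search(both);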
Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure
OOME on this one... we really need the dump-heap-on-OOME JRE command-line option set...

Mike
http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 10:34 PM, Apache Hudson Server hud...@hudson.apache.org wrote:

Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1523/

1 tests failed.

REGRESSION: org.apache.lucene.index.TestNRTThreads.testNRTThreads

Error Message: Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:521)

Build Log (for compile errors): [...truncated 11839 lines...]
Re: Google Summer of Code 2011 participation
Anyone can participate in Lucene/Solr! You don't need to be a GSoC student to do so... Browse the issues in Jira (maybe focusing on the ones marked for GSoC and not already taken), or open your own issues, discuss, post patches, iterate, etc. Find your itch and scratch it ;) And there are a great many itches out there that need scratching...

Mike
http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 9:34 PM, Minh Doan daywed...@gmail.com wrote:

Hi folks, Having received a bunch of emails recently about GSoC, I really want to join, but it seems I'm not eligible to, even though I used to be a PhD student and am currently on leave (I will probably be back soon). I really want to contribute to Lucene to implement some of my ideas. Can I have a Lucene mentor, like those expert mentors who are excited about GSoC?

Best, Minh

On Tue, Apr 5, 2011 at 7:06 AM, Steven A Rowe sar...@syr.edu wrote:

Hi Jayendra,

From http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#who:

In order to participate in the program, you must be a student. Google defines a student as an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs. You are eligible to apply if you are enrolled in an accredited university educational program provided you meet all of the other eligibility requirements. You should be prepared, upon request, to provide Google with transcripts or other documentation from your accredited institution as proof of enrollment or admission status. Computer Science does not need to be your field of study in order to participate in the program. You may be enrolled as a full-time or part-time student. You must also be eligible to work in the country in which you'll reside throughout the duration of the program, e.g. if you are in the United States on an F-1 visa, you are welcome to apply to Google Summer of Code as long as you have U.S. work authorization. For F-1 students applying for CPT, Google will furnish you with a letter you can provide to your university to get CPT established once your application to the program has been accepted.

-Original Message-
From: Jayendra Patil [mailto:jayendra.patil@gmail.com]
Sent: Tuesday, April 05, 2011 9:56 AM
To: dev@lucene.apache.org
Subject: Google Summer of Code 2011 participation

Hi, Does Google Summer of Code 2011 apply only to students? I have been working on Solr for quite some time now and would like to start contributing back. I have been using it to index structured and unstructured data and have a fair bit of knowledge of the internals as well (I have a few JIRAs and patches submitted). I don't have a specific proposal in mind yet, but would like to start with any specific area or issues. Let me know if and how I can participate.

Regards, Jayendra
Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant
On Fri, Apr 8, 2011 at 2:49 AM, Earwin Burrfoot ear...@gmail.com wrote:

On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:

On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: -1. These files should be readable, for maintaining, debugging and
: knowing what's going on.

Readability is my main concern ... I don't know (and frequently can't tell) the difference between a lot of non-ASCII characters -- and I'm guessing I'm not alone. When it's spelled out explicitly using the character name or escape code, there is no ambiguity about what character was intended, or whether it got screwed up by some tool along the way (i.e. the svn server, an svn client, the patch command, a text editor, an IDE, ant's fixcrlf task, etc...)

Please take the time, just 5 or 10 minutes, to look through some of this source code and tests. Imagine if you couldn't just look at the code to see what it does, but had to decode it from some crazy numeric encoding scheme. Imagine if it were this way for things like stopword lists too. It would be basically impossible for you to look at the code and figure out what it does! For example, try looking at the Thai analyzer tests; if these were all numbers, how would you know wtf is going on?

Although this comes up from time to time, I stand firm on my -1 because it's important to me for the source code to be readable. I'm not willing to give this up just because some people cannot read writing system XYZ. I have said before, I'm willing to change my -1 vote on this if *ALL* string constants (including English ones) are changed to character escapes. If you imagine what the code would look like if English string constants were instead codes, then I think you will understand my point of view! It's really, really important to source code readability to be able to open a file and understand what it does, not to have to use some decoder because it uses characters other people don't understand.

I think having both raw characters /and/ the encoded representation is best? (one of them in comments)

I'm all for Unicode sources, but at least two things hit me repeatedly:

1. Tools do screw up, and you have to recover somehow. E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding (MacRoman in my case) and I've lost some text on things I shelved but never committed anywhere.

2. There are characters that look all the same, e.g. different whitespace/dashes. Or (if you have Cyrillic in your fonts), I dare you to discern between a/а, c/с, e/е, o/о. These are different characters from the Latin and Cyrillic charsets (left Latin/right Cyrillic), but in 99% of fonts they are visually identical. I had a filter that folded up similar-looking characters, and it was documented in exactly this way - raw char + code.

I've worked with a lot of characters in Eclipse, and the ones that confuse my eyes the most are l/1 and O/0. So again: if we do this, then we must do it for all English text, too.
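For readers following the thread, here is what the "raw char + code" documentation style might look like in Java. This is a hypothetical sketch, not code from Lucene; the folding pairs come from Earwin's Cyrillic/Latin example:

/** Illustrative sketch (not from Lucene): document confusable characters
 *  with both the raw glyph and its Unicode escape, as suggested above. */
public class ConfusableFolding {
  /** Fold Cyrillic look-alikes onto their Latin twins: а->a, с->c, е->e, о->o. */
  public static char fold(char c) {
    switch (c) {
      case '\u0430': return 'a'; // 'а' CYRILLIC SMALL LETTER A
      case '\u0441': return 'c'; // 'с' CYRILLIC SMALL LETTER ES
      case '\u0435': return 'e'; // 'е' CYRILLIC SMALL LETTER IE
      case '\u043E': return 'o'; // 'о' CYRILLIC SMALL LETTER O
      default: return c;
    }
  }
}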
Re: My GSOC proposal
I have refined my proposal here: http://goo.gl/uYXrV Are there any suggestions for which I should update my proposal before today's deadline?

On Thu, Apr 7, 2011 at 9:28 AM, Varun Thacker varunthacker1...@gmail.com wrote:

I have updated my proposal online to mention the time I would be able to dedicate to the project.

On Thu, Apr 7, 2011 at 7:05 AM, Adriano Crestani adrianocrest...@gmail.com wrote:

Hi Varun, Nice proposal, very complete. Only one thing is missing: you should mention somewhere how many hours a week you are willing to spend working on the project, and whether there are any holidays when you won't be able to work. Good luck ;)

On Wed, Apr 6, 2011 at 5:57 PM, Varun Thacker varunthacker1...@gmail.com wrote:

I have drafted the proposal on the official GSoC website. This is the link to my proposal: http://goo.gl/uYXrV . Please do let me know if anything needs to be changed, added, or removed. I will keep working on it until the deadline on the 8th.

On Wed, Apr 6, 2011 at 11:41 PM, Michael McCandless luc...@mikemccandless.com wrote:

That test code looks good -- you really should have seen awful performance had you used O_DIRECT, since you read byte by byte. A more realistic test is to read a whole buffer (e.g. 4 KB is what Lucene now uses during merging, but we'd probably up this to something like 1 MB when using O_DIRECT).

Linus does hate O_DIRECT (see http://kerneltrap.org/node/7563), and for good reason: its existence means projects like ours can use it to work around limitations in the Linux IO APIs that control the buffer cache when, otherwise, we might conceivably make patches to fix Linux correctly. It's an escape hatch, and we all use the escape hatch instead of trying to fix Linux for real... For example, the NOREUSE flag is a no-op now in Linux, which is a shame, because that's precisely the flag we'd want to use for merging (along with SEQUENTIAL). Had that flag been implemented well, it'd give better results than our workaround using O_DIRECT.

Anyway, given how things are, until we can get more control (way up in Javaland) over the buffer cache, O_DIRECT (via a native directory impl through JNI) is our only real option today. More details here: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html

Note that other OSs likely do a better job and actually implement NOREUSE and similar APIs, so the generic Unix/WindowsNativeDirectory would simply use NOREUSE on those platforms for I/O during segment merging.

Mike
http://blog.mikemccandless.com

On Wed, Apr 6, 2011 at 11:56 AM, Varun Thacker varunthacker1...@gmail.com wrote:

Hi. I wrote sample code to test the speed difference between SEQUENTIAL and O_DIRECT reads (I used the madvise flag MADV_DONTNEED). This is the link to the code: http://pastebin.com/8QywKGyS There was a speed difference when I switched between the two flags. I have not used the O_DIRECT flag itself because Linus had criticized it. Is this what the flags are intended to be used for? This is just sample code with a test file.

On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer simon.willna...@googlemail.com wrote:

Hey Varun,

On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless luc...@mikemccandless.com wrote: Hi Varun, Those two issues would make a great GSoC! Comments below...
+1

On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker varunthacker1...@gmail.com wrote:

I would like to combine two tasks as part of my project, namely "Directory createOutput and openInput should take an IOContext" (LUCENE-2793), and complement it with "Generalize DirectIOLinuxDir to UnixDir" (LUCENE-2795). The first part of the project is aimed at significantly reducing the time taken to search during indexing, by adding an IOContext that would store the buffer size and have options to bypass the OS's buffer cache (this is what causes the slowdown in search) and other hints. Once completed, I would move on to LUCENE-2795 and generalize the Directory implementation to make a UnixDirectory.

So, the first part (LUCENE-2793) should cause no change at all to performance, functionality, etc., because it's merely installing the plumbing (IOContext threaded throughout the low-level store APIs in Lucene) so that higher levels can send important details down to the Directory. We'd fix IndexWriter/IndexReader to fill out this IOContext with the details (merging, flushing, new reader, etc.). There's some fun/freedom here in figuring out just what details should be included in IOContext... (e.g., is it low level "set buffer size to 4 KB", or is it high level "I am opening a new near-real-time reader"?). This first step is a rote cutover, just changing APIs but in no way taking advantage of the new APIs. The 2nd step (LUCENE-2795) would then take advantage of this plumbing, by creating a UnixDir impl that, using JNI (C code), passes advanced
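For context, here is one hypothetical shape the LUCENE-2793 plumbing could take. All names below are illustrative, since the real API was still under discussion at this point:

// Hypothetical sketch of the IOContext plumbing described above.
public class IOContext {
  public enum Context { FLUSH, MERGE, READ, DEFAULT }

  public final Context context;
  public final boolean sequential; // hint: access will be sequential
  public final boolean noCache;    // hint: avoid polluting the OS buffer cache

  public IOContext(Context context, boolean sequential, boolean noCache) {
    this.context = context;
    this.sequential = sequential;
    this.noCache = noCache;
  }
}

// Directory methods would then accept it, e.g.:
//   IndexOutput out = dir.createOutput("_1.frq",
//       new IOContext(IOContext.Context.MERGE, true, true));
// letting a native UnixDirectory pick O_DIRECT / fadvise flags per call.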
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6867 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6867/

1 tests failed.

REGRESSION: org.apache.solr.cloud.ZkControllerTest.testUploadToCloud

Error Message: KeeperErrorCode = ConnectionLoss for /configs/config1/schema-reversed.xml

Stack Trace:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /configs/config1/schema-reversed.xml
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:224)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:388)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:410)
at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:520)
at org.apache.solr.cloud.ZkControllerTest.testUploadToCloud(ZkControllerTest.java:191)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)

Build Log (for compile errors): [...truncated 9073 lines...]
Re: Google Summer of Code 2011 participation
On Fri, Apr 8, 2011 at 12:11 PM, Michael McCandless luc...@mikemccandless.com wrote:

Anyone can participate in Lucene/Solr! You don't need to be a GSoC student to do so... Browse the issues in Jira (maybe focusing on the ones marked for GSoC and not already taken), or open your own issues, discuss, post patches, iterate, etc. Find your itch and scratch it ;)

+1 — we are all around and will jump on the issue to guide you. Find one, ask questions if you have any, and start discussions / coding!

simon

And there are a great many itches out there that need scratching...

Mike
http://blog.mikemccandless.com

On Thu, Apr 7, 2011 at 9:34 PM, Minh Doan daywed...@gmail.com wrote:

Hi folks, Having received a bunch of emails recently about GSoC, I really want to join, but it seems I'm not eligible to, even though I used to be a PhD student and am currently on leave (I will probably be back soon). I really want to contribute to Lucene to implement some of my ideas. Can I have a Lucene mentor, like those expert mentors who are excited about GSoC?

Best, Minh

On Tue, Apr 5, 2011 at 7:06 AM, Steven A Rowe sar...@syr.edu wrote:

Hi Jayendra,

From http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#who:

In order to participate in the program, you must be a student. Google defines a student as an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs. You are eligible to apply if you are enrolled in an accredited university educational program provided you meet all of the other eligibility requirements. You should be prepared, upon request, to provide Google with transcripts or other documentation from your accredited institution as proof of enrollment or admission status. Computer Science does not need to be your field of study in order to participate in the program. You may be enrolled as a full-time or part-time student. You must also be eligible to work in the country in which you'll reside throughout the duration of the program, e.g. if you are in the United States on an F-1 visa, you are welcome to apply to Google Summer of Code as long as you have U.S. work authorization. For F-1 students applying for CPT, Google will furnish you with a letter you can provide to your university to get CPT established once your application to the program has been accepted.

-Original Message-
From: Jayendra Patil [mailto:jayendra.patil@gmail.com]
Sent: Tuesday, April 05, 2011 9:56 AM
To: dev@lucene.apache.org
Subject: Google Summer of Code 2011 participation

Hi, Does Google Summer of Code 2011 apply only to students? I have been working on Solr for quite some time now and would like to start contributing back. I have been using it to index structured and unstructured data and have a fair bit of knowledge of the internals as well (I have a few JIRAs and patches submitted). I don't have a specific proposal in mind yet, but would like to start with any specific area or issues. Let me know if and how I can participate.

Regards, Jayendra

--
--- Minh
[jira] [Updated] (SOLR-2459) LogLevelSelection Servlet outputs plain HTML
[ https://issues.apache.org/jira/browse/SOLR-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Matheis (steffkes) updated SOLR-2459:
--------------------------------------------

Description:
The currently available output of the LogLevelSelection servlet is plain HTML, which makes it impossible to integrate the logging information into the new Admin UI. Format-agnostic output (like every [?] other servlet offers) would be really nice! Just as an idea for a future structure, the new admin UI is [actually based on that json-structure|https://github.com/steffkes/solr-admin/blob/master/logging.json] :)

was:
The currently available output of the LogLevelSelection servlet is plain HTML, which makes it impossible to integrate the logging information into the new Admin UI. Format-agnostic output (like every [?] other servlet offers) would be really nice! Just as an idea for a future structure, the new admin UI is [https://github.com/steffkes/solr-admin/blob/master/logging.json|actually based on that json-structure] :)

LogLevelSelection Servlet outputs plain HTML
--------------------------------------------
Key: SOLR-2459
URL: https://issues.apache.org/jira/browse/SOLR-2459
Project: Solr
Issue Type: Wish
Components: web gui
Reporter: Stefan Matheis (steffkes)
Priority: Trivial

The currently available output of the LogLevelSelection servlet is plain HTML, which makes it impossible to integrate the logging information into the new Admin UI. Format-agnostic output (like every [?] other servlet offers) would be really nice! Just as an idea for a future structure, the new admin UI is [actually based on that json-structure|https://github.com/steffkes/solr-admin/blob/master/logging.json] :)

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017461#comment-13017461 ]

Mark Miller commented on SOLR-2193:
-----------------------------------

{quote}
I wonder how this should work with autocommit? Someone may want a soft/reopen autocommit once every x seconds, but still may want a hard "flush to stable storage in case I crash" commit at some other duration.
{quote}

Right - I agree. How about another simple start? Simply add another commitTracker that does soft commits - then you can schedule a mix of soft and hard commits.

{quote}
The other thing that might be cool is a client-specified freshness per request. For example, when they pass in a query, they specify that they need data that's no more than 1 second old... and if it's too old, that will trigger a reopen (and block that specific request until the new searcher can be used). The benefit here is that big bulk uploads won't be interrupted if there is no time-sensitive query traffic. The downside is that high latency may be exposed to those requests if they depend on stuff that can take a lot of time the first time (like faceting).
{quote}

Yeah - I remember you mentioning this before - I definitely think this would be cool - perhaps as a follow-on issue - though hopefully the effect on bulk updates will be minimized once Lucene takes care of the 'flush blocks the world' issue.

Re-architect Update Handler
---------------------------
Key: SOLR-2193
URL: https://issues.apache.org/jira/browse/SOLR-2193
Project: Solr
Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Fix For: 4.0
Attachments: SOLR-2193.patch

The update handler needs an overhaul. A few goals I think we might want to look at:
1. Cleanup - drop the DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler
2. Expose the SolrIndexWriter in the API, or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish
3. Stop closing the IndexWriter and start using commit (still lazy IW init though).
4. Drop the iwAccess/iwCommit locks and sync mostly at the Lucene level.
5. Keep NRT support in mind.
6. Keep microsharding in mind (maintain a logical index as multiple physical indexes)

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
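To make the "mix of soft and hard commits" idea concrete, here is a minimal scheduling sketch; softCommit()/hardCommit() are stand-ins for whatever the reworked update handler will expose, so treat this as an illustration of the scheduling only, not Solr code:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CommitSchedulerSketch {
  // Stand-ins for whatever the reworked update handler will expose:
  static void softCommit() { /* reopen searcher; cheap, not durable */ }
  static void hardCommit() { /* flush + fsync to stable storage */ }

  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    // Frequent soft commits for near-real-time visibility...
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() { softCommit(); }
    }, 1, 1, TimeUnit.SECONDS);
    // ...and rare hard commits to bound data loss on a crash.
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() { hardCommit(); }
    }, 5, 5, TimeUnit.MINUTES);
  }
}
{code}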
Want to learn Solr development
Hi, I want to learn Solr development. I have used Solr 1.4.1 for full-text search of PDF and DOC files, and for database search using Solr's multicore feature. Now I want to do development on Solr itself. How should I start? Please help.
[jira] [Updated] (SOLR-1922) DocBuilder onImportError/Abort EventListener
[ https://issues.apache.org/jira/browse/SOLR-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-1922:
--------------------------------
Affects Version/s: 4.0
                   3.1

DocBuilder onImportError/Abort EventListener
--------------------------------------------
Key: SOLR-1922
URL: https://issues.apache.org/jira/browse/SOLR-1922
Project: Solr
Issue Type: Improvement
Components: contrib - DataImportHandler
Affects Versions: 1.4, 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Trivial
Labels: DIH, DataImportHandler, DocBuilder, EventListener
Attachments: SOLR-1922.patch

The onImportEnd EventListener only fires off after a successful import. It would be useful to know when an import fails via an onImportError/Abort EventListener.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (SOLR-1922) DocBuilder onImportError/Abort EventListener
[ https://issues.apache.org/jira/browse/SOLR-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-1922:
--------------------------------
Affects Version/s: 3.1.1

DocBuilder onImportError/Abort EventListener
--------------------------------------------
Key: SOLR-1922
URL: https://issues.apache.org/jira/browse/SOLR-1922
Project: Solr
Issue Type: Improvement
Components: contrib - DataImportHandler
Affects Versions: 1.4, 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Trivial
Labels: DIH, DataImportHandler, DocBuilder, EventListener
Attachments: SOLR-1922.patch

The onImportEnd EventListener only fires off after a successful import. It would be useful to know when an import fails via an onImportError/Abort EventListener.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (LUCENE-1888) Provide Option to Store Payloads on the Term Vector
[ https://issues.apache.org/jira/browse/LUCENE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017483#comment-13017483 ]

Peter Wilkins commented on LUCENE-1888:
---------------------------------------

As someone new to Lucene with a specific problem to solve, it is difficult to identify the appropriate Lucene feature to use. Reading various online posts, I see I'm not alone. I have a use case that I think this JIRA issue addresses; perhaps it will help refine what the issue resolution should do.

I'm indexing a lecture video transcript. I want to store the text of the transcript along with the timecode at which each word occurs. I want to search the text of the transcript and get back the timecode, so I can play the lecture video from that spot.

Provide Option to Store Payloads on the Term Vector
---------------------------------------------------
Key: LUCENE-1888
URL: https://issues.apache.org/jira/browse/LUCENE-1888
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 4.0

Would be nice to have the option to access the payloads in a document-centric way by adding them to the term vectors. Naturally, this makes the term vectors bigger, but it may be just what one needs.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
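For this use case, payloads on postings can already carry the timecodes today; what this issue would add is reading them back document-centrically via term vectors. Below is a 3.x-flavored, untested sketch (the per-token timecode array is an assumed input) that attaches a timecode to each token via PayloadAttribute:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public final class TimecodeFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final int[] timecodesMs; // one entry per token, from the transcript
  private int pos = 0;

  public TimecodeFilter(TokenStream in, int[] timecodesMs) {
    super(in);
    this.timecodesMs = timecodesMs;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    int ms = timecodesMs[pos++];
    byte[] bytes = new byte[] { // encode the timecode as a big-endian int
      (byte) (ms >>> 24), (byte) (ms >>> 16), (byte) (ms >>> 8), (byte) ms
    };
    payloadAtt.setPayload(new Payload(bytes));
    return true;
  }
}
{code}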
[jira] [Created] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage
Using spellcheck.collate can result in extremely high memory usage
-------------------------------------------------------------------
Key: SOLR-2462
URL: https://issues.apache.org/jira/browse/SOLR-2462
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.1, 4.0
Reporter: James Dyer
Priority: Critical

When using spellcheck.collate, class SpellPossibilityIterator creates a ranked list of *every* possible correction combination. But if returning several corrections per term, and if several words are misspelled, the existing algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010. However, it is triggered anytime spellcheck.collate is used. It is not necessary to use any features that were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our Solr servers down with infinite GC loops. It was pretty easy for this to happen, as occasionally a user will accidentally paste a URL into the search box on our app. Such a URL results in a search with ~12 misspelled words. We have spellcheck.count set to 15.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage
[ https://issues.apache.org/jira/browse/SOLR-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2462:
-----------------------------
Attachment: SOLR-2462.patch

This sets the maximum limit to 1000 possibilities. When this limit is reached, the list is sorted by rank and then reduced to the top 100. From then on, only collations with a rank equal to or better than the 100th are added. This process repeats until finished, or until it has taken 50ms, at which time it quits. I also added a maxTimeAllowed setting of 50ms to the collation test queries as an additional performance safeguard.

Using spellcheck.collate can result in extremely high memory usage
-------------------------------------------------------------------
Key: SOLR-2462
URL: https://issues.apache.org/jira/browse/SOLR-2462
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.1, 4.0
Reporter: James Dyer
Priority: Critical
Attachments: SOLR-2462.patch

When using spellcheck.collate, class SpellPossibilityIterator creates a ranked list of *every* possible correction combination. But if returning several corrections per term, and if several words are misspelled, the existing algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010. However, it is triggered anytime spellcheck.collate is used. It is not necessary to use any features that were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our Solr servers down with infinite GC loops. It was pretty easy for this to happen, as occasionally a user will accidentally paste a URL into the search box on our app. Such a URL results in a search with ~12 misspelled words. We have spellcheck.count set to 15.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
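An illustrative rendering of the pruning strategy described above (not the actual patch code; Candidate and the rank ordering are assumptions made for the sketch):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CollationPruner {
  static class Candidate {
    final String collation;
    final int rank; // lower = better
    Candidate(String collation, int rank) { this.collation = collation; this.rank = rank; }
  }

  static final Comparator<Candidate> BY_RANK = new Comparator<Candidate>() {
    public int compare(Candidate a, Candidate b) { return a.rank - b.rank; }
  };

  List<Candidate> prune(Iterable<Candidate> possibilities) {
    List<Candidate> pool = new ArrayList<Candidate>();
    int cutoffRank = Integer.MAX_VALUE;
    long deadline = System.currentTimeMillis() + 50; // 50 ms budget
    for (Candidate c : possibilities) {
      if (System.currentTimeMillis() > deadline) break; // quit, keep best so far
      if (c.rank > cutoffRank) continue;                // worse than current floor
      pool.add(c);
      if (pool.size() >= 1000) {                        // hit the hard cap:
        Collections.sort(pool, BY_RANK);                // rank the pool,
        pool.subList(100, pool.size()).clear();         // keep only the top 100,
        cutoffRank = pool.get(99).rank;                 // and raise the floor
      }
    }
    return pool;
  }
}
{code}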
[jira] [Updated] (SOLR-2462) Using spellcheck.collate can result in extremely high memory usage
[ https://issues.apache.org/jira/browse/SOLR-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-2462:
------------------------------
Affects Version/s: (was: 4.0)
Fix Version/s: 4.0
               3.1.1

Using spellcheck.collate can result in extremely high memory usage
-------------------------------------------------------------------
Key: SOLR-2462
URL: https://issues.apache.org/jira/browse/SOLR-2462
Project: Solr
Issue Type: Bug
Components: spellchecker
Affects Versions: 3.1
Reporter: James Dyer
Priority: Critical
Fix For: 3.1.1, 4.0
Attachments: SOLR-2462.patch

When using spellcheck.collate, class SpellPossibilityIterator creates a ranked list of *every* possible correction combination. But if returning several corrections per term, and if several words are misspelled, the existing algorithm uses a huge amount of memory.

This bug was introduced with SOLR-2010. However, it is triggered anytime spellcheck.collate is used. It is not necessary to use any features that were added with SOLR-2010.

We were in production with Solr for 1 1/2 days when this bug started taking our Solr servers down with infinite GC loops. It was pretty easy for this to happen, as occasionally a user will accidentally paste a URL into the search box on our app. Such a URL results in a search with ~12 misspelled words. We have spellcheck.count set to 15.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [Lucene.Net] escaping single quotes when using query parser
I'm not sure your issue is related to single quotes. The only characters that need to be escaped for the QueryParser are + - && || ! ( ) { } [ ] ^ " ~ * ? : \ and you can do that using QueryParser.Escape(string).

It's possible it might be related to the analyzer that you're using. In my experience, using a different analyzer to index than you use to search can *sometimes* cause unexpected behavior like this. Since I haven't run into this exact problem myself, to the best of my knowledge, it's tough for me to give a more specific answer without your code/test data.

Thanks, Christopher

On Thu, Apr 7, 2011 at 2:01 AM, Ben Foster b...@planetcloud.co.uk wrote:

Hi, How should we escape single quotes when working with the query parser? Currently we have a description field that may contain single quotes. While this field is correctly indexed, when we search the description no results are returned. I'm assuming it's because we need to replace the single quote in the search term with an escaped version.

Many thanks,
Ben Foster

planetcloud
The Elms, Hawton
Newark-on-Trent
Nottinghamshire NG24 3RL
http://www.planetcloud.co.uk/
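For reference, here is the escaping step in Java (Lucene.Net mirrors this API as QueryParser.Escape(string), so treat this Java-flavored snippet only as a sketch of the flow; the field name and analyzer are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

Query parseUserInput(String raw) throws ParseException {
  // Single quotes are not special to the QueryParser, so escape() leaves
  // them alone; only characters like ( ) [ ] * ? : \ get a backslash.
  String escaped = QueryParser.escape(raw); // e.g. "(2nd ed.)" -> "\(2nd ed.\)"
  QueryParser qp = new QueryParser(Version.LUCENE_30, "description",
      new StandardAnalyzer(Version.LUCENE_30));
  return qp.parse(escaped);
}

If escaping doesn't change anything, the next thing to check is that the same analyzer is used at index and search time, per the point above.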
[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs
[ https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017514#comment-13017514 ]

Simon Willnauer commented on LUCENE-2956:
-----------------------------------------

FYI, I have a working patch for this. It needs some cleanup, so I will hopefully upload it at the beginning of next week.

Support updateDocument() with DWPTs
-----------------------------------
Key: LUCENE-2956
URL: https://issues.apache.org/jira/browse/LUCENE-2956
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Realtime Branch

With separate DocumentsWriterPerThreads (DWPT) it can currently happen that the delete part of an updateDocument() is flushed and committed separately from the corresponding new document. We need to make sure that updateDocument() is always an atomic operation from an IW.commit() and IW.getReader() perspective. See LUCENE-2324 for more details.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
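For readers new to the issue, this is the call whose atomicity is at stake (standard IndexWriter API; the field setup is illustrative). Lucene treats updateDocument as delete-by-term plus add, and those two halves must appear atomic to commit() and getReader() even when documents land in different DWPTs:

{code:java}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDocExample {
  /** Replace (or insert) the document whose "id" field is 42. */
  static void replaceDoc(IndexWriter writer) throws IOException {
    Document doc = new Document();
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("body", "updated text", Field.Store.NO, Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "42"), doc); // atomic delete + add
  }
}
{code}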
[jira] [Updated] (SOLR-2355) simple distrib update processor
[ https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-2355:
------------------------------
Fix Version/s: 4.0
               3.2

One thing we should probably address is the brittle cmd cloning. I don't like clone methods in general - but if we are going to do it in core code, it's better to put the clone in the cmd and be a bit less brittle.

simple distrib update processor
-------------------------------
Key: SOLR-2355
URL: https://issues.apache.org/jira/browse/SOLR-2355
Project: Solr
Issue Type: New Feature
Reporter: Yonik Seeley
Priority: Minor
Fix For: 3.2, 4.0
Attachments: DistributedUpdateProcessorFactory.java, TestDistributedUpdate.java

Here's a simple update processor for distributed indexing that I implemented years ago. It implements a simple hash(id) MOD nservers scheme and just fails if any servers are down. Given the recent activity in distributed indexing, I thought this might be at least a good source for ideas.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
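A hypothetical sketch of "put the clone in the cmd" (field names follow the era's AddUpdateCommand, but this is a sketch, not the patch): the command copies itself, so the distributing processor never has to know which fields exist.

{code:java}
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;

public class CopyableAddUpdateCommand extends AddUpdateCommand {
  /** Each shard request gets its own command instance to mutate safely. */
  public CopyableAddUpdateCommand copy() {
    CopyableAddUpdateCommand c = new CopyableAddUpdateCommand();
    c.solrDoc = solrDoc; // sharing the doc is fine if treated as read-only
    c.overwriteCommitted = overwriteCommitted;
    c.overwritePending = overwritePending;
    return c;
  }
}
{code}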
[jira] [Updated] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Dyer updated SOLR-2382:
-----------------------------
Attachment: SOLR-2382.patch

This version fixes a bug involving the DIHCacheProcessor in the case of a many-to-[one|many] join between the parent entity and a child entity. If the child entity used a DIHCacheProcessor and the same child joined to consecutive parents, only the first parent would join to the child.

DIH Cache Improvements
----------------------
Key: SOLR-2382
URL: https://issues.apache.org/jira/browse/SOLR-2382
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the entity processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (e.g. flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, and two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
- NOTE: the existing Lucene contrib "db" project uses je-3.3.93.jar. I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow entity processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase and DIHCacheProperties).
3. Partially de-couple SolrWriter from DocBuilder.
- Created a new interface, DIHWriter, and two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a cache).
4. Create a new entity processor, DIHCacheProcessor, which reads a persistent cache as DIH entity input.
5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data.
6. Change the semantics of entity.destroy():
- Previously, it was being called on each iteration of DocBuilder.buildDocument().
- Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed.
- The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change.

General Notes: We are near completion in converting our search functionality from a legacy search engine to Solr. However, I found that DIH did not support caching to the level of our prior product's data import utility. In order to get our data into Solr, I created these caching enhancements. Because I believe this has
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017567#comment-13017567 ]

James Dyer commented on SOLR-2382:
----------------------------------

In light of the recent discussion about the Spatial contrib, I wonder if seeking to get this committed is a non-starter because of its dependency on bdb-je. I thought this wouldn't be an issue because we have an existing Lucene contrib (db) with this same dependency, but then I noticed that some of the committers regret the existence of the db contrib for this reason (and others). In any case, even if the BerkleyBackedCache part of this patch could not be committed, having this framework in place so that developers can write their own persistent cache impls would be a major improvement, in my opinion. (I had originally started with a Lucene-backed cache, but switched to bdb-je because I couldn't figure out how to achieve acceptable performance for gets from the cache.)

DIH Cache Improvements
----------------------
Key: SOLR-2382
URL: https://issues.apache.org/jira/browse/SOLR-2382
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the entity processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (e.g. flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, and two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
- NOTE: the existing Lucene contrib "db" project uses je-3.3.93.jar. I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow entity processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase and DIHCacheProperties).
3. Partially de-couple SolrWriter from DocBuilder.
- Created a new interface, DIHWriter, and two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a cache).
4. Create a new entity processor, DIHCacheProcessor, which reads a persistent cache as DIH entity input.
5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data.
6. Change the semantics of entity.destroy():
- Previously, it was being called on each iteration of DocBuilder.buildDocument().
- Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache)
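A hypothetical rendering of the pluggable cache contract the patch describes (the authoritative signatures are in the SOLR-2382 patch itself; the method names here are guesses from the description above):

{code:java}
import java.util.Iterator;
import java.util.Map;

public interface DIHCache extends Iterable<Map<String, Object>> {
  void add(Map<String, Object> rec);                  // cache one entity row
  Iterator<Map<String, Object>> iterator(Object key); // rows joining to a key
  void flush();                                       // persist pending writes
  void close();                                       // detach from the backing store
  void destroy();                                     // delete a disk-backed cache
}
{code}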
[HUDSON] Lucene-Solr-tests-only-3.x - Build # 6870 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/6870/

1 tests failed.

REGRESSION: org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message: Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
at java.lang.StringBuffer.append(StringBuffer.java:337)
at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1082)
at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1010)

Build Log (for compile errors): [...truncated 5265 lines...]
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6880 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6880/

1 tests failed.

REGRESSION: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch

Error Message:
Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
- org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:517)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:406)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:290)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:239)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
at org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:123)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:118)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:245)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:236)
at org.apache.solr.cloud.AbstractDistributedZkTestCase.createServers(AbstractDistributedZkTestCase.java:64)
at org.apache.solr.BaseDistributedSearch

Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
- org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:517)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:406)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:290)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:239)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1104)
at org.mortbay.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1140)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:940)
at org.mortbay.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:895)
at org.mortbay.jetty.servlet.Context.addFilter(Context.java:207)
at org.apache.solr.client.solrj.embedded.JettySolrRunner$1.lifeCycleStarted(JettySolrRunner.java:98)
at org.mortbay.component.AbstractLifeCycle.setStarted(AbstractLifeCycle.java:140)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:52)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:123)
at org.apache.solr.client.solrj.embedded.JettySolrRunner.start(JettySolrRunner.java:118)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:245)
at org.apache.solr.BaseDistributedSearchTestCase.createJetty(BaseDistributedSearchTestCase.java:236)
at org.apache.solr.cloud.AbstractDistributedZkTestCase.createServers(AbstractDistributedZkTestCase.java:64)
at org.apache.solr.BaseDistributedSearch

request: http://localhost:55435/solr/update?wt=javabin&version=2

Stack Trace:
request: http://localhost:55435/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:111)
at
(LUCENE-2793: Directory createOutput and openInput should take an IOContext, and LUCENE-2795: Genericize DirectIOLinuxDir - UnixDir) as GSoC
I'm moving this discussion to this thread as suggested by the Lucene mentors. This is my final proposal link: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/varunthacker1989/1

So far I have implemented the following to get a better understanding of my project: I wrote sample code to test the speed difference between SEQUENTIAL and O_DIRECT reads (I used the madvise flag MADV_DONTNEED). This is the link to the code: http://pastebin.com/8QywKGyS. There was a noticeable speed difference when I switched between the two flags. I did not use the O_DIRECT flag itself because Linus Torvalds had criticized it.

This blog post by Michael McCandless (http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html) helped me understand the problem for which LUCENE-2793 is needed.

I am currently reading up on the Lucene documentation, taking into account a few pointers provided by Simon Willnauer. I want to keep everyone updated and also take into consideration the comments made by members of this community, which will help me understand and implement these tasks.

--
Regards,
Varun Thacker
http://varunthacker.wordpress.com
Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure
On Fri, Apr 8, 2011 at 2:36 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: OOME on this one... we really need the dump heap on OOME JRE command
: line set...
Ant runs the tests in forked JVMs, right? So that should just be a build.xml change.

OK, I tried that with the patch below (applies to 3.x), then provoked an OOME, and it works great, though I think this is Sun (Oracle!) JRE specific... which is OK for now (we use the Oracle JRE on Jenkins, right?), but if we want to rotate JREs in the future this won't work...

The problem is... the resulting dump is large (mine was ~400 MB). We can specify a location for the dump (-XX:HeapDumpPath=/some/path)... I think we should somehow remove them after a few days? How much disk space can we use up?

Patch:

Index: solr/build.xml
===================================================================
--- solr/build.xml (revision 1089906)
+++ solr/build.xml (working copy)
@@ -464,6 +464,7 @@
       <jvmarg line="${dir.prop}"/>
       <jvmarg line="${args}"/>
+      <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>
       <formatter classname="${junit.details.formatter}" usefile="false" if="junit.details"/>
       <classpath refid="test.run.classpath"/>

Index: lucene/common-build.xml
===================================================================
--- lucene/common-build.xml (revision 1089906)
+++ lucene/common-build.xml (working copy)
@@ -488,6 +488,7 @@
       </assertions>
       <jvmarg line="${args}"/>
+      <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError"/>
       <!-- allow tests to control debug prints -->
       <sysproperty key="tests.verbose" value="${tests.verbose}"/>

Mike
[jira] [Created] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

When using an Evaluator outside an entity element the Context argument is null.

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}

<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..." batchSize="..."/>

<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
[ https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-2463:
--------------------------------

Description:

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

was:

When using an Evaluator outside an entity element the Context argument is null.

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}

<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..." batchSize="..."/>

<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>

Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
[ https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-2463:
--------------------------------

Description:

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

was:

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2463) Using an evaluator outside the scope of an entity results in a null context
[ https://issues.apache.org/jira/browse/SOLR-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Zotter updated SOLR-2463:
--------------------------------

Fix Version/s: 3.1.1

Using an evaluator outside the scope of an entity results in a null context
----------------------------------------------------------------------------

Key: SOLR-2463
URL: https://issues.apache.org/jira/browse/SOLR-2463
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.1, 3.1.1, 4.0
Reporter: Robert Zotter
Priority: Minor
Fix For: 3.1.1

When using an Evaluator outside an entity element the Context argument is null.

{code:title=foo.LowerCaseFunctionEvaluator.java|borderStyle=solid}
public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}
{code}

{code:title=data-config.xml|borderStyle=solid}
<dataSource name="..." type="..." driver="..." url="..."
            user="${dataimporter.functions.toLowerCase('THIS_WILL_NOT_WORK')}"
            password="..."/>
{code}

{code:title=data-config.xml|borderStyle=solid}
<entity name="..." dataSource="..."
        query="select * from ${dataimporter.functions.toLowerCase('THIS_WILL_WORK')}"/>
{code}

This use case worked in 1.4

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
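[A side note for anyone hitting SOLR-2463 before a fix lands: the immediate symptom is a NullPointerException from context.getVariableResolver(). The defensive rewrite of the example evaluator below is only my own sketch, not a patch for the underlying bug; it does not restore the 1.4 behavior, it just turns the opaque NPE into a readable error. The imports assume the 3.1 DataImportHandler package layout.]

import java.util.List;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Evaluator;
import org.apache.solr.handler.dataimport.EvaluatorBag;

public class LowerCaseFunctionEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    if (context == null) {
      // SOLR-2463: outside an entity, DIH hands the evaluator no Context,
      // so fail fast with a clear message instead of an NPE below.
      throw new RuntimeException(
          "toLowerCase: evaluator invoked outside an entity, Context is null (SOLR-2463)");
    }
    List l = EvaluatorBag.parseParams(expression, context.getVariableResolver());
    if (l.size() != 1) {
      throw new RuntimeException("'toLowerCase' must have only one parameter");
    }
    return l.get(0).toString().toLowerCase();
  }
}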
RE: [HUDSON] Lucene-trunk - Build # 1523 - Failure
AFAICT, these (Oracle/Sun-only) JVM parameters were introduced in 1.6 (that's what this parameter list says: http://blogs.sun.com/watt/resource/jvm-options-list.html) - we tell Jenkins to use 1.6 for Lucene/Solr testing, so this isn't an issue in practice, I guess. Hopefully 1.5 JVMs, and non-Oracle/Sun JVMs, won't choke on these unknown parameters.

Steve

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, April 08, 2011 3:55 PM
To: dev@lucene.apache.org
Cc: Chris Hostetter; Apache Hudson Server
Subject: Re: [HUDSON] Lucene-trunk - Build # 1523 - Failure

On Fri, Apr 8, 2011 at 2:36 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: OOME on this one... we really need the dump heap on OOME JRE command
: line set... Ant runs the tests in forked JVMs right? so that should
: just be a build.xml change.

OK I tried that, with the patch below (applies to 3.x), and then provoked an OOME and it works great, though I think this is Sun (Oracle!) JRE specific... which is OK for now (we use the Oracle JRE on Jenkins, right?), but if we want to rotate JREs in the future this won't work...

The problem is... the resulting dump is large (mine was ~400 MB). We can specify a location for the dump (-XX:HeapDumpPath=/some/path)... I think we should somehow remove them after a few days? How much disk space can we use up?

Patch:

Index: solr/build.xml
===================================================================
--- solr/build.xml	(revision 1089906)
+++ solr/build.xml	(working copy)
@@ -464,6 +464,7 @@
         <jvmarg line="${dir.prop}"/> -->
         <jvmarg line="${args}"/>
+        <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>
         <formatter classname="${junit.details.formatter}" usefile="false" if="junit.details"/>
         <classpath refid="test.run.classpath"/>

Index: lucene/common-build.xml
===================================================================
--- lucene/common-build.xml	(revision 1089906)
+++ lucene/common-build.xml	(working copy)
@@ -488,6 +488,7 @@
       </assertions>
       <jvmarg line="${args}"/>
+      <jvmarg value="-XX:+HeapDumpOnOutOfMemoryError"/>
       <!-- allow tests to control debug prints -->
       <sysproperty key="tests.verbose" value="${tests.verbose}"/>

Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017679#comment-13017679 ]

Jason Rutherglen commented on LUCENE-2186:
------------------------------------------

I'm wondering if there is a limitation on whether or not we can randomly access the doc values from the underlying Directory implementation, rather than needing to load all the values directly into the main heap space. This seems doable, and if so let me know if I can provide a patch.

First cut at column-stride fields (index values storage)
---------------------------------------------------------

Key: LUCENE-2186
URL: https://issues.apache.org/jira/browse/LUCENE-2186
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
Fix For: CSF branch, 4.0
Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py

I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc.

The code is standalone now, and lives under the new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc.

It handles 3 types of values:

* Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted.
* Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between)
* Floats (4 or 8 byte precision)

String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds a basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl.

This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document.

Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort-by-String-value case, I think RAM usage & GC load of this index values API should be much better than field cache, since it does not create an object per document (instead it shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes.

There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead.

I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring.

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
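[What Jason describes -- serving each get(docID) straight off the Directory instead of materializing an array on the heap -- could look roughly like this for the fixed-width case. IndexInput.seek/readLong are real Lucene store APIs; the surrounding class, its name, and its file layout are purely illustrative:]

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Illustrative fixed-width reader that leaves the values on disk and
// seeks per lookup, instead of pinning a long[] in heap like field cache.
public class DiskResidentIntsReader {
  private final IndexInput in;  // opened from the Directory
  private final long dataStart; // file offset where the values begin

  public DiskResidentIntsReader(IndexInput in, long dataStart) {
    this.in = in;
    this.dataStart = dataStart;
  }

  // One seek + one read per lookup; the OS page cache then decides what
  // actually stays in RAM. Synchronized because a single IndexInput is
  // not safe for concurrent positioning.
  public synchronized long get(int docID) throws IOException {
    in.seek(dataStart + 8L * docID); // fixed 8-byte values in this sketch
    return in.readLong();
  }
}

[Variable-length values would need an extra offsets table, and an MMapDirectory-backed input would make the per-lookup cost mostly a page-cache hit, which seems to be the trade-off the comment is asking about.]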
some questions about how Lucene works internally
(sorry for cross-posting here, but it seems this question on the internal mechanisms of Lucene cannot be answered on the user@ list, so I'm asking for more expert knowledge here. Thanks a lot.)

===

I'm new to Lucene/search engines, and have been struggling with these questions recently. I'd appreciate it a lot if you could shed some light on this.

Let's say I do a query on: dog greyhound

Note that I did not quote them, i.e. this is not a phrase search. What happens under the hood? Which term does Lucene use to look up the inverted index? I read somewhere that Lucene uses the term with the higher IDF (i.e. the more distinguishing term), in this case greyhound, but what about dog? Does Lucene traverse down the doclist of dog at all? If I provide multiple terms in my query, generally how does Lucene decide how many doclists to travel down?

I read that Lucene uses a combination of the binary model and the VSM. It seems then that in the above case, it finds the full doclist of dog, and that of greyhound (the binary model part), then finds the common docs from the two doclists, then orders them by score (the VSM part). Is it true that the FULL doclists are fetched first? Or is some pruning done on the individual doclists? I see the talk at http://www.slideshare.net/abial/eurocon2010 that discusses pruning and tiered search, but is this the default behavior of Lucene?

How are the doclists sorted? (by IDF? -- sorry, I'm just beginning to sift through a lot of docs online; somehow I got this impression but can't form a precise conclusion)

Also, generally, could you please recommend some good articles on how Lucene/search engines work? I've read "The Anatomy of a Search Engine" (the Sergey Brin / Larry Page paper), "Introduction to Information Retrieval" (Manning et al.), and "Lucene in Action".

Thanks
Yang

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
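[On the intersection question specifically: a conjunction over postings lists is typically evaluated with a "leapfrog" walk, so neither full doclist has to be fetched up front, and the rarer (higher-IDF) term naturally drives the skipping. A sketch of the idea in Java follows -- PostingsIterator here is a stand-in interface I made up for illustration, not Lucene's actual postings API:]

// Stand-in for a real postings API: advance(target) skips to the first
// doc >= target, returning NO_MORE_DOCS when the list is exhausted.
interface PostingsIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int advance(int target);
}

final class Leapfrog {
  // Visit every doc that appears in both lists without reading either
  // list in full: each side only advances past the other's position.
  static void intersect(PostingsIterator rare, PostingsIterator common) {
    int doc = rare.advance(0);
    while (doc != PostingsIterator.NO_MORE_DOCS) {
      int other = common.advance(doc);
      if (other == doc) {
        // Both terms occur in doc: this is a hit; score it here (VSM part).
        doc = rare.advance(doc + 1);
      } else {
        // common skipped past doc; let the rarer list catch up.
        doc = rare.advance(other);
      }
    }
  }
}

[Driving the loop from the rarer term minimizes the number of advance() calls on the long list, which is where the higher-IDF intuition in the question comes from; skip structures inside the postings make each advance() cheaper than a linear scan.]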
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 6894 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/6894/

1 tests failed.

REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration

Error Message: expected:<2> but was:<3>

Stack Trace:
junit.framework.AssertionFailedError: expected:<2> but was:<3>
        at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
        at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
        at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:208)

Build Log (for compile errors): [...truncated 8828 lines...]

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org