[jira] Created: (SOLR-1744) Streams retrieved from ContentStream#getStream are not always closed

2010-02-02 Thread Mark Miller (JIRA)
Streams retrieved from ContentStream#getStream are not always closed
---

 Key: SOLR-1744
 URL: https://issues.apache.org/jira/browse/SOLR-1744
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5


It doesn't look like BinaryUpdateRequestHandler or CommonsHttpSolrServer close 
these streams.
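
As a general pattern, something like the following should be enough (a minimal 
sketch with hypothetical variable names, not the committed patch):

{code}
InputStream in = null;
try {
  in = contentStream.getStream();
  // ... consume the stream ...
} finally {
  if (in != null) {
    in.close();
  }
}
{code}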

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1745) MoreLikeThisHandler gets a Reader from a ContentStream and doesn't close it

2010-02-02 Thread Mark Miller (JIRA)
MoreLikeThisHandler gets a Reader from a ContentStream and doesn't close it
---

 Key: SOLR-1745
 URL: https://issues.apache.org/jira/browse/SOLR-1745
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1746) CommonsHttpSolrServer passes a ContentStream reader to IOUtils.copy, but doesn't close it.

2010-02-02 Thread Mark Miller (JIRA)
CommonsHttpSolrServer passes a ContentStream reader to IOUtils.copy, but doesn't 
close it.
-

 Key: SOLR-1746
 URL: https://issues.apache.org/jira/browse/SOLR-1746
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5


IOUtils.copy will not close your reader for you:

{code}
@Override
protected void sendData(OutputStream out) throws IOException {
  IOUtils.copy(c.getReader(), out);
}
{code}
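
One possible fix is simply to close the Reader once the copy completes -- a 
minimal sketch, not the committed patch:

{code}
@Override
protected void sendData(OutputStream out) throws IOException {
  Reader reader = c.getReader();
  try {
    IOUtils.copy(reader, out);
  } finally {
    reader.close();
  }
}
{code}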

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1747) DumpRequestHandler doesn't close Stream

2010-02-02 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-1747:
--

Affects Version/s: 1.4
Fix Version/s: 1.5

 DumpRequestHandler doesn't close Stream
 ---

 Key: SOLR-1747
 URL: https://issues.apache.org/jira/browse/SOLR-1747
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Priority: Minor
 Fix For: 1.5


 {code}
 stream.add( "stream", IOUtils.toString( content.getStream() ) );
 {code}
 IOUtils.toString won't close the stream for you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1747) DumpRequestHandler doesn't close Stream

2010-02-02 Thread Mark Miller (JIRA)
DumpRequestHandler doesn't close Stream
---

 Key: SOLR-1747
 URL: https://issues.apache.org/jira/browse/SOLR-1747
 Project: Solr
  Issue Type: Bug
Reporter: Mark Miller
Priority: Minor


{code}
stream.add( "stream", IOUtils.toString( content.getStream() ) );
{code}

IOUtils.toString won't close the stream for you.
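
The fix pattern is the same as in the related issues -- a minimal sketch, not 
the committed patch:

{code}
InputStream in = content.getStream();
try {
  stream.add( "stream", IOUtils.toString( in ) );
} finally {
  in.close();
}
{code}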

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1748) RawResponseWriter doesn't close Reader

2010-02-02 Thread Mark Miller (JIRA)
RawResponseWriter doesn't close Reader
--

 Key: SOLR-1748
 URL: https://issues.apache.org/jira/browse/SOLR-1748
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5


{code}
 IOUtils.copy( content.getReader(), writer );
{code}
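
Same pattern again -- a minimal sketch, not the committed patch:

{code}
Reader reader = content.getReader();
try {
  IOUtils.copy( reader, writer );
} finally {
  IOUtils.closeQuietly( reader );
}
{code}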

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1301:
---

Attachment: SOLR-1301.patch

I added the following to the SRW.close method's finally clause:

{code}
FileUtils.forceDelete(new File(temp.toString()));
{code}
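
For context, this is roughly how that call sits in SolrRecordWriter.close (an 
abbreviated sketch, not the patch itself; the {{solr}} field name is assumed, 
and close(Reporter) is the old mapred RecordWriter signature):

{code}
// imports: java.io.*, org.apache.commons.io.FileUtils,
//          org.apache.hadoop.mapred.Reporter, org.apache.solr.client.solrj.*
public void close(Reporter reporter) throws IOException {
  try {
    solr.commit();
    solr.optimize();
  } catch (SolrServerException e) {
    throw new IOException(e.toString());
  } finally {
    // clean up the temporary local solr.home used by the embedded server
    FileUtils.forceDelete(new File(temp.toString()));
  }
}
{code}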

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue; you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
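
 To illustrate the converter hook described above, here is a hedged sketch of a 
 trivial SolrDocumentConverter for text key/value pairs (the exact class shape 
 is assumed from this description, not copied from the patch):

 {code}
// imports: java.util.*, org.apache.hadoop.io.Text,
//          org.apache.solr.common.SolrInputDocument
public class TextDocumentConverter extends SolrDocumentConverter<Text, Text> {
  @Override
  public Collection<SolrInputDocument> convert(Text key, Text value) {
    // one Hadoop (key, value) pair becomes one SolrInputDocument
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    doc.addField("text", value.toString());
    return Collections.singletonList(doc);
  }
}
 {code}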

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828835#action_12828835
 ] 

Hoss Man commented on SOLR-1677:


bq. I guess I could care less what the default is, if you care about such 
things you shouldn't be using the defaults and instead specifying this yourself 
in the schema, and Version has no effect.

...which is all well and good, but it just reiterates the need for really good 
documentation about what is impacted by changing a global Version setting -- 
otherwise users might be depending on a default behavior that is going to 
change when Version is bumped, and they may not even realize it.

Bear in mind: these are just the nuances that people need to worry about when 
considering a switch from 2.4 to 2.9 to 3.0 ... there will likely be a lot more 
of these over time.

And just to be as crystal clear as I possibly can:
* my concern is purely about how to document this stuff.
* I do in fact agree that a global luceneMatchVersion option is a good idea

 Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
 BaseTokenFilterFactory
 ---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
 SOLR-1677.patch


 Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
 compatibility with old indexes created using older versions of Lucene. The 
 most important example is StandardTokenizer, which changed its behaviour with 
 posIncr and incorrect host token types in 2.4 and also in 2.9.
 In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
 much more Unicode support, almost every Tokenizer/TokenFilter needs this 
 Version parameter. In 2.9, the deprecated old ctors without Version take 
 LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
 This patch adds basic support for the Lucene Version property to the base 
 factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
 contains a helper map to decode the version strings, but in 3.0 it can be 
 replaced by Version.valueOf(String), as Version is a Java 5 enum. The default 
 value is Version.LUCENE_24 (as this is the default for the 
 no-version ctors in Lucene).
 This patch also removes unneeded conversions to CharArraySet from 
 StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
 to match Lucene 3.0.
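
 A hedged sketch of the decoding step described above (helper name 
 hypothetical); once on 3.0 the helper map can simply delegate to 
 Version.valueOf(String):

 {code}
static Version parseLuceneMatchVersion(String s) {
  if (s == null) {
    // mirror the default of the deprecated no-version ctors in Lucene 2.9
    return Version.LUCENE_24;
  }
  return Version.valueOf(s.toUpperCase(Locale.ENGLISH)); // e.g. "LUCENE_29"
}
 {code}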

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1718) Carriage return should submit query admin form

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828842#action_12828842
 ] 

Hoss Man commented on SOLR-1718:


bq. Consider the JIRA interface we are using to comment on this issue. 

Sure, but that's an {{<input type="text" />}}, not a {{<textarea />}} ... the 
expected semantics are completely different.  With an {{<input type="text" />}} 
box the browser already takes care of submitting the form if you hit Enter (and 
FWIW: most browsers I know of also submit forms if you use Shift-Enter in a 
{{<textarea />}}).

It sounds like what you are really suggesting is that we change the 
/admin/index.jsp form to use an {{<input type="text" />}} instead of a 
{{<textarea />}} for the q param, and not that we add special (javascript) 
logic to the form to submit if someone presses Enter inside the existing 
{{<textarea />}} ... which I have a lot less objection to than going out of 
our way to violate standard form convention.

 Carriage return should submit query admin form
 --

 Key: SOLR-1718
 URL: https://issues.apache.org/jira/browse/SOLR-1718
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Affects Versions: 1.4
Reporter: David Smiley
Priority: Minor

 Hitting the carriage return on the keyboard should submit the search query on 
 the admin front screen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1729) Date Facet now override time parameter

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828846#action_12828846
 ] 

Hoss Man commented on SOLR-1729:


Peter: I think you may have misconstrued my comments -- they were not 
criticisms of your patch, they were a clarification of why the functionality 
you are proposing is important.

bq. Can you point me toward the class(es) where filter queries' date math lives

it's all handled internally by DateField, at which point it has no notion of 
the request -- I believe this is why Yonik suggested using a ThreadLocal 
variable to track a consistent NOW that any method anywhere in Solr could use 
(if set) for the current request ... then we just need something like SolrCore 
to set it on each request (or accept it as a param if specified)

bq. As filter queries are cached separately, can you think of any potential 
caching issues relating to filter queries?

The cache keys for things like that are the Query objects themselves, and at 
that point the DateMath strings (including NOW) have already been resolved 
into real time values, so that shouldn't be an issue.
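
For illustration, a minimal sketch of that ThreadLocal approach (class and 
method names hypothetical):

{code}
// imports: java.util.Date
public final class RequestNow {
  private static final ThreadLocal<Date> NOW = new ThreadLocal<Date>();

  /** Called once per request (e.g. by SolrCore), optionally from a request param. */
  public static void set(Date now) { NOW.set(now); }

  /** Falls back to the wall clock when no request-scoped NOW was set. */
  public static Date get() {
    Date d = NOW.get();
    return d != null ? d : new Date();
  }

  /** Should be called in a finally block when the request completes. */
  public static void clear() { NOW.remove(); }
}
{code}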


 Date Facet now override time parameter
 --

 Key: SOLR-1729
 URL: https://issues.apache.org/jira/browse/SOLR-1729
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
 Environment: Solr 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetParams.java, SimpleFacets.java


 This PATCH introduces a new query parameter that tells a (typically, but not 
 necessarily) remote server what time to use as 'NOW' when calculating date 
 facets for a query (and, for the moment, date facets *only*) - overriding the 
 default behaviour of using the local server's current time.
 This gets 'round a problem whereby an explicit time range is specified in a 
 query (e.g. timestamp:[then0 TO then1]), and date facets are required for the 
 given time range (in fact, any explicit time range). 
 Because DateMathParser performs all its calculations from 'NOW', remote 
 callers have to work out how long ago 'then0' and 'then1' are from 'now', and 
 use the relative-to-now values in the facet.date.xxx parameters. If a remote 
 server has a different opinion of NOW compared to the caller, the results 
 will be skewed (e.g. they are in a different time-zone, not time-synced etc.).
 This becomes particularly salient when performing distributed date faceting 
 (see SOLR-1709), where multiple shards may all be running with different 
 times, and the faceting needs to be aligned.
 The new parameter is called 'facet.date.now', and takes as a parameter a 
 (stringified) long that is the number of milliseconds from the epoch (1 Jan 
 1970 00:00) - i.e. the returned value from a System.currentTimeMillis() call. 
 This was chosen over a formatted date to delineate it from a 'searchable' 
 time and to avoid superfluous date parsing. This makes the value generally a 
 programmatically-set value, but as that is where the use-case is for this type 
 of parameter, this should be ok.
 NOTE: This parameter affects date facet timing only. If there are other areas 
 of a query that rely on 'NOW', these will not interpret this value. This is a 
 broader issue about setting a 'query-global' NOW that all parts of query 
 analysis can share.
 Source files affected:
 FacetParams.java   (holds the new constant FACET_DATE_NOW)
 SimpleFacets.java  getFacetDateCounts() NOW parameter modified
 This PATCH is mildly related to SOLR-1709 (Distributed Date Faceting), but as 
 it's a general change for date faceting, it was deemed deserving of its own 
 patch. I will be updating SOLR-1709 in due course to include the use of this 
 new parameter, after some rfc acceptance.
 A possible enhancement to this is to detect facet.date fields, look for and 
 match these fields in queries (if they exist), and potentially determine 
 automatically the required time skew, if any. There are a whole host of 
 reasons why this could be problematic to implement, so an explicit 
 facet.date.now parameter is the safest route.
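
 For example, with this patch applied a SolrJ client would pass its own clock 
 explicitly (a hedged sketch; the parameter value is just the caller's 
 System.currentTimeMillis()):

 {code}
SolrQuery q = new SolrQuery("timestamp:[2010-01-01T00:00:00Z TO 2010-02-01T00:00:00Z]");
q.setFacet(true);
q.set("facet.date", "timestamp");
q.set("facet.date.start", "2010-01-01T00:00:00Z");
q.set("facet.date.end", "2010-02-01T00:00:00Z");
q.set("facet.date.gap", "+1DAY");
// tell the remote server what NOW to use for date facet calculations
q.set("facet.date.now", String.valueOf(System.currentTimeMillis()));
 {code}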

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r899979 - /lucene/solr/trunk/example/solr/conf/solrconfig.xml

2010-02-02 Thread Chris Hostetter

:  So what/how should we document all of this?
...
:  I've got more info on this.

Mark: most of what you wrote is above my head, but since you fixed a 
grammar error in my updated example solrconfig.xml comment w/o making any 
content changes, I'm assuming you feel what I put there is sufficient.

Most of your comments feel like they should be raised over in Lucene-Java 
land, at a minimum in documentation (added to the AvailableLockFactories 
page perhaps) or possibly in some code changes (should we change the 
default LockFactory depending on Java version?)

I'll leave that up to you, since (as I mentioned) I didn't understand half 
of it.

:  Checking for OverlappingFileLockException *should* actually work when
:  using Java 1.6. Java 1.6 started using a *system wide* thread safe check
:  for this.
: 
:  Previous to Java 1.6, checks for this *were* limited to an instance of
:  FileChannel - the FileChannel maintained its own personal lock list. So
:  you have to use
:  the same Channel to even have any hope of seeing an
:  OverlappingFileLockException. Even then though, its not properly thread
:  safe. They did not sync across
:  checking if the lock exists and acquiring the lock - they separately
:  sync each action - leaving room to acquire the lock twice from two
:  different threads like I was seeing.
: 
:  Interestingly, Java 1.6 has a back compat mode you can turn on that
:  doesn't use the system wide lock list, and they have fixed this thread
:  safety issue in that impl - there is a sync across checking
:  and getting the lock so that it is properly thread safe - but not in
:  Java 1.4, 1.5.
: 
:  Looking at GCC - uh ... I don't think you want to use GCC - they don't
:  appear to use a lock list and check for this at all :)
: 
:  But the point is, this is fixable on Java 6 if we check for
:  OverlappingFileLockException - it *should* work across webapps, and it
:  is actually thread safe, unlike Java 1.4,1.5.
: 
:
: Another interesting fact:
: 
: On Windows, if you attempt to lock the same file with different channel
: instances pre Java 1.6 - the code will deadlock.
: 
: -- 
: - Mark
: 
: http://www.lucidimagination.com
: 
: 
: 
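
For reference, a minimal probe of the Java 6 check described above (a fragment; 
the lock file name is hypothetical and the java.io/java.nio.channels imports 
and IOException handling are assumed):

    RandomAccessFile raf = new RandomAccessFile(new File("index.lock"), "rw");
    FileChannel channel = raf.getChannel();
    try {
      FileLock lock = channel.tryLock();
      if (lock == null) {
        System.out.println("lock held by another process");
      } else {
        lock.release();
      }
    } catch (OverlappingFileLockException e) {
      // Java 1.6 keeps a JVM-wide lock table, so this is thrown even when the
      // lock was acquired through a different FileChannel (e.g. another webapp
      // in the same JVM); on 1.4/1.5 the check is per-channel only.
      System.out.println("lock already held somewhere in this JVM");
    } finally {
      raf.close();
    }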



-Hoss



Re: Problem with German Wordendings

2010-02-02 Thread Chris Hostetter

http://people.apache.org/~hossman/#solr-dev
Please Use solr-u...@lucene Not solr-...@lucene

Your question is better suited for the solr-u...@lucene mailing list ...
not the solr-...@lucene list.  solr-dev is for discussing development of
the internals of the Solr application ... it is *not* the appropriate
place to ask questions about how to use Solr (or write Solr plugins) 
when developing your own applications.  Please resend your message to
the solr-user mailing list, where you are likely to get more/better
responses since that list also has a larger number of subscribers.


: Date: Tue, 26 Jan 2010 17:13:51 +0100
: From: David Rühr d...@web-factory.de
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Problem with German Wordendings
: 
: Hi List.
: 
: We have made a suggest search and send this query with a facet.prefix
: kinderzim:
: 
: facet=on
: facet.prefix=kinderzim
: facet.mincount=1
: facet.field=content
: facet.limit=10
: fl=content
: omitHeader=true
: bf=log%28supplier_faktor%29
: version=1.2
: wt=json
: json.nl=map
: q=
: start=0
: rows=0
: 
: 
: Now we get:
: <lst name="content">
:   <int name="kinderzimm">7</int>
: </lst>
: 
: Solr doesn't return the endings of the output words. It should be kinderzimmer;
: the same with kindermode, where we get kindermod.
: We add the words in our protwords.txt and include them with this line in
: schema.xml.
: <filter class="solr.SnowballPorterFilterFactory" language="German"
:         protected="protwords.txt"/>
: 
: Can anybody help us?
: 
: 
: Thanks and sorry about my english.
: So Long , David
: 
: 
: 



-Hoss


[jira] Created: (SOLR-1749) debug output should include explanation of what input strings were passed to the analyzers for each field

2010-02-02 Thread Hoss Man (JIRA)
debug output should include explanation of what input strings were passed to 
the analyzers for each field
-

 Key: SOLR-1749
 URL: https://issues.apache.org/jira/browse/SOLR-1749
 Project: Solr
  Issue Type: Wish
  Components: search
Reporter: Hoss Man


Users are frequently confused by the interplay between Query Parsing and 
Query Time Analysis (ie: markup meta-characters like whitespace and quotes, 
multi-word synonyms, Shingles, etc...)  It would be nice if we had more 
debugging output available that would help eliminate this confusion.  The ideal 
API that comes to mind would be to include in the debug output of SearchHandler 
a list of every string that was Analyzed, and what list of field names it was 
analyzed against.  

This info would not only make it clear to users what exactly they should 
cut/paste into the analysis.jsp tool to see how their Analyzer is getting used, 
but also what exactly is being done to their input strings prior to their 
Analyzer being used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1749) debug output should include explanation of what input strings were passed to the analyzers for each field

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1282#action_1282
 ] 

Hoss Man commented on SOLR-1749:


This is an idea that's been rolling around in my head for a while, and today I 
thought I'd spend some time experimenting with it.

It seemed like the main implementation challenge would be that by the time you 
are deep enough down in the code to be using an Analyzer, you don't have access 
to the SolrQueryRequest to record the debugging info.

I thought of two potential solutions...

 * Use ThreadLocal to track the debugging info if needed
 * Use Proxy Wrapper classes to record the debugging info if needed

I initially figured that writing proxy classes for SolrQueryRequest, 
IndexSchema, and Analyzer would be relatively straightforward, so I started 
down that path and discovered two annoying problems...

 # IndexSchema is currently final
 # not all code paths use IndexSchema.getQueryAnalyzer(), many fetch the 
FieldTypes and ask them for their Analyzer directly.

The second problem isn't insurmountable, but it complicates things in that it 
would require Proxy wrappers for FieldType as well.  The first problem requires 
a simple change, but carries with it some baggage that I wasn't ready to 
embrace.  In both cases I started to be very bothered by the long-term 
maintenance something like this would introduce.  It would be very easy to 
write these Proxy classes that extend IndexSchema, FieldType, and Analyzer but 
it would be just as easy to forget to add the appropriate Proxy methods to them 
down the road when new methods are added to those base classes.

The issue with the FieldType also exposed a flaw in the idea of using 
ThreadLocal: if we only had to worry about IndexSchema.getQueryAnalyzer(), we 
could modify it to check ThreadLocal easily enough, but at the FieldType level 
we would only be able to modify FieldTypes that ship with Solr, and we'd be 
missing any plugin FieldTypes.


So I aborted the experiment, but I figured I should post the feature idea, and 
my existing thoughts, here in case anyone had other suggestions on how it could 
be implemented feasibly.
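
For reference, a rough sketch of where the proxy experiment was headed (class 
name hypothetical) -- note that even recording the raw input means buffering 
the Reader first:

{code}
// imports: java.io.*, java.util.*, java.util.concurrent.*,
//          org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.TokenStream
/** Proxy that records every string handed to the wrapped query Analyzer, per field. */
public class RecordingAnalyzer extends Analyzer {
  private final Analyzer delegate;
  private final ConcurrentHashMap<String,List<String>> analyzed =
      new ConcurrentHashMap<String,List<String>>();

  public RecordingAnalyzer(Analyzer delegate) {
    this.delegate = delegate;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    try {
      // buffer the Reader so the raw input can be recorded and still analyzed
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[1024];
      for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
        sb.append(buf, 0, n);
      }
      String text = sb.toString();
      List<String> inputs = analyzed.get(fieldName);
      if (inputs == null) {
        analyzed.putIfAbsent(fieldName, new CopyOnWriteArrayList<String>());
        inputs = analyzed.get(fieldName);
      }
      inputs.add(text);
      return delegate.tokenStream(fieldName, new StringReader(text));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  /** field name -> strings analyzed against it, for the debug section */
  public Map<String,List<String>> getAnalyzed() {
    return analyzed;
  }
}
{code}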

 debug output should include explanation of what input strings were passed to 
 the analyzers for each field
 -

 Key: SOLR-1749
 URL: https://issues.apache.org/jira/browse/SOLR-1749
 Project: Solr
  Issue Type: Wish
  Components: search
Reporter: Hoss Man

 Users are frequently confused by the interplay between Query Parsing and 
 Query Time Analysis (ie: markup meta-characters like whitespace and quotes, 
 multi-word synonyms, Shingles, etc...)  It would be nice if we had more 
 debugging output available that would help eliminate this confusion.  The 
 ideal API that comes to mind would be to include in the debug output of 
 SearchHandler a list of every string that was Analyzed, and what list of 
 field names it was analyzed against.  
 This info would not only make it clear to users what exactly they should 
 cut/paste into the analysis.jsp tool to see how their Analyzer is getting 
 used, but also what exactly is being done to their input strings prior to 
 their Analyzer being used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread shyjuThomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828915#action_12828915
 ] 

shyjuThomas commented on SOLR-1301:
---

I have a need to perform Solr indexing in a MapReduce task, to achieve 
parallelism. I have noticed 2 JIRA issues related to that: SOLR-1045 and 
SOLR-1301. 

I have tried out the patches available with both issues, and my observations 
are given below:
1. The SOLR-1301 patch performs input-record to key-value conversion in the Map 
phase; the Hadoop (key, value) to SolrInputDocument conversion and the actual 
indexing happen in the Reduce phase.
Meanwhile, the SOLR-1045 patch performs the record-to-doc conversion and the actual 
indexing in the Map phase; the user can make use of the Reducer to perform merging 
of multiple indices (if required). Alternatively, we can configure the number 
of reducers to be the same as the number of shards. 
2. The SOLR-1301 patch doesn't support merging of indices, while the SOLR-1045 
patch does.
3. As per the SOLR-1301 patch, no big activity happens in the Map phase (only 
input-record to key-value conversion). Most of the heavy work (esp. the 
indexing) happens in the Reduce phase. If we need the final output as a 
single index, we can use only one reducer, which means a bottleneck at the 
Reducer and almost the whole operation happens non-parallelly. 
   But the case is different with the SOLR-1045 patch. It 
achieves better parallelism when the number of map tasks is greater than the 
number of reduce tasks, which is usually the case.

Based on these observations, I have a few questions. (I am a beginner to the 
Hadoop and Solr world. So, please forgive me if my questions are silly):
1. As per the above observations, the SOLR-1045 patch is functionally better 
(performance I have not verified yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both the JIRA issues are trying to solve the same problem, do we really 
need 2 separate issues?

NOTE: I felt this JIRA issue is more active than SOLR-1045. That's why I posted 
my comment here.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue; you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You 

indexing a csv file with a multivalued field

2010-02-02 Thread Seffie Schwartz
I am not having luck doing this.  Even though I am specifying -F 
fieldname.separator='|', the fields are 
stored as one field, not as multiple fields.  If I specify -F 
f.fieldname.separator='|', I get a null pointer exception.




[jira] Commented: (SOLR-1045) Build Solr index using Hadoop MapReduce

2010-02-02 Thread Kevin Peterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828962#action_12828962
 ] 

Kevin Peterson commented on SOLR-1045:
--

Can anyone using this code comment on how this relates to SOLR-1301?

https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828915#action_12828915

These seem to have identical goals but very different approaches.

 Build Solr index using Hadoop MapReduce
 ---

 Key: SOLR-1045
 URL: https://issues.apache.org/jira/browse/SOLR-1045
 Project: Solr
  Issue Type: New Feature
Reporter: Ning Li
 Fix For: 1.5

 Attachments: SOLR-1045.0.patch


 The goal is a contrib module that builds Solr index using Hadoop MapReduce.
 It is different from the Solr support in Nutch. The Solr support in Nutch 
 sends a document to a Solr server in a reduce task. Here, the goal is to 
 build/update Solr index within map/reduce tasks. Also, it achieves better 
 parallelism when the number of map tasks is greater than the number of reduce 
 tasks, which is usually the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828961#action_12828961
 ] 

Ted Dunning commented on SOLR-1301:
---

{quote}
Based on these observations, I have a few questions. (I am a beginner to the 
Hadoop and Solr world. So, please forgive me if my questions are silly):
1. As per the above observations, the SOLR-1045 patch is functionally better 
(performance I have not verified yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both the JIRA issues are trying to solve the same problem, do we really 
need 2 separate issues?
{quote}

In the katta community, the recommended practice started with SOLR-1045 (what I 
call map-side indexing) behavior, but I think that the consensus now is that 
SOLR-1301 behavior (what I call reduce side indexing) is much, much better.  
This is not necessarily the obvious result given your observations.  There are 
some operational differences between katta and SOLR that might make the 
conclusions different, but what I have observed is the following:

a) index merging is a really bad idea that seems very attractive to begin with, 
but it is actually pretty expensive and doesn't solve the real problems of 
bad document distribution across shards.  It is much better to simply have lots 
of shards per machine (aka micro-sharding) and use one reducer per shard.  For 
large indexes, this gives entirely acceptable performance.  On a pretty small 
cluster, we can index 50-100 million large documents in multiple ways in 2-3 
hours.  Index merging gives you no benefit compared to reduce side indexing and 
just increases code complexity.

b) map-side indexing leaves you with indexes that are heavily skewed by being 
composed of documents from a single input split.  At retrieval time, this 
means that different shards have very different term frequency profiles and 
very different numbers of relevant documents.  This makes lots of statistics 
very difficult including term frequency computation, term weighting and 
determining the number of documents to retrieve.  Map-side merge virtually 
guarantees that you have to do two cluster queries, one to gather term 
frequency statistics and another to do the actual query.  With reduce side 
indexing, you can provide strong probabilistic bounds on how different the 
statistics in each shard can be, so you can use local term statistics and you 
can depend on the score distribution being the same, which radically decreases 
the number of documents you need to retrieve from each shard.

c) reduce-side indexing improves the balance of computation during retrieval.  
If (as is the rule) some document subset is hotter than another document subset 
due, say, to data-source boosting or recency boosting, you will have very bad 
cluster utilization with skewed shards from map-side indexing while all shards 
will cost about the same for any query leading to good cluster utilization and 
faster queries with reduce-side indexing.

d) reduce-side indexing has properties that can be mathematically stated 
and proved.  Map-side indexing only has comparable properties if you make 
unrealistic assumptions about your original data.

e) micro-sharding allows very simple and very effective use of multiple cores 
on multiple machines in a search cluster.  This can be very difficult to do 
with large shards or a single index.

Now, as you say, these advantages may evaporate if you are looking to produce a 
single output index.  That seems, however, to contradict the whole point of 
scaling.   If you need to scale indexing, presumably you also need to scale 
search speed and throughput.  As such you probably want to have many shards 
rather than few.  Conversely, if you can stand to search a single index, then 
you probably can stand to index on a single machine. 

Another thing to think about is the fact that Solr doesn't yet do micro-sharding or 
clustering very well and, in particular, doesn't handle multiple shards per 
core.  That will be changing before long, however, and it is very dangerous to 
design for the past rather than the future.

In case you didn't notice, I strongly suggest you stick with reduce-side 
indexing.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop)