[jira] Created: (SOLR-1744) Streams retrieved from ContentStream#getStream are not always closed

2010-02-02 Thread Mark Miller (JIRA)
Streams retrieved from ContentStream#getStream are not always closed
---

 Key: SOLR-1744
 URL: https://issues.apache.org/jira/browse/SOLR-1744
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5


It doesn't look like BinaryUpdateRequestHandler or CommonsHttpSolrServer close 
these streams.
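
As a general pattern, something like the following should be enough (a minimal 
sketch with hypothetical variable names, not the committed patch):

{code}
InputStream in = null;
try {
  in = contentStream.getStream();
  // ... consume the stream ...
} finally {
  if (in != null) {
    in.close();
  }
}
{code}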

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1745) MoreLikeThisHandler gets a Reader from a ContentStream and doesn't close it

2010-02-02 Thread Mark Miller (JIRA)
MoreLikeThisHandler gets a Reader from a ContentStream and doesn't close it
---

 Key: SOLR-1745
 URL: https://issues.apache.org/jira/browse/SOLR-1745
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1746) CommonsHttpSolrServer passes a ContentStream reader to IOUtils.copy, but doesn't close it.

2010-02-02 Thread Mark Miller (JIRA)
CommonsHttpSolrServer passes a ContentStream reader to IOUtils.copy, but doesn't 
close it.
-

 Key: SOLR-1746
 URL: https://issues.apache.org/jira/browse/SOLR-1746
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5


IOUtils.copy will not close your reader for you:

{code}
@Override
protected void sendData(OutputStream out) throws IOException {
  IOUtils.copy(c.getReader(), out);
}
{code}
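
One possible fix is simply to close the Reader once the copy completes -- a 
minimal sketch, not the committed patch:

{code}
@Override
protected void sendData(OutputStream out) throws IOException {
  Reader reader = c.getReader();
  try {
    IOUtils.copy(reader, out);
  } finally {
    reader.close();
  }
}
{code}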

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1747) DumpRequestHandler doesn't close Stream

2010-02-02 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-1747:
--

Affects Version/s: 1.4
Fix Version/s: 1.5

 DumpRequestHandler doesn't close Stream
 ---

 Key: SOLR-1747
 URL: https://issues.apache.org/jira/browse/SOLR-1747
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Priority: Minor
 Fix For: 1.5


 {code}
 stream.add( "stream", IOUtils.toString( content.getStream() ) );
 {code}
 IOUtils.toString won't close the stream for you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1747) DumpRequestHandler doesn't close Stream

2010-02-02 Thread Mark Miller (JIRA)
DumpRequestHandler doesn't close Stream
---

 Key: SOLR-1747
 URL: https://issues.apache.org/jira/browse/SOLR-1747
 Project: Solr
  Issue Type: Bug
Reporter: Mark Miller
Priority: Minor


{code}
stream.add( "stream", IOUtils.toString( content.getStream() ) );
{code}

IOUtils.toString won't close the stream for you.
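
The fix pattern is the same as in the related issues -- a minimal sketch, not 
the committed patch:

{code}
InputStream in = content.getStream();
try {
  stream.add( "stream", IOUtils.toString( in ) );
} finally {
  in.close();
}
{code}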

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1748) RawResponseWriter doesn't close Reader

2010-02-02 Thread Mark Miller (JIRA)
RawResponseWriter doesn't close Reader
--

 Key: SOLR-1748
 URL: https://issues.apache.org/jira/browse/SOLR-1748
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
 Fix For: 1.5


{code}
 IOUtils.copy( content.getReader(), writer );
{code}
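
Same pattern again -- a minimal sketch, not the committed patch:

{code}
Reader reader = content.getReader();
try {
  IOUtils.copy( reader, writer );
} finally {
  IOUtils.closeQuietly( reader );
}
{code}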

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated SOLR-1301:
---

Attachment: SOLR-1301.patch

I added the following to the SRW.close method's finally clause:

{code}
FileUtils.forceDelete(new File(temp.toString()));
{code}
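
For context, this is roughly how that call sits in SolrRecordWriter.close (an 
abbreviated sketch, not the patch itself; the {{solr}} field name is assumed, 
and close(Reporter) is the old mapred RecordWriter signature):

{code}
// imports: java.io.*, org.apache.commons.io.FileUtils,
//          org.apache.hadoop.mapred.Reporter, org.apache.solr.client.solrj.*
public void close(Reporter reporter) throws IOException {
  try {
    solr.commit();
    solr.optimize();
  } catch (SolrServerException e) {
    throw new IOException(e.toString());
  } finally {
    // clean up the temporary local solr.home used by the embedded server
    FileUtils.forceDelete(new File(temp.toString()));
  }
}
{code}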

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue; you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
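
 To illustrate the converter hook described above, here is a hedged sketch of a 
 trivial SolrDocumentConverter for text key/value pairs (the exact class shape 
 is assumed from this description, not copied from the patch):

 {code}
// imports: java.util.*, org.apache.hadoop.io.Text,
//          org.apache.solr.common.SolrInputDocument
public class TextDocumentConverter extends SolrDocumentConverter<Text, Text> {
  @Override
  public Collection<SolrInputDocument> convert(Text key, Text value) {
    // one Hadoop (key, value) pair becomes one SolrInputDocument
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    doc.addField("text", value.toString());
    return Collections.singletonList(doc);
  }
}
 {code}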

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828835#action_12828835
 ] 

Hoss Man commented on SOLR-1677:


bq. I guess I could care less what the default is, if you care about such 
things you shouldn't be using the defaults and instead specifying this yourself 
in the schema, and Version has no effect.

...which is all well and good, but it just reiterates the need for really good 
documentation about what is impacted by changing a global Version setting -- 
otherwise users might be depending on a default behavior that is going to 
change when Version is bumped, and they may not even realize it.

Bear in mind: these are just the nuances that people need to worry about when 
considering a switch from 2.4 to 2.9 to 3.0 ... there will likely be a lot more 
of these over time.

And just to be as crystal clear as I possibly can:
* my concern is purely about how to document this stuff.
* I do in fact agree that a global luceneMatchVersion option is a good idea

 Add support for o.a.lucene.util.Version for BaseTokenizerFactory and 
 BaseTokenFilterFactory
 ---

 Key: SOLR-1677
 URL: https://issues.apache.org/jira/browse/SOLR-1677
 Project: Solr
  Issue Type: Sub-task
  Components: Schema and Analysis
Reporter: Uwe Schindler
 Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, 
 SOLR-1677.patch


 Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards 
 compatibility with old indexes created using older versions of Lucene. The 
 most important example is StandardTokenizer, which changed its behaviour with 
 posIncr and incorrect host token types in 2.4 and also in 2.9.
 In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with 
 much more Unicode support, almost every Tokenizer/TokenFilter needs this 
 Version parameter. In 2.9, the deprecated old ctors without Version take 
 LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
 This patch adds basic support for the Lucene Version property to the base 
 factories. Subclasses then can use the luceneMatchVersion decoded enum (in 
 3.0) / Parameter (in 2.9) for constructing Tokenstreams. The code currently 
 contains a helper map to decode the version strings, but in 3.0 it can be 
 replaced by Version.valueOf(String), as Version is a Java 5 enum. The default 
 value is Version.LUCENE_24 (as this is the default for the 
 no-version ctors in Lucene).
 This patch also removes unneeded conversions to CharArraySet from 
 StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed 
 to match Lucene 3.0.
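
 A hedged sketch of the decoding step described above (helper name 
 hypothetical); once on 3.0 the helper map can simply delegate to 
 Version.valueOf(String):

 {code}
static Version parseLuceneMatchVersion(String s) {
  if (s == null) {
    // mirror the default of the deprecated no-version ctors in Lucene 2.9
    return Version.LUCENE_24;
  }
  return Version.valueOf(s.toUpperCase(Locale.ENGLISH)); // e.g. "LUCENE_29"
}
 {code}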

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1718) Carriage return should submit query admin form

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828842#action_12828842
 ] 

Hoss Man commented on SOLR-1718:


bq. Consider the JIRA interface we are using to comment on this issue. 

Sure, but that's an {{<input type="text" />}}, not a {{<textarea />}} ... the 
expected semantics are completely different.  With an {{<input type="text" />}} 
box the browser already takes care of submitting the form if you hit Enter (and 
FWIW: most browsers I know of also submit forms if you use Shift-Enter in a 
{{<textarea />}}).

It sounds like what you are really suggesting is that we change the 
/admin/index.jsp form to use an {{<input type="text" />}} instead of a 
{{<textarea />}} for the q param, and not that we add special (javascript) 
logic to the form to submit if someone presses Enter inside the existing 
{{<textarea />}} ... which I have a lot less objection to than going out of 
our way to violate standard form convention.

 Carriage return should submit query admin form
 --

 Key: SOLR-1718
 URL: https://issues.apache.org/jira/browse/SOLR-1718
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Affects Versions: 1.4
Reporter: David Smiley
Priority: Minor

 Hitting the carriage return on the keyboard should submit the search query on 
 the admin front screen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1729) Date Facet now override time parameter

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828846#action_12828846
 ] 

Hoss Man commented on SOLR-1729:


Peter: I think you may have misconstrued my comments -- they were not 
criticisms of your patch, they were a clarification of why the functionality 
you are proposing is important.

bq. Can you point me toward the class(es) where filter queries' date math lives

it's all handled internally by DateField, at which point it has no notion of 
the request -- I believe this is why Yonik suggested using a ThreadLocal 
variable to track a consistent NOW that any method anywhere in Solr could use 
(if set) for the current request ... then we just need something like SolrCore 
to set it on each request (or accept it as a param if specified)

bq. As filter queries are cached separately, can you think of any potential 
caching issues relating to filter queries?

The cache keys for things like that are the Query objects themselves, and at 
that point the DateMath strings (including NOW) have already been resolved 
into real time values, so that shouldn't be an issue.
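
For illustration, a minimal sketch of that ThreadLocal approach (class and 
method names hypothetical):

{code}
// imports: java.util.Date
public final class RequestNow {
  private static final ThreadLocal<Date> NOW = new ThreadLocal<Date>();

  /** Called once per request (e.g. by SolrCore), optionally from a request param. */
  public static void set(Date now) { NOW.set(now); }

  /** Falls back to the wall clock when no request-scoped NOW was set. */
  public static Date get() {
    Date d = NOW.get();
    return d != null ? d : new Date();
  }

  /** Should be called in a finally block when the request completes. */
  public static void clear() { NOW.remove(); }
}
{code}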


 Date Facet now override time parameter
 --

 Key: SOLR-1729
 URL: https://issues.apache.org/jira/browse/SOLR-1729
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
 Environment: Solr 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetParams.java, SimpleFacets.java


 This PATCH introduces a new query parameter that tells a (typically, but not 
 necessarily) remote server what time to use as 'NOW' when calculating date 
 facets for a query (and, for the moment, date facets *only*) - overriding the 
 default behaviour of using the local server's current time.
 This gets 'round a problem whereby an explicit time range is specified in a 
 query (e.g. timestamp:[then0 TO then1]), and date facets are required for the 
 given time range (in fact, any explicit time range). 
 Because DateMathParser performs all its calculations from 'NOW', remote 
 callers have to work out how long ago 'then0' and 'then1' are from 'now', and 
 use the relative-to-now values in the facet.date.xxx parameters. If a remote 
 server has a different opinion of NOW compared to the caller, the results 
 will be skewed (e.g. they are in a different time-zone, not time-synced etc.).
 This becomes particularly salient when performing distributed date faceting 
 (see SOLR-1709), where multiple shards may all be running with different 
 times, and the faceting needs to be aligned.
 The new parameter is called 'facet.date.now', and takes as a parameter a 
 (stringified) long that is the number of milliseconds from the epoch (1 Jan 
 1970 00:00) - i.e. the returned value from a System.currentTimeMillis() call. 
 This was chosen over a formatted date to delineate it from a 'searchable' 
 time and to avoid superfluous date parsing. This makes the value generally a 
 programmatically-set value, but as that is where the use-case is for this type 
 of parameter, this should be ok.
 NOTE: This parameter affects date facet timing only. If there are other areas 
 of a query that rely on 'NOW', these will not interpret this value. This is a 
 broader issue about setting a 'query-global' NOW that all parts of query 
 analysis can share.
 Source files affected:
 FacetParams.java   (holds the new constant FACET_DATE_NOW)
 SimpleFacets.java  getFacetDateCounts() NOW parameter modified
 This PATCH is mildly related to SOLR-1709 (Distributed Date Faceting), but as 
 it's a general change for date faceting, it was deemed deserving of its own 
 patch. I will be updating SOLR-1709 in due course to include the use of this 
 new parameter, after some rfc acceptance.
 A possible enhancement to this is to detect facet.date fields, look for and 
 match these fields in queries (if they exist), and potentially determine 
 automatically the required time skew, if any. There are a whole host of 
 reasons why this could be problematic to implement, so an explicit 
 facet.date.now parameter is the safest route.
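
 For example, with this patch applied a SolrJ client would pass its own clock 
 explicitly (a hedged sketch; the parameter value is just the caller's 
 System.currentTimeMillis()):

 {code}
SolrQuery q = new SolrQuery("timestamp:[2010-01-01T00:00:00Z TO 2010-02-01T00:00:00Z]");
q.setFacet(true);
q.set("facet.date", "timestamp");
q.set("facet.date.start", "2010-01-01T00:00:00Z");
q.set("facet.date.end", "2010-02-01T00:00:00Z");
q.set("facet.date.gap", "+1DAY");
// tell the remote server what NOW to use for date facet calculations
q.set("facet.date.now", String.valueOf(System.currentTimeMillis()));
 {code}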

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r899979 - /lucene/solr/trunk/example/solr/conf/solrconfig.xml

2010-02-02 Thread Chris Hostetter

:  So what/how should we document all of this?
...
:  I've got more info on this.

Mark: most of what you wrote is above my head, but since you fixed a 
grammar error in my updated example solrconfig.xml comment w/o making any 
content changes, I'm assuming you feel what I put there is sufficient.

Most of your comments feel like they should be raised over in Lucene-Java 
land, at a minimum in documentation (added to the AvailableLockFactories 
page perhaps) or possibly in some code changes (should we change the 
default LockFactory depending on Java version?)

I'll leave that up to you, since (as I mentioned) I didn't understand half 
of it.

:  Checking for OverlappingFileLockException *should* actually work when
:  using Java 1.6. Java 1.6 started using a *system wide* thread safe check
:  for this.
: 
:  Previous to Java 1.6, checks for this *were* limited to an instance of
:  FileChannel - the FileChannel maintained its own personal lock list. So
:  you have to use
:  the same Channel to even have any hope of seeing an
:  OverlappingFileLockException. Even then though, its not properly thread
:  safe. They did not sync across
:  checking if the lock exists and acquiring the lock - they separately
:  sync each action - leaving room to acquire the lock twice from two
:  different threads like I was seeing.
: 
:  Interestingly, Java 1.6 has a back compat mode you can turn on that
:  doesn't use the system wide lock list, and they have fixed this thread
:  safety issue in that impl - there is a sync across checking
:  and getting the lock so that it is properly thread safe - but not in
:  Java 1.4, 1.5.
: 
:  Looking at GCC - uh ... I don't think you want to use GCC - they don't
:  appear to use a lock list and check for this at all :)
: 
:  But the point is, this is fixable on Java 6 if we check for
:  OverlappingFileLockException - it *should* work across webapps, and it
:  is actually thread safe, unlike Java 1.4,1.5.
: 
:
: Another interesting fact:
: 
: On Windows, if you attempt to lock the same file with different channel
: instances pre Java 1.6 - the code will deadlock.
: 
: -- 
: - Mark
: 
: http://www.lucidimagination.com
: 
: 
: 
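
For reference, a minimal probe of the Java 6 check described above (a fragment; 
the lock file name is hypothetical and the java.io/java.nio.channels imports 
and IOException handling are assumed):

    RandomAccessFile raf = new RandomAccessFile(new File("index.lock"), "rw");
    FileChannel channel = raf.getChannel();
    try {
      FileLock lock = channel.tryLock();
      if (lock == null) {
        System.out.println("lock held by another process");
      } else {
        lock.release();
      }
    } catch (OverlappingFileLockException e) {
      // Java 1.6 keeps a JVM-wide lock table, so this is thrown even when the
      // lock was acquired through a different FileChannel (e.g. another webapp
      // in the same JVM); on 1.4/1.5 the check is per-channel only.
      System.out.println("lock already held somewhere in this JVM");
    } finally {
      raf.close();
    }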



-Hoss



Re: Problem with German Wordendings

2010-02-02 Thread Chris Hostetter

http://people.apache.org/~hossman/#solr-dev
Please Use solr-u...@lucene Not solr-...@lucene

Your question is better suited for the solr-u...@lucene mailing list ...
not the solr-...@lucene list.  solr-dev is for discussing development of
the internals of the Solr application ... it is *not* the appropriate
place to ask questions about how to use Solr (or write Solr plugins) 
when developing your own applications.  Please resend your message to
the solr-user mailing list, where you are likely to get more/better
responses since that list also has a larger number of subscribers.


: Date: Tue, 26 Jan 2010 17:13:51 +0100
: From: David Rühr d...@web-factory.de
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Problem with German Wordendings
: 
: Hi List.
: 
: We have made a suggest search and send this query with a facet.prefix
: kinderzim:
: 
: facet=on
: facet.prefix=kinderzim
: facet.mincount=1
: facet.field=content
: facet.limit=10
: fl=content
: omitHeader=true
: bf=log%28supplier_faktor%29
: version=1.2
: wt=json
: json.nl=map
: q=
: start=0
: rows=0
: 
: 
: Now we get:
: <lst name="content">
:   <int name="kinderzimm">7</int>
: </lst>
: 
: Solr doesn't return the endings of the output words. It should be kinderzimmer;
: the same with kindermode, where we get kindermod.
: We add the words in our protwords.txt and include them with this line in
: schema.xml.
: <filter class="solr.SnowballPorterFilterFactory" language="German"
:         protected="protwords.txt"/>
: 
: Can anybody help us?
: 
: 
: Thanks and sorry about my english.
: So Long , David
: 
: 
: 



-Hoss


[jira] Created: (SOLR-1749) debug output should include explanation of what input strings were passed to the analyzers for each field

2010-02-02 Thread Hoss Man (JIRA)
debug output should include explanation of what input strings were passed to 
the analyzers for each field
-

 Key: SOLR-1749
 URL: https://issues.apache.org/jira/browse/SOLR-1749
 Project: Solr
  Issue Type: Wish
  Components: search
Reporter: Hoss Man


Users are frequently confused by the interplay between Query Parsing and 
Query Time Analysis (ie: markup meta-characters like whitespace and quotes, 
multi-word synonyms, Shingles, etc...)  It would be nice if we had more 
debugging output available that would help eliminate this confusion.  The ideal 
API that comes to mind would be to include in the debug output of SearchHandler 
a list of every string that was Analyzed, and what list of field names it was 
analyzed against.  

This info would not only make it clear to users what exactly they should 
cut/paste into the analysis.jsp tool to see how their Analyzer is getting used, 
but also what exactly is being done to their input strings prior to their 
Analyzer being used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1749) debug output should include explanation of what input strings were passed to the analyzers for each field

2010-02-02 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1282#action_1282
 ] 

Hoss Man commented on SOLR-1749:


This is an idea that's been rolling around in my head for a while, and today I 
thought I'd spend some time experimenting with it.

It seemed like the main implementation challenge would be that by the time you 
are deep enough down in the code to be using an Analyzer, you don't have access 
to the SolrQueryRequest to record the debugging info.

I thought of two potential solutions...

 * Use ThreadLocal to track the debugging info if needed
 * Use Proxy Wrapper classes to record the debugging info if needed

I initially figured that writing proxy classes for SolrQueryRequest, 
IndexSchema, and Analyzer would be relatively straightforward, so I started 
down that path and discovered two annoying problems...

 # IndexSchema is currently final
 # not all code paths use IndexSchema.getQueryAnalyzer(), many fetch the 
FieldTypes and ask them for their Analyzer directly.

The second problem isn't insurmountable, but it complicates things in that it 
would require Proxy wrappers for FieldType as well.  The first problem requires 
a simple change, but carries with it some baggage that I wasn't ready to 
embrace.  In both cases I started to be very bothered by the long-term 
maintenance something like this would introduce.  It would be very easy to 
write these Proxy classes that extend IndexSchema, FieldType, and Analyzer but 
it would be just as easy to forget to add the appropriate Proxy methods to them 
down the road when new methods are added to those base classes.

The issue with the FieldType also exposed a flaw in the idea of using 
ThreadLocal: if we only had to worry about IndexSchema.getQueryAnalyzer(), we 
could modify it to check ThreadLocal easily enough, but at the FieldType level 
we would only be able to modify FieldTypes that ship with Solr, and we'd be 
missing any plugin FieldTypes.


So I aborted the experiment, but I figured I should post the feature idea, and 
my existing thoughts, here in case anyone had other suggestions on how it could 
be implemented feasibly.
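
For reference, a rough sketch of where the proxy experiment was headed (class 
name hypothetical) -- note that even recording the raw input means buffering 
the Reader first:

{code}
// imports: java.io.*, java.util.*, java.util.concurrent.*,
//          org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.TokenStream
/** Proxy that records every string handed to the wrapped query Analyzer, per field. */
public class RecordingAnalyzer extends Analyzer {
  private final Analyzer delegate;
  private final ConcurrentHashMap<String,List<String>> analyzed =
      new ConcurrentHashMap<String,List<String>>();

  public RecordingAnalyzer(Analyzer delegate) {
    this.delegate = delegate;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    try {
      // buffer the Reader so the raw input can be recorded and still analyzed
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[1024];
      for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
        sb.append(buf, 0, n);
      }
      String text = sb.toString();
      List<String> inputs = analyzed.get(fieldName);
      if (inputs == null) {
        analyzed.putIfAbsent(fieldName, new CopyOnWriteArrayList<String>());
        inputs = analyzed.get(fieldName);
      }
      inputs.add(text);
      return delegate.tokenStream(fieldName, new StringReader(text));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  /** field name -> strings analyzed against it, for the debug section */
  public Map<String,List<String>> getAnalyzed() {
    return analyzed;
  }
}
{code}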

 debug output should include explanation of what input strings were passed to 
 the analyzers for each field
 -

 Key: SOLR-1749
 URL: https://issues.apache.org/jira/browse/SOLR-1749
 Project: Solr
  Issue Type: Wish
  Components: search
Reporter: Hoss Man

 Users are frequently confused by the interplay between Query Parsing and 
 Query Time Analysis (ie: markup meta-characters like whitespace and quotes, 
 multi-word synonyms, Shingles, etc...)  It would be nice if we had more 
 debugging output available that would help eliminate this confusion.  The 
 ideal API that comes to mind would be to include in the debug output of 
 SearchHandler a list of every string that was Analyzed, and what list of 
 field names it was analyzed against.  
 This info would not only make it clear to users what exactly they should 
 cut/paste into the analysis.jsp tool to see how their Analyzer is getting 
 used, but also what exactly is being done to their input strings prior to 
 their Analyzer being used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread shyjuThomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828915#action_12828915
 ] 

shyjuThomas commented on SOLR-1301:
---

I have a need to perform Solr indexing in a MapReduce task, to achieve 
parallelism. I have noticed 2 JIRA issues related to that: SOLR-1045 and 
SOLR-1301. 

I have tried out the patches available with both issues, and my observations 
are given below:
1. The SOLR-1301 patch performs input-record to key-value conversion in the Map 
phase; the Hadoop (key, value) to SolrInputDocument conversion and the actual 
indexing happen in the Reduce phase.
Meanwhile, the SOLR-1045 patch performs the record-to-doc conversion and the actual 
indexing in the Map phase; the user can make use of the Reducer to perform merging 
of multiple indices (if required). Alternatively, we can configure the number 
of reducers to be the same as the number of shards. 
2. The SOLR-1301 patch doesn't support merging of indices, while the SOLR-1045 
patch does.
3. As per the SOLR-1301 patch, no big activity happens in the Map phase (only 
input-record to key-value conversion). Most of the heavy work (esp. the 
indexing) happens in the Reduce phase. If we need the final output as a 
single index, we can use only one reducer, which means a bottleneck at the 
Reducer and almost the whole operation happens non-parallelly. 
   But the case is different with the SOLR-1045 patch. It 
achieves better parallelism when the number of map tasks is greater than the 
number of reduce tasks, which is usually the case.

Based on these observations, I have a few questions. (I am a beginner to the 
Hadoop and Solr world. So, please forgive me if my questions are silly):
1. As per the above observations, the SOLR-1045 patch is functionally better 
(performance I have not verified yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both the JIRA issues are trying to solve the same problem, do we really 
need 2 separate issues?

NOTE: I felt this JIRA issue is more active than SOLR-1045. That's why I posted 
my comment here.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When a reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue; you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You 

indexing a csv file with a multivalued field

2010-02-02 Thread Seffie Schwartz
I am not having luck doing this.  Even though I am specifying -F 
fieldname.separator='|', the fields are 
stored as one field, not as multiple fields.  If I specify -F 
f.fieldname.separator='|', I get a null pointer exception.




[jira] Commented: (SOLR-1045) Build Solr index using Hadoop MapReduce

2010-02-02 Thread Kevin Peterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828962#action_12828962
 ] 

Kevin Peterson commented on SOLR-1045:
--

Can anyone using this code comment on how this relates to SOLR-1301?

https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828915#action_12828915

These seem to have identical goals but very different approaches.

 Build Solr index using Hadoop MapReduce
 ---

 Key: SOLR-1045
 URL: https://issues.apache.org/jira/browse/SOLR-1045
 Project: Solr
  Issue Type: New Feature
Reporter: Ning Li
 Fix For: 1.5

 Attachments: SOLR-1045.0.patch


 The goal is a contrib module that builds Solr index using Hadoop MapReduce.
 It is different from the Solr support in Nutch. The Solr support in Nutch 
 sends a document to a Solr server in a reduce task. Here, the goal is to 
 build/update Solr index within map/reduce tasks. Also, it achieves better 
 parallelism when the number of map tasks is greater than the number of reduce 
 tasks, which is usually the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828961#action_12828961
 ] 

Ted Dunning commented on SOLR-1301:
---

{quote}
Based on these observations, I have a few questions. (I am a beginner to the 
Hadoop and Solr world. So, please forgive me if my questions are silly):
1. As per the above observations, the SOLR-1045 patch is functionally better 
(performance I have not verified yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both the JIRA issues are trying to solve the same problem, do we really 
need 2 separate issues?
{quote}

In the katta community, the recommended practice started with SOLR-1045 (what I 
call map-side indexing) behavior, but I think that the consensus now is that 
SOLR-1301 behavior (what I call reduce side indexing) is much, much better.  
This is not necessarily the obvious result given your observations.  There are 
some operational differences between katta and SOLR that might make the 
conclusions different, but what I have observed is the following:

a) index merging is a really bad idea that seems very attractive to begin with, 
but it is actually pretty expensive and doesn't solve the real problems of 
bad document distribution across shards.  It is much better to simply have lots 
of shards per machine (aka micro-sharding) and use one reducer per shard.  For 
large indexes, this gives entirely acceptable performance.  On a pretty small 
cluster, we can index 50-100 million large documents in multiple ways in 2-3 
hours.  Index merging gives you no benefit compared to reduce side indexing and 
just increases code complexity.

b) map-side indexing leaves you with indexes that are heavily skewed by being 
composed of documents from a single input split.  At retrieval time, this 
means that different shards have very different term frequency profiles and 
very different numbers of relevant documents.  This makes lots of statistics 
very difficult including term frequency computation, term weighting and 
determining the number of documents to retrieve.  Map-side merge virtually 
guarantees that you have to do two cluster queries, one to gather term 
frequency statistics and another to do the actual query.  With reduce side 
indexing, you can provide strong probabilistic bounds on how different the 
statistics in each shard can be, so you can use local term statistics and you 
can depend on the score distribution being the same, which radically decreases 
the number of documents you need to retrieve from each shard.

c) reduce-side indexing improves the balance of computation during retrieval.  
If (as is the rule) some document subset is hotter than another document subset 
due, say, to data-source boosting or recency boosting, you will have very bad 
cluster utilization with skewed shards from map-side indexing while all shards 
will cost about the same for any query leading to good cluster utilization and 
faster queries with reduce-side indexing.

d) reduce-side indexing has properties that can be mathematically stated 
and proved.  Map-side indexing only has comparable properties if you make 
unrealistic assumptions about your original data.

e) micro-sharding allows very simple and very effective use of multiple cores 
on multiple machines in a search cluster.  This can be very difficult to do 
with large shards or a single index.

Now, as you say, these advantages may evaporate if you are looking to produce a 
single output index.  That seems, however, to contradict the whole point of 
scaling.   If you need to scale indexing, presumably you also need to scale 
search speed and throughput.  As such you probably want to have many shards 
rather than few.  Conversely, if you can stand to search a single index, then 
you probably can stand to index on a single machine. 

Another thing to think about is the fact that Solr doesn't yet do micro-sharding or 
clustering very well and, in particular, doesn't handle multiple shards per 
core.  That will be changing before long, however, and it is very dangerous to 
design for the past rather than the future.

In case you didn't notice, I strongly suggest you stick with reduce-side 
indexing.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop)