[jira] Created: (SOLR-1744) Streams retrieved from ContentStream#getStream are not always closed
Streams retrieved from ContentStream#getStream are not always closed
---

Key: SOLR-1744
URL: https://issues.apache.org/jira/browse/SOLR-1744
Project: Solr
Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Fix For: 1.5

Doesn't look like BinaryUpdateRequestHandler or CommonsHttpSolrServer close streams.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1745) MoreLikeThisHandler gets a Reader from a ContentStream and doesn't close it
MoreLikeThisHandler gets a Reader from a ContentStream and doesn't close it
---

Key: SOLR-1745
URL: https://issues.apache.org/jira/browse/SOLR-1745
Project: Solr
Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Fix For: 1.5
[jira] Created: (SOLR-1746) CommonsHttpSolrServer passes a ContentStream reader to IOUtils.copy, but doesn't close it.
CommonsHttpSolrServer passes a ContentStream reader to IOUtils.copy, but doesn't close it.
---

Key: SOLR-1746
URL: https://issues.apache.org/jira/browse/SOLR-1746
Project: Solr
Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Fix For: 1.5

IOUtils.copy will not close your reader for you:

{code}
@Override
protected void sendData(OutputStream out) throws IOException {
  IOUtils.copy(c.getReader(), out);
}
{code}
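As the report says, IOUtils.copy copies and deliberately leaves the reader open. A minimal sketch of the fix pattern, using only java.io rather than commons-io (the helper name copyAndClose is hypothetical, not from any patch):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class ClosingCopy {
    // Copy everything from reader to writer, then close the reader in a
    // finally block -- the step that IOUtils.copy() does not do for you.
    static void copyAndClose(Reader reader, Writer out) {
        try {
            try {
                char[] buf = new char[1024];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                reader.close(); // always release the underlying stream
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        copyAndClose(new StringReader("hello world"), out);
        System.out.println(out.toString()); // hello world
    }
}
```

The same pattern applies to the sibling issues above: whoever calls ContentStream#getStream or getReader owns the close.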
[jira] Updated: (SOLR-1747) DumpRequestHandler doesn't close Stream
[ https://issues.apache.org/jira/browse/SOLR-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-1747:
------------------------------

Affects Version/s: 1.4
Fix Version/s: 1.5

DumpRequestHandler doesn't close Stream
---

Key: SOLR-1747
URL: https://issues.apache.org/jira/browse/SOLR-1747
Project: Solr
Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Priority: Minor
Fix For: 1.5

{code}
stream.add( stream, IOUtils.toString( content.getStream() ) );
{code}

IOUtils.toString won't close the stream for you.
[jira] Created: (SOLR-1747) DumpRequestHandler doesn't close Stream
DumpRequestHandler doesn't close Stream
---

Key: SOLR-1747
URL: https://issues.apache.org/jira/browse/SOLR-1747
Project: Solr
Issue Type: Bug
Reporter: Mark Miller
Priority: Minor

{code}
stream.add( stream, IOUtils.toString( content.getStream() ) );
{code}

IOUtils.toString won't close the stream for you.
[jira] Created: (SOLR-1748) RawResponseWriter doesn't close Reader
RawResponseWriter doesn't close Reader
--

Key: SOLR-1748
URL: https://issues.apache.org/jira/browse/SOLR-1748
Project: Solr
Issue Type: Bug
Affects Versions: 1.4
Reporter: Mark Miller
Fix For: 1.5

{code}
IOUtils.copy( content.getReader(), writer );
{code}
[jira] Updated: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated SOLR-1301:
-----------------------------------

Attachment: SOLR-1301.patch

I added the following to the SRW.close method's finally clause:

{code}
FileUtils.forceDelete(new File(temp.toString()));
{code}

Solr + Hadoop
-------------

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki
Fix For: 1.5
Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS - SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files

Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
------

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks (in particular, 1 reduce task), in which case the output will consist of a single shard.

An example application is provided that processes large CSV files using this API. It uses custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
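The batching behavior described in the design (convert, buffer, periodically submit, final flush and commit on close) can be sketched with a simplified stand-in. All names below are illustrative, not the patch's actual classes, and the "submit" is merely counted rather than sent to an EmbeddedSolrServer:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingWriter {
    // Simplified stand-in for SolrRecordWriter's batching: documents are
    // buffered and "submitted" (here: counted) whenever the batch fills.
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    int submits = 0; // number of batches flushed to the (embedded) server

    BatchingWriter(int batchSize) {
        this.batchSize = batchSize;
    }

    void write(String doc) {
        batch.add(doc);                 // SolrDocumentConverter output would go here
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        if (!batch.isEmpty()) {
            submits++;                  // stands in for server.add(batch)
            batch.clear();
        }
    }

    void close() {
        flush();                        // final partial batch, then commit()/optimize()
    }

    public static void main(String[] args) {
        BatchingWriter w = new BatchingWriter(2);
        for (int i = 0; i < 5; i++) w.write("doc" + i);
        w.close();
        System.out.println(w.submits);  // 5 docs in batches of 2 -> 3 submits
    }
}
```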
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828835#action_12828835 ]

Hoss Man commented on SOLR-1677:
--------------------------------

bq. I guess I could care less what the default is, if you care about such things you shouldn't be using the defaults and instead specifying this yourself in the schema, and Version has no effect.

...which is all well and good, but it just re-iterates the need for really good documentation about what is impacted by changing a global Version setting -- otherwise users might be depending on a default behavior that is going to change when Version is bumped, and they may not even realize it.

Bear in mind: these are just the nuances that people need to worry about when considering a switch from 2.4 to 2.9 to 3.0 ... there will likely be a lot more of these over time.

And just to be as crystal clear as I possibly can:
* my concern is purely about how to document this stuff
* I do in fact agree that a global luceneVersionMatch option is a good idea

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9. In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.
This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the luceneMatchVersion decoded enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a subclass of Java 5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene). This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
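As a rough illustration of the decoding described above, here is a self-contained sketch with a hypothetical three-value stand-in for o.a.lucene.util.Version (the real enum has more constants), falling back to LUCENE_24 when no version string is configured:

```java
public class VersionDecode {
    // Hypothetical stand-in for o.a.lucene.util.Version; in Lucene 3.0 the
    // real enum's valueOf(String) makes a hand-written helper map unnecessary.
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Decode a luceneMatchVersion string from the schema, defaulting to
    // LUCENE_24 -- the default the no-version ctors used in Lucene 2.9.
    static Version decode(String s) {
        if (s == null || s.trim().isEmpty()) {
            return Version.LUCENE_24;
        }
        return Version.valueOf(s.trim().toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(decode("LUCENE_29")); // LUCENE_29
        System.out.println(decode(null));        // LUCENE_24
    }
}
```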
[jira] Commented: (SOLR-1718) Carriage return should submit query admin form
[ https://issues.apache.org/jira/browse/SOLR-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828842#action_12828842 ]

Hoss Man commented on SOLR-1718:
--------------------------------

bq. Consider the JIRA interface we are using to comment on this issue.

Sure, but that's an {{<input type="text"/>}}, not a {{<textarea/>}} ... the expected semantics are completely different. With an {{<input type="text"/>}} box the browser already takes care of submitting the form if you hit Enter (and FWIW: most browsers I know of also submit forms if you use Shift-Enter in a {{<textarea/>}}).

It sounds like what you are really suggesting is that we change the /admin/index.jsp form to use an {{<input type="text"/>}} instead of a {{<textarea/>}} for the q param, and not that we add special (javascript) logic to the form to submit if someone presses Enter inside the existing {{<textarea/>}} ... which I have a lot less objection to than going out of our way to violate standard form convention.

Carriage return should submit query admin form
--

Key: SOLR-1718
URL: https://issues.apache.org/jira/browse/SOLR-1718
Project: Solr
Issue Type: Improvement
Components: web gui
Affects Versions: 1.4
Reporter: David Smiley
Priority: Minor

Hitting the carriage return on the keyboard should submit the search query on the admin front screen.
[jira] Commented: (SOLR-1729) Date Facet now override time parameter
[ https://issues.apache.org/jira/browse/SOLR-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828846#action_12828846 ]

Hoss Man commented on SOLR-1729:
--------------------------------

Peter: I think you may have misconstrued my comments -- they were not criticisms of your patch, they were a clarification of why the functionality you are proposing is important.

bq. Can you point me toward the class(es) where filter queries' date math lives

It's all handled internally by DateField, at which point it has no notion of the request -- I believe this is why Yonik suggested using a ThreadLocal variable to track a consistent NOW that any method anywhere in Solr could use (if set) for the current request ... then we just need something like SolrCore to set it on each request (or accept it as a param if specified).

bq. As filter queries are cached separately, can you think of any potential caching issues relating to filter queries?

The cache keys for things like that are the Query objects themselves, and at that point the DateMath strings (including NOW) have already been resolved into real time values, so that shouldn't be an issue.

Date Facet now override time parameter
--

Key: SOLR-1729
URL: https://issues.apache.org/jira/browse/SOLR-1729
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.4
Environment: Solr 1.4
Reporter: Peter Sturge
Priority: Minor
Attachments: FacetParams.java, SimpleFacets.java

This PATCH introduces a new query parameter that tells a (typically, but not necessarily) remote server what time to use as 'NOW' when calculating date facets for a query (and, for the moment, date facets *only*) - overriding the default behaviour of using the local server's current time. This gets 'round a problem whereby an explicit time range is specified in a query (e.g. timestamp:[then0 TO then1]), and date facets are required for the given time range (in fact, any explicit time range).
Because DateMathParser performs all its calculations from 'NOW', remote callers have to work out how long ago 'then0' and 'then1' are from 'now', and use the relative-to-now values in the facet.date.xxx parameters. If a remote server has a different opinion of NOW compared to the caller, the results will be skewed (e.g. they are in a different time-zone, not time-synced, etc.). This becomes particularly salient when performing distributed date faceting (see SOLR-1709), where multiple shards may all be running with different times, and the faceting needs to be aligned.

The new parameter is called 'facet.date.now', and takes as a parameter a (stringified) long that is the number of milliseconds from the epoch (1 Jan 1970 00:00) - i.e. the returned value from a System.currentTimeMillis() call. This was chosen over a formatted date to delineate it from a 'searchable' time and to avoid superfluous date parsing. This makes the value generally a programmatically-set value, but as that is where the use-case is for this type of parameter, this should be ok.

NOTE: This parameter affects date facet timing only. If there are other areas of a query that rely on 'NOW', these will not interpret this value. This is a broader issue about setting a 'query-global' NOW that all parts of query analysis can share.

Source files affected:
FacetParams.java (holds the new constant FACET_DATE_NOW)
SimpleFacets.java (getFacetDateCounts() NOW parameter modified)

This PATCH is mildly related to SOLR-1709 (Distributed Date Faceting), but as it's a general change for date faceting, it was deemed deserving of its own patch. I will be updating SOLR-1709 in due course to include the use of this new parameter, after some RFC acceptance.

A possible enhancement to this is to detect facet.date fields, look for and match these fields in queries (if they exist), and potentially determine automatically the required time skew, if any.
There are a whole host of reasons why this could be problematic to implement, so an explicit facet.date.now parameter is the safest route. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
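A hedged sketch of how the parameter might be resolved inside the date-facet code. The method name resolveNow is hypothetical; only the parameter semantics (epoch milliseconds, defaulting to the local server clock) come from the description above:

```java
public class FacetDateNow {
    // Resolve "NOW" for date faceting: use the client-supplied
    // facet.date.now value (milliseconds since the epoch, as produced by
    // System.currentTimeMillis() on the caller) when present, otherwise
    // fall back to this server's clock -- the default behaviour.
    static long resolveNow(String facetDateNowParam) {
        if (facetDateNowParam == null || facetDateNowParam.isEmpty()) {
            return System.currentTimeMillis(); // default: local server time
        }
        return Long.parseLong(facetDateNowParam); // caller-aligned NOW
    }

    public static void main(String[] args) {
        System.out.println(resolveNow("1265000000000")); // 1265000000000
    }
}
```

With this in place, all shards in a distributed request can facet against the same NOW regardless of their local clocks.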
Re: svn commit: r899979 - /lucene/solr/trunk/example/solr/conf/solrconfig.xml
: So what/how should we document all of this? ...
: I've got more info on this.

Mark: most of what you wrote is above my head, but since you fixed a grammar error in my updated example solrconfig.xml comment w/o making any content changes, I'm assuming you feel what I put there is sufficient. Most of your comments feel like they should be raised over in Lucene-Java land, at a minimum in documentation (added to the AvailableLockFactories page perhaps) or possibly in some code changes (should we change the default LockFactory depending on Java version?). I'll leave that up to you, since (as I mentioned) I didn't understand half of it.

: Checking for OverlappingFileLockException *should* actually work when
: using Java 1.6. Java 1.6 started using a *system wide* thread safe check
: for this.
:
: Previous to Java 1.6, checks for this *were* limited to an instance of
: FileChannel - the FileChannel maintained its own personal lock list. So
: you have to use the same Channel to even have any hope of seeing an
: OverlappingFileLockException. Even then though, it's not properly thread
: safe. They did not sync across checking if the lock exists and acquiring
: the lock - they separately sync each action - leaving room to acquire the
: lock twice from two different threads like I was seeing.
:
: Interestingly, Java 1.6 has a back compat mode you can turn on that
: doesn't use the system wide lock list, and they have fixed this thread
: safety issue in that impl - there is a sync across checking and getting
: the lock so that it is properly thread safe - but not in Java 1.4, 1.5.
:
: Looking at GCC - uh ... I don't think you want to use GCC - they don't
: appear to use a lock list and check for this at all :)
:
: But the point is, this is fixable on Java 6 if we check for
: OverlappingFileLockException - it *should* work across webapps, and it
: is actually thread safe, unlike Java 1.4, 1.5.
: : : Another interesting fact: : : On Windows, if you attempt to lock the same file with different channel : instances pre Java 1.6 - the code will deadlock. : : -- : - Mark : : http://www.lucidimagination.com : : : -Hoss
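The lock-checking behavior under discussion can be probed with a small self-contained sketch (the helper below is illustrative, not Solr/Lucene code). Per the thread, the OverlappingFileLockException path is only a dependable same-JVM check on Java 1.6+, where the lock list is JVM-wide rather than per-FileChannel:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

public class LockProbe {
    // Try to acquire (and immediately release) an exclusive lock on a file;
    // returns true on success, false if the lock is held elsewhere.
    static boolean tryLock(File f) {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            FileLock lock;
            try {
                lock = raf.getChannel().tryLock();
            } catch (OverlappingFileLockException e) {
                return false; // held elsewhere in this JVM (reliable on 1.6+)
            }
            if (lock == null) {
                return false; // held by another process
            }
            lock.release();
            return true;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience: probe a fresh temp file, which nothing else should hold.
    static boolean probeTempFile() {
        try {
            File f = File.createTempFile("lockprobe", ".lock");
            boolean ok = tryLock(f);
            f.delete();
            return ok;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeTempFile()); // true: nothing else holds it
    }
}
```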
Re: Problem with German Wordendings
http://people.apache.org/~hossman/#solr-dev Please Use solr-u...@lucene Not solr-...@lucene Your question is better suited for the solr-u...@lucene mailing list ... not the solr-...@lucene list. solr-dev is for discussing development of the internals of the Solr application ... it is *not* the appropriate place to ask questions about how to use Solr (or write Solr plugins) when developing your own applications. Please resend your message to the solr-user mailing list, where you are likely to get more/better responses since that list also has a larger number of subscribers. : Date: Tue, 26 Jan 2010 17:13:51 +0100 : From: David Rühr d...@web-factory.de : Reply-To: solr-dev@lucene.apache.org : To: solr-dev@lucene.apache.org : Subject: Problem with German Wordendings : : Hi List. : : We have made a suggest search and send this query with a facet.prefix : kinderzim: : : facet=on : facet.prefix=kinderzim : facet.mincount=1 : facet.field=content : facet.limit=10 : fl=content : omitHeader=true : bf=log%28supplier_faktor%29 : version=1.2 : wt=json : json.nl=map : q= : start=0 : rows=0 : : : Now we get: : lst name=content : int name=kinderzimm7/int : /lst : : SolR doesn't return the endings of the output Words. It must be kinderzimmer : same with kindermode, we get kindermod. : We add the words in our protwords.txt and include them with this line in : schema.xml. : filter class=solr.SnowballPorterFilterFactory language=German : protected=protwords.txt/ : : Can anybody help us? : : : Thanks and sorry about my english. : So Long , David : : : -Hoss
[jira] Created: (SOLR-1749) debug output should include explanation of what input strings were passed to the analyzers for each field
debug output should include explanation of what input strings were passed to the analyzers for each field
--

Key: SOLR-1749
URL: https://issues.apache.org/jira/browse/SOLR-1749
Project: Solr
Issue Type: Wish
Components: search
Reporter: Hoss Man

Users are frequently confused by the interplay between Query Parsing and Query Time Analysis (i.e. markup meta-characters like whitespace and quotes, multi-word synonyms, Shingles, etc...). It would be nice if we had more debugging output available that would help eliminate this confusion. The ideal API that comes to mind would be to include in the debug output of SearchHandler a list of every string that was analyzed, and the list of field names it was analyzed against. This info would not only make it clear to users what exactly they should cut/paste into the analysis.jsp tool to see how their Analyzer is getting used, but also what exactly is being done to their input strings prior to their Analyzer being used.
[jira] Commented: (SOLR-1749) debug output should include explanation of what input strings were passed to the analyzers for each field
[ https://issues.apache.org/jira/browse/SOLR-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1282#action_1282 ]

Hoss Man commented on SOLR-1749:
--------------------------------

This is an idea that's been rolling around in my head for a while, and today I thought I'd spend some time experimenting with it. It seemed like the main implementation challenge would be that by the time you are deep enough down in the code to be using an Analyzer, you don't have access to the SolrQueryRequest to record the debugging info. I thought of two potential solutions...

* Use ThreadLocal to track the debugging info if needed
* Use Proxy Wrapper classes to record the debugging info if needed

I initially figured that writing proxy classes for SolrQueryRequest, IndexSchema, and Analyzer would be relatively straightforward, so I started down that path and discovered two annoying problems...

1. IndexSchema is currently final
2. not all code paths use IndexSchema.getQueryAnalyzer(); many fetch the FieldTypes and ask them for their Analyzer directly

The second problem isn't insurmountable, but it complicates things in that it would require Proxy wrappers for FieldType as well. The first problem requires a simple change, but carries with it some baggage that I wasn't ready to embrace. In both cases I started to be very bothered by the long-term maintenance something like this would introduce. It would be very easy to write these Proxy classes that extend IndexSchema, FieldType, and Analyzer, but it would be just as easy to forget to add the appropriate Proxy methods to them down the road when new methods are added to those base classes.
The issue with the FieldType also exposed a flaw in the idea of using ThreadLocal: if we only had to worry about IndexSchema.getQueryAnalyzer(), we could modify it to check ThreadLocal easily enough, but at the FieldType level we would only be able to modify FieldTypes that ship with Solr, and we'd be missing any plugin FieldTypes. So I aborted the experiment, but I figured I should post the feature idea, and my existing thoughts, here in case anyone had other suggestions on how it could be implemented feasibly.
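For what it's worth, the ThreadLocal half of the idea might look roughly like this (an entirely hypothetical sketch, not from any patch): the request handler installs a per-thread list, analysis code records into it if one is present, and the handler drains it into the debug output at the end of the request:

```java
import java.util.ArrayList;
import java.util.List;

public class AnalysisDebug {
    // Hypothetical sketch of the ThreadLocal idea: code deep inside query
    // parsing records (fieldName, analyzedString) pairs without needing a
    // reference to the SolrQueryRequest; the handler installs/drains the list.
    private static final ThreadLocal<List<String>> RECORDS = new ThreadLocal<>();

    static void startRequest() {
        RECORDS.set(new ArrayList<>());    // debugQuery=true would trigger this
    }

    static void record(String field, String input) {
        List<String> r = RECORDS.get();
        if (r != null) {                   // only record when debugging is on
            r.add(field + ":" + input);
        }
    }

    static List<String> endRequest() {
        List<String> r = RECORDS.get();
        RECORDS.remove();                  // avoid leaking across pooled threads
        return r;
    }

    public static void main(String[] args) {
        startRequest();
        record("title", "multi word synonym");
        record("body", "multi word synonym");
        System.out.println(endRequest().size()); // 2
    }
}
```

As noted above, this still would not capture plugin FieldTypes unless they called record() themselves, which is exactly the weakness that stalled the experiment.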
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828915#action_12828915 ]

shyjuThomas commented on SOLR-1301:
-----------------------------------

I have a need to perform Solr indexing in a MapReduce task, to achieve parallelism. I have noticed 2 Jira issues related to that: SOLR-1045 and SOLR-1301. I have tried out the patches available with both the issues, and my observations are given below:

1. The SOLR-1301 patch performs input-record to key-value conversion in the Map phase; the Hadoop (key, value) to SolrInputDocument conversion and the actual indexing happen in the Reduce phase. Meanwhile, the SOLR-1045 patch performs the record-to-Doc conversion and the actual indexing in the Map phase; the user can make use of the Reducer to perform merging of multiple indices (if required). Alternatively, we can configure the number of reducers to be the same as the number of shards.
2. The SOLR-1301 patch doesn't support merging of the indices, while the SOLR-1045 patch does.
3. As per the SOLR-1301 patch, no big activity happens in the Map phase (only input-record to key-value conversion). Most of the heavy jobs (esp. the indexing) happen in the Reduce phase. If we need the final output as a single index, we can use only one reducer, which means a bottleneck at the Reducer; almost the whole operation happens non-parallelly. But the case is different with the SOLR-1045 patch. It achieves better parallelism when the number of map tasks is greater than the number of reduce tasks, which is usually the case.

Based on these observations, I have a few questions. (I am a beginner to the Hadoop/Solr world. So, please forgive me if my questions are silly):

1. As per the above observations, the SOLR-1045 patch is functionally better (performance I have not verified yet). Can anyone tell me what actual advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both the Jira issues are trying to solve the same problem, do we really need 2 separate issues?
NOTE: I felt this Jira issue is more active than SOLR-1045. That's why I posted my comment here.
indexing a csv file with a multivalued field
I am not having luck doing this. Even though I am specifying -F fieldname.separator='|', the fields are stored as one field, not as multiple fields. If I specify -F f.fieldname.separator='|' I get a null pointer exception.
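For reference, a hedged guess at the intended request shape (core URL, field name, and file path are placeholders): Solr's CSV loader applies a per-field separator only when per-field splitting is also enabled, so the two parameters are usually passed together:

```shell
# Hypothetical example: index a CSV file with the 'features' field split
# into multiple values on '|'. Both parameters take the per-field 'f.'
# prefix; the separator alone has no effect unless split is enabled.
curl 'http://localhost:8983/solr/update/csv?commit=true' \
     -F 'f.features.split=true' \
     -F 'f.features.separator=|' \
     -F 'stream.file=/path/to/data.csv'
```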
[jira] Commented: (SOLR-1045) Build Solr index using Hadoop MapReduce
[ https://issues.apache.org/jira/browse/SOLR-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828962#action_12828962 ]

Kevin Peterson commented on SOLR-1045:
--------------------------------------

Can anyone using this code comment on how this relates to SOLR-1301? https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828915#action_12828915 These seem to have identical goals but very different approaches.

Build Solr index using Hadoop MapReduce
---

Key: SOLR-1045
URL: https://issues.apache.org/jira/browse/SOLR-1045
Project: Solr
Issue Type: New Feature
Reporter: Ning Li
Fix For: 1.5
Attachments: SOLR-1045.0.patch

The goal is a contrib module that builds a Solr index using Hadoop MapReduce. It is different from the Solr support in Nutch. The Solr support in Nutch sends a document to a Solr server in a reduce task. Here, the goal is to build/update the Solr index within map/reduce tasks. Also, it achieves better parallelism when the number of map tasks is greater than the number of reduce tasks, which is usually the case.
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828961#action_12828961 ]

Ted Dunning commented on SOLR-1301:
-----------------------------------

{quote}
Based on these observation, I have few questions. (I am a beginner to the Hadoop Solr world. So, please forgive me if my questions are silly): 1. As per above observation, SOLR-1045 patch is functionally better (performance I have not verified yet ). Can anyone tell me, whats the actual advantage SOLR-1301 patch offers over SOLR-1045 patch? 2. If both the jira issues are trying to solve the same problem, do we really need 2 separate issues?
{quote}

In the katta community, the recommended practice started with SOLR-1045 (what I call map-side indexing) behavior, but I think that the consensus now is that SOLR-1301 behavior (what I call reduce-side indexing) is much, much better. This is not necessarily the obvious result given your observations. There are some operational differences between katta and Solr that might make the conclusions different, but what I have observed is the following:

a) index merging is a really bad idea that seems very attractive to begin with, because it is actually pretty expensive and doesn't solve the real problems of bad document distribution across shards. It is much better to simply have lots of shards per machine (aka micro-sharding) and use one reducer per shard. For large indexes, this gives entirely acceptable performance. On a pretty small cluster, we can index 50-100 million large documents in multiple ways in 2-3 hours. Index merging gives you no benefit compared to reduce-side indexing and just increases code complexity.

b) map-side indexing leaves you with indexes that are heavily skewed by being composed of documents from a single input split. At retrieval time, this means that different shards have very different term frequency profiles and very different numbers of relevant documents.
This makes lots of statistics very difficult, including term frequency computation, term weighting and determining the number of documents to retrieve. Map-side merge virtually guarantees that you have to do two cluster queries, one to gather term frequency statistics and another to do the actual query. With reduce-side indexing, you can provide strong probabilistic bounds on how different the statistics in each shard can be, so you can use local term statistics, and you can depend on the score distribution being the same, which radically decreases the number of documents you need to retrieve from each shard.

c) reduce-side indexing improves the balance of computation during retrieval. If (as is the rule) some document subset is hotter than another document subset due, say, to data-source boosting or recency boosting, you will have very bad cluster utilization with skewed shards from map-side indexing, while with reduce-side indexing all shards will cost about the same for any query, leading to good cluster utilization and faster queries.

d) reduce-side indexing has properties that can be mathematically stated and proved. Map-side indexing only has comparable properties if you make unrealistic assumptions about your original data.

e) micro-sharding allows very simple and very effective use of multiple cores on multiple machines in a search cluster. This can be very difficult to do with large shards or a single index.

Now, as you say, these advantages may evaporate if you are looking to produce a single output index. That seems, however, to contradict the whole point of scaling. If you need to scale indexing, presumably you also need to scale search speed and throughput. As such you probably want to have many shards rather than few. Conversely, if you can stand to search a single index, then you probably can stand to index on a single machine.
Another thing to think about is the fact that Solr doesn't yet do micro-sharding or clustering very well and, in particular, doesn't handle multiple shards per core. That will be changing before long, however, and it is very dangerous to design for the past rather than the future. In case you didn't notice, I strongly suggest you stick with reduce-side indexing.