Re: MRIT's morphline mapper doesn't co-locate with data

2014-09-24 Thread Wolfgang Hoschek
Based on our measurements, Lucene indexing is so CPU intensive that it wouldn’t 
really help much to exploit data locality on read. The overwhelming bottleneck 
remains the same. Having said that, we have an ingestion tool in the works that 
will take advantage of data locality for splittable files as well.

Wolfgang.

On Sep 24, 2014, at 9:38 AM, Tom Chen tomchen1...@gmail.com wrote:

 Hi,
 
 The MRIT (MapReduceIndexerTool) uses NLineInputFormat for the morphline
 mapper. The mapper doesn't co-locate with the input data that it processes.
 Isn't this a performance hit?
 
 Ideally, the morphline mapper should run on the hosts that contain most of the
 data blocks for the input files it processes.
 
 Regards,
 Tom



Re: DIH on Solr

2014-06-26 Thread Wolfgang Hoschek
Try this: 
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html
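
If the docs don't resolve it: the "Could not find or load main class" error typically
points to the launcher script not finding the indexer jars on its classpath. A minimal
sketch, assuming a standard hbase-indexer tarball layout (all paths below are
placeholders, adjust to your installation):

export JAVA_HOME=/usr/java/default          # your JDK
export INDEXER_HOME=/opt/hbase-indexer      # your install dir
cd "$INDEXER_HOME"
# start the indexer daemon; the script picks up lib/*.jar relative to the install dir
bin/hbase-indexer server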

Wolfgang.

On Jun 24, 2014, at 11:14 PM, atp annamalai...@hcl.com wrote:

 Thanks Ahmet, Wolfgang. I have installed hbase-indexer on one of the servers,
 but here too I'm unable to start the hbase-indexer server.
 
 Error: Could not find or load main class com.ngdata.hbaseindexer.Main
 
 I have properly set the JAVA_HOME and INDEXER_HOME environment variables.
 
 
 Please guide me.
 
 Thanks.
 ATP 
 
 
 
 
 



Re: DIH on Solr

2014-06-24 Thread Wolfgang Hoschek
Check out the HBase Indexer http://ngdata.github.io/hbase-indexer/
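
To give a rough idea of the workflow, here is a hedged sketch of registering an
indexer once the hbase-indexer daemon is running (names, file paths and ZooKeeper
addresses are placeholders; see the hbase-indexer docs for the exact options):

# register an indexer definition that maps an HBase table to a Solr collection
hbase-indexer add-indexer \
  -n myindexer \
  -c /path/to/indexdemo-indexer.xml \
  -cp solr.zk=zkhost:2181/solr \
  -cp solr.collection=collection1 \
  -z zkhost:2181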

Wolfgang.

On Jun 24, 2014, at 3:55 AM, Ahmet Arslan iori...@yahoo.com.INVALID wrote:

 Hi,
 
 There is no DataSource or EntityProcessor for HBase, I think.
 
 Maybe http://www.lilyproject.org/lily/index.html works for you?
 
 Ahmet
 
 
 On Tuesday, June 24, 2014 1:27 PM, atp annamalai...@hcl.com wrote:
 Hi experts,
 
 We have a requirement to import data from HBase tables into Solr. We tried
 with the help of DataImportHandler, but we couldn't find the configuration
 steps or documentation for DataImportHandler with HBase. Can anybody please
 share the steps to configure it?
 
 We tried a basic configuration, but running a full import throws the error
 below. Please share docs or links on configuring DIH for an HBase table.
 
 6/24/2014 3:44:00 PM WARN ZKPropertiesWriter - Could not read DIH properties from
 /configs/collection1/dataimport.properties : class
 org.apache.zookeeper.KeeperException$NoNodeException
 
 6/24/2014 3:44:00 PM ERROR DataImporter - Full Import failed: java.lang.RuntimeException:
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 load EntityProcessor implementation for entity:msg Processing Document # 1
 
 
 Thanks in advance.
 
 
 
 
 
 



Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents

2014-06-18 Thread Wolfgang Hoschek
Consider giving the MR tasks more RAM, for example via 

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx2000m' ...

Wolfgang.

On May 26, 2014, at 10:48 AM, Costi Muraru costimur...@gmail.com wrote:

 Hey Erick,
 
 The job reducers began to die with "Error: Java heap space" after 1h and
 22 minutes of being stuck at ~80%.
 
 I did a few more tests:
 
 Test 1.
 80,000 documents
 Each document had *20* fields. The field names were *the same* for all the
 documents. Values were different.
 Job status: successful
 Execution time: 33 seconds.
 
 Test 2.
 80,000 documents
 Each document had *20* fields. The field names were *different* for all the
 documents. Values were also different.
 Job status: successful
 Execution time: 643 seconds.
 
 Test 3.
 80,000 documents
 Each document had *50* fields. The field names were *the same* for all the
 documents. Values were different.
 Job status: successful
 Execution time: 45.96 seconds.
 
 Test 4.
 80,000 documents
 Each document had *50* fields. The field names were *different* for all the
 documents. Values were also different.
 Job status: failed
 Execution time: after 1h reducers failed.
 Unfortunately, this is my use case.
 
 My guess is that the reduce time (to perform the merges) depends on whether
 the field names are the same across the documents. If they are different, the
 merge time increases dramatically. I don't have any knowledge of the Solr
 merge operation internals, but is it possible that it tries to group the
 fields with the same name across all the documents?
 In the first case, when the field names are the same across documents, the
 number of buckets is equal to the number of unique field names which is 20.
 In the second case, where all the field names are different (my use case),
 it creates a lot more buckets (80k documents * 50 different field names = 4
 million buckets) and the process gets slowed down significantly.
 Is this assumption correct, and is there any way to get around it?
 
 Thanks again for reaching out. Hope this is more clear now.
 
 This is what one of the 80k documents looks like (JSON format):
 {
   "id" : "442247098240414508034066540706561683636",
   "items" : {
     "IT49597_1180_i" : 76,
     "IT25363_1218_i" : 4,
     "IT12418_1291_i" : 95,
     "IT55979_1051_i" : 31,
     "IT9841_1224_i" : 36,
     "IT40463_1010_i" : 87,
     "IT37932_1346_i" : 11,
     "IT17653_1054_i" : 37,
     "IT59414_1025_i" : 96,
     "IT51080_1133_i" : 5,
     "IT7369_1395_i" : 90,
     "IT59974_1245_i" : 25,
     "IT25374_1345_i" : 75,
     "IT16825_1458_i" : 28,
     "IT56643_1050_i" : 76,
     "IT46274_1398_i" : 50,
     "IT47411_1275_i" : 11,
     "IT2791_1000_i" : 97,
     "IT7708_1053_i" : 96,
     "IT46622_1112_i" : 90,
     "IT47161_1382_i" : 64
   }
 }
 
 Costi
 
 
 On Mon, May 26, 2014 at 7:45 PM, Erick Erickson 
 erickerick...@gmail.com wrote:
 
 The MapReduceIndexerTool is really intended for very large data sets,
 and by today's standards 80K doesn't qualify :).
 
 Basically, MRIT creates N sub-indexes, then merges them, which it
 may do in a tiered fashion. That is, it may merge gen1 to gen2, then
 merge gen2 to gen3 etc. Which is great when indexing a bazillion
 documents into 20 shards, but all that copying around may take
 more time than you really gain for 80K docs.
 
 Also be aware that MRIT does NOT update docs with the same ID; this is
 due to an inherent limitation of the Lucene mergeIndex process.
 
 How long is a long time? Attachments tend to get filtered out, so if you
 want us to see the graph you might paste it somewhere and provide a link.
 
 Best,
 Erick
 
 On Mon, May 26, 2014 at 8:51 AM, Costi Muraru costimur...@gmail.com
 wrote:
 Hey guys,
 
 I'm using the MapReduceIndexerTool to import data into a SolrCloud
 cluster made up of 3 decent machines.
 Looking in the JobTracker, I can see that the mapper jobs finish quite
 fast. The reduce jobs get to ~80% quite fast as well. It is here that
 they get stuck for a long period of time (picture + log attached).
 I'm only trying to insert ~80k documents with 10-50 different fields
 each. Why is this happening? Am I not setting something correctly? Is it
 the fact that most of the documents have different field names, or too
 many of them for that matter?
 Any tips are gladly appreciated.
 
 Thanks,
 Costi
 
 From the reduce logs:
 60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
 
 commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
 [IW][main]: commit: start
 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
 [IW][main]: commit: enter lock
 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
 [IW][main]: commit: now prepare
 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
 [IW][main]: prepareCommit: flush
 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
 [IW][main]:   index before 

Re: Offline Indexes Update to Shard

2014-06-03 Thread Wolfgang Hoschek
Hi, see comments inline below…

On Jun 2, 2014, at 6:49 AM, Vineet Mishra clearmido...@gmail.com wrote:

 Hi Wolfgang,
 
 Thanks for your response. Can you give a running example of
 MapReduceIndexerTool for indexing CSV files?
 If you are referring to
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html?scroll=csug_topic_6_1
 
 I had a few points to clarify:
 * What is a morphline?

See http://kitesdk.org/docs/current/kite-morphlines/index.html and 
http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html#/readCSV

 * Is it necessary to use a morphline for indexing, and if yes, how do I create one?

Yes, it requires a morphline (which is basically a chain of plugins), and you 
can plug in any custom Java code and custom commands into a morphline.
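
For the CSV question above, a hedged sketch of what an invocation might look like
(jar path, HDFS paths, ZooKeeper address and collection name are placeholders; the
morphline file would use the readCSV command described in the reference guide):

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /path/to/csv-morphline.conf \
  --output-dir hdfs://namenode/tmp/outdir \
  --zk-host zkhost:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs://namenode/path/to/input/*.csv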

 * Can the index only reside on HDFS and not on the local FS?

The implementation is only on HDFS.

 * What is the minimum CDH version supported for it?

CDH 4 or CDH 5.

Wolfgang.

 
 Looking forward to your response.
 
 Thanks!
 
 
 On Mon, Jun 2, 2014 at 2:24 PM, Wolfgang Hoschek whosc...@cloudera.com
 wrote:
 
 Sounds like you should consider using MapReduceIndexerTool. AFAIK, this is
 the most scalable indexing (and merging) solution out there.
 
 Wolfgang.
 
 On Jun 2, 2014, at 10:33 AM, Vineet Mishra clearmido...@gmail.com wrote:
 
 Hi Erick,
 
 Thanks for your mail; let me walk you through my use case.
 I have around 20-40 billion records to index, with each record having
 around 200-400 fields; the data is sensor data, so it can easily be
 stored as Integer or Float. To index this huge amount of data I am
 indexing through EmbeddedSolrServer, which was working fine, but I was
 looking for a way to move these generated indexes to the different
 shards, preferably without copying and pasting them to each machine;
 rather, some other approach where I submit the indexes to a shard and
 let the shard take care of distributing them over the leader and replicas.
 I want to mention one more thing: when I started indexing with
 EmbeddedSolrServer it went fine for the first few million documents, but
 thereafter the indexing speed is pathetically slow; it indexed around
 20 GB in a day and has indexed just 9 GB in another 2 days.
 Any suggestions for indexing optimization are also welcome.
 
 Hope this makes things much clearer.
 Looking forward to soon hear from you.
 
 Thanks and Regards!
 
 
 On Fri, May 30, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com
 
 wrote:
 
 You can copy to the shards and use the mergeindexes command; the
 MapReduceIndexerTool follows that approach.
 
 But really, what is the higher-level use-case you're trying to support?
 This feels a little like an XY problem. You could do things like
 1. index to a different collection, then use collection aliasing to switch
 2. just re-index to the current collection.
 3. use the MapReduceIndexerTool (admittedly it needs Hadoop).
 
 All in all, it feels like you're doing work you don't need to do. But
 that's a guess since you haven't told us what the use-case is.
 
 Best,
 Erick
 
 
 On Thu, May 29, 2014 at 7:22 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:
 
 Hi,
 
 On Wed, May 28, 2014 at 4:25 AM, Vineet Mishra clearmido...@gmail.com
 wrote:
 
 Hi All,
 
 Has anyone tried building offline indexes with EmbeddedSolrServer and
 posting them to shards?
 
 
 What do you mean by posting it to shards?  How is that different than
 copying them manually to the right location in FS?  Could you please
 elaborate?
 
 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/
 
 
 
 FYI, I am done building the indexes but am looking for a way to post
 these index files to the shards.
 Copying the indexes manually to each shard's replica is possible and
 works fine, but I don't want to go with that approach.
 
 Thanks!
 
 
 
 
 



Re: Offline Indexes Update to Shard

2014-06-02 Thread Wolfgang Hoschek
Sounds like you should consider using MapReduceIndexerTool. AFAIK, this is the 
most scalable indexing (and merging) solution out there.

Wolfgang.

On Jun 2, 2014, at 10:33 AM, Vineet Mishra clearmido...@gmail.com wrote:

 Hi Erick,
 
 Thanks for your mail; let me walk you through my use case.
 I have around 20-40 billion records to index, with each record having
 around 200-400 fields; the data is sensor data, so it can easily be
 stored as Integer or Float. To index this huge amount of data I am
 indexing through EmbeddedSolrServer, which was working fine, but I was
 looking for a way to move these generated indexes to the different
 shards, preferably without copying and pasting them to each machine;
 rather, some other approach where I submit the indexes to a shard and
 let the shard take care of distributing them over the leader and replicas.
 I want to mention one more thing: when I started indexing with
 EmbeddedSolrServer it went fine for the first few million documents, but
 thereafter the indexing speed is pathetically slow; it indexed around
 20 GB in a day and has indexed just 9 GB in another 2 days.
 Any suggestions for indexing optimization are also welcome.
 
 Hope this makes things much clearer.
 Looking forward to soon hear from you.
 
 Thanks and Regards!
 
 
 On Fri, May 30, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
 You can copy to the shards and use the mergeindexes command; the
 MapReduceIndexerTool follows that approach.
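 
 For reference, a hedged sketch of that CoreAdmin call (host, core name and index
 path are placeholders):
 
 curl 'http://host:8983/solr/admin/cores?action=mergeindexes&core=collection1_shard1_replica1&indexDir=/path/to/offline/index/data/index'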
 
 But really, what is the higher-level use-case you're trying to support?
 This feels a little like an XY problem. You could do things like
 1. index to a different collection, then use collection aliasing to switch
 2. just re-index to the current collection.
 3. use the MapReduceIndexerTool (admittedly it needs Hadoop).
 
 All in all, it feels like you're doing work you don't need to do. But
 that's a guess since you haven't told us what the use-case is.
 
 Best,
 Erick
 
 
 On Thu, May 29, 2014 at 7:22 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:
 
 Hi,
 
 On Wed, May 28, 2014 at 4:25 AM, Vineet Mishra clearmido...@gmail.com
 wrote:
 
 Hi All,
 
 Has anyone tried building offline indexes with EmbeddedSolrServer and
 posting them to shards?
 
 
 What do you mean by posting it to shards?  How is that different than
 copying them manually to the right location in FS?  Could you please
 elaborate?
 
 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/
 
 
 
 FYI, I am done building the indexes but am looking for a way to post
 these index files to the shards.
 Copying the indexes manually to each shard's replica is possible and
 works fine, but I don't want to go with that approach.
 
 Thanks!
 
 
 



Re: Update existing documents using MapReduceIndexerTool?

2014-05-06 Thread Wolfgang Hoschek
Yes, this is a known issue. Repeatedly running the MapReduceIndexerTool on the 
same set of input files can result in duplicate entries in the Solr collection. 
This occurs because currently the tool can only insert documents and cannot 
update or delete existing Solr documents.
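
As Costi notes below, deleting the old documents first does work, so one workaround
sketch (host, collection name and query are placeholders) is to delete the affected
documents and then re-run the tool:

# delete the stale documents (by query or by id), commit, then re-run MRIT
curl 'http://host:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>id:(doc1 OR doc2)</query></delete>'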

Wolfgang.

On May 6, 2014, at 3:08 PM, Costi Muraru costimur...@gmail.com wrote:

 Hi guys,
 
 I've used the MapReduceIndexerTool [1] in order to import data into Solr
 and seem to have stumbled upon something. I've followed the tutorial [2] and
 managed to import data into a SolrCloud cluster using the MapReduce job.
 I ran the job a second time in order to update some of the existing
 documents. The job itself was successful, but the documents maintained the
 same field values as before.
 In order to update some fields for the existing IDs, I've decompiled the
 AVRO sample file
 (examples/test-documents/sample-statuses-20120906-141433-medium.avro),
 updated some of the fields with new values, while maintaining the same IDs
 and packaged the AVRO back. After this I ran the MapReduceIndexerTool and,
 although successful, the records were not updated.
 I've tried this several times. Even with a few documents the result is the
 same - the documents are not being updated with the new values. Instead,
 the old field values are kept.
 If I manually delete the old document from SOLR and after this I run the
 job, the document is inserted with the new values.
 
 Do you guys have any experience with this tool? Is this by design, or am I
 missing something? Can this behavior be overridden to force an update?
 Any feedback is gladly appreciated.
 
 Thanks,
 Constantin
 
 [1]
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html#csug_topic_6_1
 
 [2]
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html



Re: What's the actual story with new morphline and hadoop contribs?

2014-04-15 Thread Wolfgang Hoschek
The Solr morphline jars are integrated with Solr by way of the Solr-specific 
solr/contrib/map-reduce module.

Ingestion from Flume into Solr is available here: 
http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

FWIW, for our purposes we see no role for DataImportHandler anymore.

Wolfgang.

On Apr 15, 2014, at 6:01 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 The use case I keep thinking about is Flume/Morphline replacing
 DataImportHandler. So, when I saw morphlines shipped with Solr, I tried
 to understand whether it is a step towards that.
 
 As it is, I am still not sure I understand why those jars are shipped
 with Solr if they do not actually integrate into Solr.
 
 Regards,
   Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency
 
 
 On Mon, Apr 14, 2014 at 8:36 PM, Wolfgang Hoschek whosc...@cloudera.com 
 wrote:
 Currently all Solr morphline use cases I’m aware of run in processes outside 
 of the Solr JVM, e.g. in Flume, in MapReduce, in HBase Lily Indexer, etc. 
 These ingestion processes generate Solr documents for Solr updates. Running 
 in external processes is done to improve scalability, reliability, 
 flexibility and reusability. Not everything needs to run inside of the Solr 
 JVM.
 
 We haven’t found a use case for it so far, but it would be easy to add an 
 UpdateRequestProcessor that runs a morphline inside of the Solr JVM.
 
 Here is more background info:
 
 http://kitesdk.org/docs/current/kite-morphlines/index.html
 
 http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
 
 http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf
 
 Wolfgang.
 
 On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch arafa...@gmail.com 
 wrote:
 
 Hello,
 
 I saw that 4.7.1 has morphline and hadoop contribution libraries, but
 I can't figure out the degree to which they are useful to _Solr_
 users. I found one Hadoop example in the readme that does some sort of
 injection into Solr. Is that the only use case supported?
 
 I thought that maybe there is an UpdateRequestProcessor or Handler
 end-point or something that hooks into morphline to do
 similar/alternative work to DataImportHandler. But I can't see any
 entry points or examples for that.
 
 Anybody knows what the story is and/or what the future holds?
 
 Regards,
   Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency
 



Re: What's the actual story with new morphline and hadoop contribs?

2014-04-14 Thread Wolfgang Hoschek
Currently all Solr morphline use cases I’m aware of run in processes outside of 
the Solr JVM, e.g. in Flume, in MapReduce, in HBase Lily Indexer, etc. These 
ingestion processes generate Solr documents for Solr updates. Running in 
external processes is done to improve scalability, reliability, flexibility and 
reusability. Not everything needs to run inside of the Solr JVM.

We haven’t found a use case for it so far, but it would be easy to add an 
UpdateRequestProcessor that runs a morphline inside of the Solr JVM.

Here is more background info: 

http://kitesdk.org/docs/current/kite-morphlines/index.html

http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf

Wolfgang.

On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Hello,
 
 I saw that 4.7.1 has morphline and hadoop contribution libraries, but
 I can't figure out the degree to which they are useful to _Solr_
 users. I found one Hadoop example in the readme that does some sort of
 injection into Solr. Is that the only use case supported?
 
 I thought that maybe there is an UpdateRequestProcessor or Handler
 end-point or something that hooks into morphline to do
 similar/alternative work to DataImportHandler. But I can't see any
 entry points or examples for that.
 
 Anybody knows what the story is and/or what the future holds?
 
 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency



Re: MapReduceIndexerTool does not respect Lucene version in solrconfig Was: converting 4.7 index to 4.3.1

2014-04-10 Thread Wolfgang Hoschek
There’s no such other location in there. BTW, you can disable the mtree merge 
via --reducers=-2 (or --reducers=0 in old versions).
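
For context, a hedged sketch of where that flag goes (jar path and the remaining
arguments are placeholders):

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --reducers=-2 ...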

Wolfgang.

On Apr 10, 2014, at 3:44 PM, Dmitry Kan solrexp...@gmail.com wrote:

 A correction: when I tested the above change I actually had so little data
 that it didn't trigger sub-shard slicing, and thus no merging of the slices.
 Still, it looks as if somewhere in the map-reduce contrib code there is a
 reference to which Lucene version to use.
 
 Wolfgang, do you happen to know where that other Version.* is specified?
 
 
 On Thu, Apr 10, 2014 at 12:59 PM, Dmitry Kan solrexp...@gmail.com wrote:
 
 Thanks for responding, Wolfgang.
 
 Changing to LUCENE_43:
 
 IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_43,
 null);
 
 didn't affect the index format version because, I believe, if the index
 being merged is already in a higher format version (4.1 in this case), it
 will merge to that same version and not to a lower one (4.0). But the format
 version certainly could be read from the solrconfig, you are right.
 
 Dmitry
 
 
 On Wed, Apr 9, 2014 at 11:51 PM, Wolfgang Hoschek 
 whosc...@cloudera.comwrote:
 
 There is a current limitation in that the code doesn't actually look into
 solrconfig.xml for the version. We should fix this, indeed. See
 
 
 https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L100-101
 
 Wolfgang.
 
 On Apr 8, 2014, at 11:49 AM, Dmitry Kan solrexp...@gmail.com wrote:
 
 Hello,
 
 When we instantiate the MapReduceIndexerTool with the collection's conf
 directory, we expect that the Lucene version is respected and that the index
 gets generated in a format compatible with the defined version.
 
 This does not seem to happen, however.
 
 Checking with luke:
 
 the expected Lucene index format: Lucene 4.0
 the output Lucene index format: Lucene 4.1
 
 Can anybody shed some light on the semantics behind specifying the Lucene
 version in this context? Does this have something to do with which version
 of Solr core is used by the morphline library?
 
 Thanks,
 
 Dmitry
 
 -- Forwarded message --
 
 Dear list,
 
 We have been generating Solr indexes with the solr-hadoop contrib module
 (SOLR-1301). Our current Solr version in use is 4.3.1. Is there any tool
 that could do the backward conversion, i.e. 4.7 -> 4.3.1? Or is upgrading
 the only way to go?
 
 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 
 
 
 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 
 
 
 
 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 
 
 
 
 -- 
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan



Re: MapReduceIndexerTool does not respect Lucene version in solrconfig Was: converting 4.7 index to 4.3.1

2014-04-09 Thread Wolfgang Hoschek
There is a current limitation in that the code doesn’t actually look into 
solrconfig.xml for the version. We should fix this, indeed. See

https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L100-101

Wolfgang.

On Apr 8, 2014, at 11:49 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hello,
 
 When we instantiate the MapReduceIndexerTool with the collection's conf
 directory, we expect that the Lucene version is respected and that the index
 gets generated in a format compatible with the defined version.
 
 This does not seem to happen, however.
 
 Checking with luke:
 
 the expected Lucene index format: Lucene 4.0
 the output Lucene index format: Lucene 4.1
 
 Can anybody shed some light on the semantics behind specifying the Lucene
 version in this context? Does this have something to do with which version
 of Solr core is used by the morphline library?
 
 Thanks,
 
 Dmitry
 
 -- Forwarded message --
 
 Dear list,
 
 We have been generating Solr indexes with the solr-hadoop contrib module
 (SOLR-1301). Our current Solr version in use is 4.3.1. Is there any tool
 that could do the backward conversion, i.e. 4.7 -> 4.3.1? Or is upgrading
 the only way to go?
 
 -- 
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 
 
 
 -- 
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan