Re: MRIT's morphline mapper doesn't co-locate with data
Based on our measurements, Lucene indexing is so CPU-intensive that exploiting data locality on read wouldn't really help much. The overwhelming bottleneck remains the same. Having said that, we have an ingestion tool in the works that will take advantage of data locality for splittable files as well. Wolfgang.

On Sep 24, 2014, at 9:38 AM, Tom Chen tomchen1...@gmail.com wrote: Hi, The MRIT (MapReduceIndexerTool) uses NLineInputFormat for the morphline mapper. The mapper doesn't co-locate with the input data that it processes. Isn't this a performance hit? Ideally, the morphline mapper should run on those hosts that contain most of the data blocks for the input files it processes. Regards, Tom
Re: DIH on Solr
Try this: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html Wolfgang.

On Jun 24, 2014, at 11:14 PM, atp annamalai...@hcl.com wrote: Thanks Ahmet, Wolfgang. I have installed hbase-indexer on one of the servers, but here too I'm unable to start the hbase-indexer server. Error: Could not find or load main class com.ngdata.hbaseindexer.Main. I properly set the JAVA_HOME and INDEXER_HOME environment variables. Please guide. Thanks, ATP -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-on-Solr-tp4143669p4143955.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH on Solr
Check out the HBase Indexer: http://ngdata.github.io/hbase-indexer/ Wolfgang.

On Jun 24, 2014, at 3:55 AM, Ahmet Arslan iori...@yahoo.com.INVALID wrote: Hi, There is no DataSource or EntityProcessor for HBase, I think. Maybe http://www.lilyproject.org/lily/index.html works for you? Ahmet

On Tuesday, June 24, 2014 1:27 PM, atp annamalai...@hcl.com wrote: Hi experts, We have a requirement to import data from HBase tables into Solr. We tried with the help of DataImportHandler, but we couldn't find the configuration steps or documentation for DataImportHandler for HBase. Can anybody please share the steps to configure it? We tried a basic configuration, but a full import throws an error. Please share docs or links on configuring DIH for an HBase table.

6/24/2014 3:44:00 PM WARN ZKPropertiesWriter Could not read DIH properties from /configs/collection1/dataimport.properties :class org.apache.zookeeper.KeeperException$NoNodeException
6/24/2014 3:44:00 PM ERROR DataImporter Full Import failed:java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:msg Processing Document # 1

Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-on-Solr-tp4143669.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MergeReduceIndexerTool takes a lot of time for a limited number of documents
Consider giving the MR tasks more RAM, for example via hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx2000m' ... Wolfgang.

On May 26, 2014, at 10:48 AM, Costi Muraru costimur...@gmail.com wrote: Hey Erick, The job reducers began to die with Error: Java heap space after 1h and 22 minutes of being stuck at ~80%. I did a few more tests:

Test 1. 80,000 documents. Each document had *20* fields. The field names were *the same* for all the documents. Values were different. Job status: successful. Execution time: 33 seconds.

Test 2. 80,000 documents. Each document had *20* fields. The field names were *different* for all the documents. Values were also different. Job status: successful. Execution time: 643 seconds.

Test 3. 80,000 documents. Each document had *50* fields. The field names were *the same* for all the documents. Values were different. Job status: successful. Execution time: 45.96 seconds.

Test 4. 80,000 documents. Each document had *50* fields. The field names were *different* for all the documents. Values were also different. Job status: failed. Execution time: reducers failed after 1h.

Unfortunately, this is my use case. My guess is that the reduce time (to perform the merges) depends on whether the field names are the same across the documents. If they are different, the merge time increases dramatically. I don't have any knowledge of the internals of the Solr merge operation, but is it possible that it tries to group the fields with the same name across all the documents? In the first case, when the field names are the same across documents, the number of buckets is equal to the number of unique field names, which is 20. In the second case, where all the field names are different (my use case), it creates a lot more buckets (80k documents * 50 different field names = 4 million buckets) and the process is slowed down significantly.
Is this assumption correct? Is there any way to get around it? Thanks again for reaching out. Hope this is clearer now. This is what one of the 80k documents looks like (JSON format): { id : 442247098240414508034066540706561683636, items : { IT49597_1180_i : 76, IT25363_1218_i : 4, IT12418_1291_i : 95, IT55979_1051_i : 31, IT9841_1224_i : 36, IT40463_1010_i : 87, IT37932_1346_i : 11, IT17653_1054_i : 37, IT59414_1025_i : 96, IT51080_1133_i : 5, IT7369_1395_i : 90, IT59974_1245_i : 25, IT25374_1345_i : 75, IT16825_1458_i : 28, IT56643_1050_i : 76, IT46274_1398_i : 50, IT47411_1275_i : 11, IT2791_1000_i : 97, IT7708_1053_i : 96, IT46622_1112_i : 90, IT47161_1382_i : 64 } } Costi

On Mon, May 26, 2014 at 7:45 PM, Erick Erickson erickerick...@gmail.com wrote: The MapReduceIndexerTool is really intended for very large data sets, and by today's standards 80K doesn't qualify :). Basically, MRIT creates N sub-indexes, then merges them, which it may do in a tiered fashion. That is, it may merge gen1 to gen2, then merge gen2 to gen3, etc. Which is great when indexing a bazillion documents into 20 shards, but all that copying around may take more time than you really gain for 80K docs. Also be aware that MRIT does NOT update docs with the same ID; this is due to the inherent limitation of the Lucene mergeIndex process. How long is a long time? Attachments tend to get filtered out, so if you want us to see the graph you might paste it somewhere and provide a link. Best, Erick

On Mon, May 26, 2014 at 8:51 AM, Costi Muraru costimur...@gmail.com wrote: Hey guys, I'm using the MergeReduceIndexerTool to import data into a SolrCloud cluster made up of 3 decent machines. Looking in the JobTracker, I can see that the mapper jobs finish quite fast. The reduce jobs get to ~80% quite fast as well. It is here that they get stuck for a long period of time (picture + log attached). I'm only trying to insert ~80k documents with 10-50 different fields each. Why is this happening?
Am I not setting something correctly? Is it the fact that most of the documents have different field names, or too many of them for that matter? Any tips are gladly appreciated. Thanks, Costi

From the reduce logs:
60208 [main] INFO org.apache.solr.update.UpdateHandler - start commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: start
60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: enter lock
60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: now prepare
60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: prepareCommit: flush
60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: index before
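[Editor's note] The bucket arithmetic in Costi's hypothesis can be sanity-checked with a trivial sketch. This is purely illustrative; it is not how Lucene represents fields internally, only a count of the distinct field names the merge would have to track in each scenario:

```python
def unique_field_names(num_docs, fields_per_doc, shared_names):
    """Count distinct field names across all documents."""
    if shared_names:
        # Every document reuses the same field names.
        return fields_per_doc
    # Every document contributes its own fresh field names.
    return num_docs * fields_per_doc

fast_case = unique_field_names(80_000, 20, shared_names=True)   # Test 1: 20 names
slow_case = unique_field_names(80_000, 50, shared_names=False)  # Test 4: 4,000,000 names
```

A 200,000x difference in per-field bookkeeping is consistent with the observed jump from seconds to heap-exhausted reducers.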
Re: Offline Indexes Update to Shard
Hi, see comments inline below...

On Jun 2, 2014, at 6:49 AM, Vineet Mishra clearmido...@gmail.com wrote: Hi Wolfgang, Thanks for your response. Can you quote a running example of MapReduceIndexerTool for indexing CSV files? If you are referring to http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html?scroll=csug_topic_6_1 I had a few points to clarify:

* What is the morphline? See http://kitesdk.org/docs/current/kite-morphlines/index.html and http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html#/readCSV

* Is it necessary to use a morphline for indexing, and if yes, how to create one? Yes, it requires a morphline (which is basically a chain of plugins), and you can plug any custom Java code and custom commands into a morphline.

* Can the index only reside on HDFS and not on the local FS? The implementation is HDFS-only.

* What is the minimum CDH version supported for it? CDH 4 or CDH 5. Wolfgang.

Looking forward to your response. Thanks!

On Mon, Jun 2, 2014 at 2:24 PM, Wolfgang Hoschek whosc...@cloudera.com wrote: Sounds like you should consider using MapReduceIndexerTool. AFAIK, this is the most scalable indexing (and merging) solution out there. Wolfgang.

On Jun 2, 2014, at 10:33 AM, Vineet Mishra clearmido...@gmail.com wrote: Hi Erick, Thanks for your mail; let me walk you through my use case. I have around 20-40 billion records to index, with each record having around 200-400 fields; the data is sensor data, so it can easily be stored as Integer or Float. To index this huge amount of data I am indexing through EmbeddedSolrServer, which was working fine, but I was looking for a way to move these generated indexes to different shards, preferably without copy-pasting them to each machine; instead, some approach where I submit these indexes to a shard and let the shard take care of distributing them over leader and replica.
I want to mention one more thing: as I started indexing with EmbeddedSolrServer, it went fine for the first few million documents, but thereafter the indexing speed became pathetically slow; it indexed around 20 GB in a day, but only another 9 GB over the next 2 days. Any indexing optimization suggestions are also welcome. Hope this makes things much clearer. Looking forward to hearing from you soon. Thanks and Regards!

On Fri, May 30, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com wrote: You can copy the indexes to the shards and use the mergeindexes command; the MapReduceIndexerTool follows that approach. But really, what is the higher-level use case you're trying to support? This feels a little like an XY problem. You could do things like 1) index to a different collection, then use collection aliasing to switch, 2) just re-index into the current collection, or 3) use the MapReduceIndexerTool (admittedly it needs Hadoop). All in all, it feels like you're doing work you don't need to do. But that's a guess, since you haven't told us what the use case is. Best, Erick

On Thu, May 29, 2014 at 7:22 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, On Wed, May 28, 2014 at 4:25 AM, Vineet Mishra clearmido...@gmail.com wrote: Hi All, Has anyone tried building offline indexes with EmbeddedSolrServer and posting them to shards? What do you mean by posting them to shards? How is that different from copying them manually to the right location in the FS? Could you please elaborate? Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/

FYI, I am done building the indexes but am looking for a way to post these index files to the shards. Copying the indexes manually to each shard's replica is possible and works fine, but I don't want to go with that approach. Thanks!
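[Editor's note] The readCSV morphline mentioned above is typically a small config file along these lines. This is a hedged sketch: the column names (id, name, price) and the SOLR_LOCATOR variable are illustrative placeholders, not values from the thread:

```
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        # Parse each CSV line into a record with the named fields
        readCSV {
          separator : ","
          columns : [id, name, price]
          ignoreFirstLine : true
          charset : UTF-8
        }
      }
      # Hand the record to Solr; SOLR_LOCATOR points at the collection config
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```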
Re: Offline Indexes Update to Shard
Sounds like you should consider using MapReduceIndexerTool. AFAIK, this is the most scalable indexing (and merging) solution out there. Wolfgang.

On Jun 2, 2014, at 10:33 AM, Vineet Mishra clearmido...@gmail.com wrote: Hi Erick, Thanks for your mail; let me walk you through my use case. I have around 20-40 billion records to index, with each record having around 200-400 fields; the data is sensor data, so it can easily be stored as Integer or Float. To index this huge amount of data I am indexing through EmbeddedSolrServer, which was working fine, but I was looking for a way to move these generated indexes to different shards, preferably without copy-pasting them to each machine; instead, some approach where I submit these indexes to a shard and let the shard take care of distributing them over leader and replica. I want to mention one more thing: as I started indexing with EmbeddedSolrServer, it went fine for the first few million documents, but thereafter the indexing speed became pathetically slow; it indexed around 20 GB in a day, but only another 9 GB over the next 2 days. Any indexing optimization suggestions are also welcome. Hope this makes things much clearer. Looking forward to hearing from you soon. Thanks and Regards!

On Fri, May 30, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com wrote: You can copy the indexes to the shards and use the mergeindexes command; the MapReduceIndexerTool follows that approach. But really, what is the higher-level use case you're trying to support? This feels a little like an XY problem. You could do things like 1) index to a different collection, then use collection aliasing to switch, 2) just re-index into the current collection, or 3) use the MapReduceIndexerTool (admittedly it needs Hadoop). All in all, it feels like you're doing work you don't need to do. But that's a guess, since you haven't told us what the use case is.
Best, Erick

On Thu, May 29, 2014 at 7:22 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, On Wed, May 28, 2014 at 4:25 AM, Vineet Mishra clearmido...@gmail.com wrote: Hi All, Has anyone tried building offline indexes with EmbeddedSolrServer and posting them to shards? What do you mean by posting them to shards? How is that different from copying them manually to the right location in the FS? Could you please elaborate? Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/

FYI, I am done building the indexes but am looking for a way to post these index files to the shards. Copying the indexes manually to each shard's replica is possible and works fine, but I don't want to go with that approach. Thanks!
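[Editor's note] The mergeindexes command Erick refers to is exposed through Solr's CoreAdmin API (action=MERGEINDEXES). A small sketch of constructing such a request URL; the host, core name, and directory paths are illustrative assumptions:

```python
from urllib.parse import urlencode

def merge_indexes_url(solr_base, target_core, index_dirs):
    """Build a CoreAdmin MERGEINDEXES request URL that merges one or
    more on-disk index directories into an existing core."""
    params = [("action", "MERGEINDEXES"), ("core", target_core)]
    # Each offline index directory becomes its own indexDir parameter.
    params += [("indexDir", d) for d in index_dirs]
    return solr_base + "/admin/cores?" + urlencode(params)

url = merge_indexes_url("http://localhost:8983/solr",
                        "collection1_shard1_replica1",
                        ["/tmp/offline_index1", "/tmp/offline_index2"])
```

Note that, as discussed in this thread, the merge is purely additive; it does not deduplicate documents by unique id.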
Re: Update existing documents using MapReduceIndexerTool?
Yes, this is a known issue. Repeatedly running the MapReduceIndexerTool on the same set of input files can result in duplicate entries in the Solr collection. This occurs because the tool can currently only insert documents; it cannot update or delete existing Solr documents. Wolfgang.

On May 6, 2014, at 3:08 PM, Costi Muraru costimur...@gmail.com wrote: Hi guys, I've used the MapReduceIndexerTool [1] to import data into Solr and seem to have stumbled upon something. I followed the tutorial [2] and managed to import data into a SolrCloud cluster using the MapReduce job. I ran the job a second time in order to update some of the existing documents. The job itself was successful, but the documents kept the same field values as before. To update some fields for the existing IDs, I unpacked the Avro sample file (examples/test-documents/sample-statuses-20120906-141433-medium.avro), updated some of the fields with new values while keeping the same IDs, and packaged the Avro file back up. After this I ran the MapReduceIndexerTool and, although it was successful, the records were not updated. I've tried this several times. Even with a few documents the result is the same: the documents are not updated with the new values; instead, the old field values are kept. If I manually delete the old document from Solr and then run the job, the document is inserted with the new values. Do you guys have any experience with this tool? Is this by design? Am I missing something? Can this behavior be overridden to force an update? Any feedback is gladly appreciated. Thanks, Constantin [1] http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html#csug_topic_6_1 [2] http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html
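[Editor's note] The behavior described here can be pictured with a toy model. This is an illustration only, not MRIT's actual code: the offline merge appends documents without looking at ids, whereas a regular Solr update replaces the document with the same unique id:

```python
def mrit_style_add(index, doc):
    """Merge-based offline indexing: documents are only appended.
    An existing doc with the same id is NOT replaced, so re-running
    the job on the same input creates duplicates."""
    index.append(doc)

def solr_style_add(index, doc):
    """Regular Solr update: a doc with the same unique id is overwritten."""
    index[:] = [d for d in index if d["id"] != doc["id"]]
    index.append(doc)

idx = []
mrit_style_add(idx, {"id": "1", "v": "old"})
mrit_style_add(idx, {"id": "1", "v": "new"})  # duplicate; the old value survives too
```

This is why deleting the old document first, as Costi did, makes the re-run appear to "update" correctly.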
Re: What's the actual story with new morphline and hadoop contribs?
The Solr morphline jars are integrated with Solr by way of the Solr-specific solr/contrib/map-reduce module. Ingestion from Flume into Solr is available here: http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink FWIW, for our purposes we see no role for DataImportHandler anymore. Wolfgang.

On Apr 15, 2014, at 6:01 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: The use case I keep thinking about is Flume/Morphline replacing DataImportHandler. So, when I saw morphlines shipped with Solr, I tried to understand whether this is a step towards that. As it is, I am still not sure I understand why those jars are shipped with Solr if they do not actually integrate into Solr. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Mon, Apr 14, 2014 at 8:36 PM, Wolfgang Hoschek whosc...@cloudera.com wrote: Currently all Solr morphline use cases I'm aware of run in processes outside of the Solr JVM, e.g. in Flume, in MapReduce, in the HBase Lily Indexer, etc. These ingestion processes generate Solr documents for Solr updates. Running in external processes is done to improve scalability, reliability, flexibility and reusability. Not everything needs to run inside of the Solr JVM. We haven't found a use case for it so far, but it would be easy to add an UpdateRequestProcessor that runs a morphline inside of the Solr JVM. Here is more background info: http://kitesdk.org/docs/current/kite-morphlines/index.html http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf Wolfgang.

On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, I saw that 4.7.1 has morphline and hadoop contrib libraries, but I can't figure out the degree to which they are useful to _Solr_ users. I found one hadoop example in the readme that does some sort of injection into Solr.
Is that the only use case supported? I thought that maybe there is an UpdateRequestProcessor or Handler end-point or something that hooks into morphlines to do similar/alternative work to DataImportHandler, but I can't see any entry points or examples for that. Does anybody know what the story is and/or what the future holds? Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: What's the actual story with new morphline and hadoop contribs?
Currently all Solr morphline use cases I'm aware of run in processes outside of the Solr JVM, e.g. in Flume, in MapReduce, in the HBase Lily Indexer, etc. These ingestion processes generate Solr documents for Solr updates. Running in external processes is done to improve scalability, reliability, flexibility and reusability. Not everything needs to run inside of the Solr JVM. We haven't found a use case for it so far, but it would be easy to add an UpdateRequestProcessor that runs a morphline inside of the Solr JVM. Here is more background info: http://kitesdk.org/docs/current/kite-morphlines/index.html http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf Wolfgang.

On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, I saw that 4.7.1 has morphline and hadoop contrib libraries, but I can't figure out the degree to which they are useful to _Solr_ users. I found one hadoop example in the readme that does some sort of injection into Solr. Is that the only use case supported? I thought that maybe there is an UpdateRequestProcessor or Handler end-point or something that hooks into morphlines to do similar/alternative work to DataImportHandler, but I can't see any entry points or examples for that. Does anybody know what the story is and/or what the future holds? Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: MapReduceIndexerTool does not respect Lucene version in solrconfig Was: converting 4.7 index to 4.3.1
There's no such other location in there. BTW, you can disable the mtree merge via --reducers=-2 (or --reducers=0 in old versions). Wolfgang.

On Apr 10, 2014, at 3:44 PM, Dmitry Kan solrexp...@gmail.com wrote: A correction: when I tested the above change I actually had so little data that it didn't trigger sub-shard slicing and thus merging of the slices. Still, it looks as if somewhere in the map-reduce contrib code there is a reference to which Lucene version to use. Wolfgang, do you happen to know where that other Version.* is specified?

On Thu, Apr 10, 2014 at 12:59 PM, Dmitry Kan solrexp...@gmail.com wrote: Thanks for responding, Wolfgang. Changing to LUCENE_43: IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_43, null); didn't affect the index format version because, I believe, if the index to merge is already in a higher-version format (4.1 in this case), it will merge to the same version and not to a lower one (4.0). But the format version could certainly be read from solrconfig, you are right. Dmitry

On Wed, Apr 9, 2014 at 11:51 PM, Wolfgang Hoschek whosc...@cloudera.com wrote: There is a current limitation in that the code doesn't actually look into solrconfig.xml for the version. We should fix this, indeed. See https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L100-101 Wolfgang.

On Apr 8, 2014, at 11:49 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello, When we instantiate the MapReduceIndexerTool with the collection's conf directory, we expect that the Lucene version is respected and the index is generated in a format compatible with the defined version. This does not seem to happen, however. Checking with Luke: the expected Lucene index format is Lucene 4.0; the output Lucene index format is Lucene 4.1. Can anybody shed some light on the semantics behind specifying the Lucene version in this context?
Does this have something to do with which version of Solr core is used by the morphline library? Thanks, Dmitry

-- Forwarded message -- Dear list, We have been generating Solr indices with the solr-hadoop contrib module (SOLR-1301). Our current Solr version is 4.3.1. Is there any tool that could do the backward conversion, i.e. 4.7 to 4.3.1? Or is an upgrade the only way to go? -- Dmitry Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan
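[Editor's note] The fix Wolfgang describes would require reading luceneMatchVersion out of solrconfig.xml before constructing the IndexWriterConfig. A minimal, hypothetical sketch of just that parsing step (in Python for illustration; the real fix would live in the Java TreeMergeOutputFormat):

```python
import xml.etree.ElementTree as ET

def lucene_match_version(solrconfig_text):
    """Extract the <luceneMatchVersion> value from a solrconfig.xml string,
    or return None if the element is absent."""
    root = ET.fromstring(solrconfig_text)
    node = root.find("luceneMatchVersion")
    return node.text.strip() if node is not None else None

sample = "<config><luceneMatchVersion>LUCENE_43</luceneMatchVersion></config>"
```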
Re: MapReduceIndexerTool does not respect Lucene version in solrconfig Was: converting 4.7 index to 4.3.1
There is a current limitation in that the code doesn't actually look into solrconfig.xml for the version. We should fix this, indeed. See https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L100-101 Wolfgang.

On Apr 8, 2014, at 11:49 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello, When we instantiate the MapReduceIndexerTool with the collection's conf directory, we expect that the Lucene version is respected and the index is generated in a format compatible with the defined version. This does not seem to happen, however. Checking with Luke: the expected Lucene index format is Lucene 4.0; the output Lucene index format is Lucene 4.1. Can anybody shed some light on the semantics behind specifying the Lucene version in this context? Does this have something to do with which version of Solr core is used by the morphline library? Thanks, Dmitry

-- Forwarded message -- Dear list, We have been generating Solr indices with the solr-hadoop contrib module (SOLR-1301). Our current Solr version is 4.3.1. Is there any tool that could do the backward conversion, i.e. 4.7 to 4.3.1? Or is an upgrade the only way to go? -- Dmitry Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan