Re: Solr on HDFS
> If you think about it, having a shard with 3 replicas on top of a file system that does 3x replication seems a little excessive!

https://issues.apache.org/jira/browse/SOLR-6305 should help here. I can take a look at merging the patch, since it looks like it has been helpful to others.

Kevin Risden

On Fri, Aug 2, 2019 at 10:09 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
Re: Solr on HDFS
Hi Kyle - Thank you.

Our current index is split across 3 Solr collections; our largest collection is 26.8 TBytes (80.5 TBytes when 3x replicated in HDFS) across 100 shards. There are 40 machines hosting this cluster. We've found that when dealing with large collections, having no replicas (but lots of shards) ends up being more reliable, since there is a much smaller recovery time. We keep another 30-day index (1.4 TBytes) that does have replicas (40 shards, 3 replicas each), and if a node goes down, we manually delete lock files and then bring it back up - and yes, lots of network IO, but it usually recovers OK.

Having a large collection like this with no replicas seems like a recipe for disaster. So, we've been experimenting with the latest version (8.2), changing our index process to split the data into many Solr collections that do have replicas, and then building the list of collections to search at query time. Our searches are date based, so we can define which collections we want to query at query time. As a test, we ran just two machines, HDFS, and 500 collections. One server ran out of memory and crashed. We had over 1,600 lock files to delete.

If you think about it, having a shard with 3 replicas on top of a file system that does 3x replication seems a little excessive! I'd love to see Solr take more advantage of a shared FS. Perhaps an idea is to use HDFS but with an NFS gateway; seems like that may be slow, though.

Architecturally, I love only having one large file system to manage instead of lots of individual file systems across many machines. HDFS makes this easy.

-Joe

On 8/2/2019 9:10 AM, lstusr 5u93n4 wrote:
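Joe's "build the list of collections to search at query time" approach maps directly onto SolrCloud's `collection` request parameter, which accepts a comma-separated list of collections to query in one request. A minimal sketch of how that list might be built for date-partitioned collections - the monthly naming scheme (`events_2019_08`) and the base URL are assumptions for illustration, not something stated in the thread:

```python
from datetime import date
from urllib.parse import urlencode

def monthly_collections(prefix, start, end):
    """Return date-based collection names covering [start, end].

    Assumes one collection per month named like 'events_2019_08'
    (a hypothetical naming scheme, not from the thread).
    """
    names = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        names.append(f"{prefix}_{y:04d}_{m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return names

def query_url(base, collections, q):
    # SolrCloud accepts a comma-separated list in the 'collection'
    # parameter, so one request can span several collections.
    params = urlencode({"collection": ",".join(collections), "q": q})
    return f"{base}/solr/{collections[0]}/select?{params}"
```

With this, a search over June through August only touches three collections instead of the whole index, which is the point of the scheme.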
Re: Solr on HDFS
Hi Joe,

We fought with Solr on HDFS for quite some time, and faced similar issues to the ones you're seeing. (See this thread, for example: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e )

The Solr lock files on HDFS get deleted if the Solr server gets shut down gracefully, but we couldn't always guarantee that in our environment, so we ended up writing a custom startup script to search for lock files on HDFS and delete them before Solr startup.

However, the issue you mention of the Solr server rebuilding its whole index from replicas on startup was enough of a show-stopper for us that we switched away from HDFS to local disk. It literally made the difference between 24+ hours of recovery time after an unexpected outage and less than a minute...

If you do end up finding a solution to this issue, please post it to this mailing list, because there are others out there (like us!) who would most definitely make use of it.

Thanks,

Kyle

On Fri, 2 Aug 2019 at 08:58, Joe Obernberger wrote:
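Kyle doesn't show the cleanup script itself, so here is one possible sketch of a pre-start lock sweep, assuming the standard `hdfs dfs -ls -R` output format (path in the last column) and the default `write.lock` name; the `/solr` root is a placeholder:

```python
import subprocess

def stale_lock_paths(ls_output, lock_name="write.lock"):
    """Extract lock-file paths from `hdfs dfs -ls -R <root>` output.

    Each file line ends with the full path; keep only those ending in
    the lock file name. Blank and malformed lines are skipped.
    """
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        if fields and fields[-1].endswith("/" + lock_name):
            paths.append(fields[-1])
    return paths

def delete_stale_locks(solr_root="/solr"):
    # Intended to run before Solr startup, while no Solr process holds
    # the index; requires the hdfs CLI on the PATH.
    out = subprocess.run(["hdfs", "dfs", "-ls", "-R", solr_root],
                         capture_output=True, text=True, check=True).stdout
    for path in stale_lock_paths(out):
        subprocess.run(["hdfs", "dfs", "-rm", path], check=True)
```

Keeping the path-filtering logic separate from the `hdfs` invocations makes it easy to test without a live cluster.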
Re: Solr on HDFS
Thank you. No - while the cluster is using Cloudera for HDFS, we do not use Cloudera to manage the Solr cluster. If it is a configuration/architecture issue, what can I do to fix it? I'd like a system where servers can come and go, but the indexes stay available and recover automatically. Is that possible with HDFS?

While adding an alias to other collections would be an option, if that collection is the only collection, or one that is currently needed in a live system, we can't bring it down, re-create it, and re-index when that process may take weeks to do.

Any ideas?

-Joe

On 8/1/2019 6:15 PM, Angie Rabelero wrote:
Re: Solr on HDFS
I don’t think you’re using Cloudera or Ambari, but Ambari has an option to delete the locks. This seems more a configuration/architecture issue than a reliability issue. You may want to spin up an alias while you bring down, clear locks and directories, recreate, and re-index the affected collection, while you work your other issues.

On Aug 1, 2019, at 16:40, Joe Obernberger wrote:
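The alias trick relies on the Collections API's CREATEALIAS action, which also repoints an existing alias atomically, so clients querying the alias never see the rebuild. A sketch of building that request (host, alias, and collection names are placeholders):

```python
from urllib.parse import urlencode

def createalias_url(solr_base, alias, collections):
    """Build a Collections API request pointing `alias` at `collections`.

    Issuing CREATEALIAS again with a new collection list repoints the
    alias atomically, which is what makes swap-while-you-rebuild work.
    """
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{solr_base}/solr/admin/collections?{params}"
```

In a live system you would GET this URL (e.g. with `urllib.request.urlopen`) while clients keep querying the alias name rather than any concrete collection.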
Solr on HDFS
Been using Solr on HDFS for a while now, and I'm seeing an issue with redundancy/reliability. If a server goes down, when it comes back up, it will never recover because of the lock files in HDFS. That Solr node needs to be brought down manually, the lock files deleted, and then brought back up. At that point, it appears to copy all the data for its replicas. If the index is large, and new data is being indexed, in some cases it will never recover - the replication retries over and over.

How can we make a reliable SolrCloud cluster when using HDFS that can handle servers coming and going?

Thank you!

-Joe
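For context, an HDFS-backed core in this era of Solr was typically configured along these lines in solrconfig.xml (the namenode URI and path are placeholders); the `hdfs` lockType is what creates the write.lock files this thread keeps running into:

```xml
<!-- Store the index on HDFS instead of local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>

<indexConfig>
  <!-- The HDFS lock is a marker file, not an OS-level lock -->
  <lockType>${solr.lock.type:hdfs}</lockType>
</indexConfig>
```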
Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?
@Shawn Heisey - Yeah, deleting the "write.lock" files manually worked in the end.

@Walter Underwood - Do you have any recent performance evaluations of Solr on HDFS vs. local FS?

Shawn Heisey wrote on Tue, Aug 28, 2018 at 4:10 AM:
Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?
On 8/26/2018 7:47 PM, zhenyuan wei wrote:
> I found an exception when running Solr on HDFS. The details: Solr was running on HDFS with document updates running continuously; then kill -9 the Solr JVM (or reboot/shut down the Linux OS), and restart everything.

If you use "kill -9" to stop a Solr instance, the lock file will get left behind, and you may have difficulty starting Solr back up on ANY kind of filesystem until you delete the file in each core's data directory. The filename defaults to "write.lock" if you don't change it.

Thanks,
Shawn
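Shawn's "delete the file in each core's data directory" can be scripted; a minimal sketch, assuming the stock local-disk core layout `<solr_home>/<core>/data/index/write.lock` (only safe to delete while no Solr process is running):

```python
from pathlib import Path

def find_stale_locks(solr_home, lock_name="write.lock"):
    """List leftover Lucene lock files under a Solr home directory.

    Assumes the stock layout <solr_home>/<core>/data/index/<lock_name>.
    Returns sorted paths; deletion is left to the caller, who must be
    sure Solr is stopped first.
    """
    return sorted(str(p) for p in Path(solr_home).glob(f"*/data/index/{lock_name}"))
```

Running this before startup and unlinking whatever it returns mirrors the manual procedure described in the thread.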
Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?
I accidentally put my Solr indexes on NFS once, about ten years ago. It was 100X slower. I would not recommend that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Aug 27, 2018, at 1:39 AM, zhenyuan wei wrote:
Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?
Thanks for your answer, @Erick Erickson!

So, it's not recommended to run Solr on NFS-like filesystems (such as HDFS) now? Maybe because of crash errors or performance problems. I had a look at SOLR-8335 and SOLR-8169; there is no good solution for this yet, and maybe manual removal is the best option?

Erick Erickson wrote on Mon, Aug 27, 2018 at 11:41 AM:
Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?
Because HDFS doesn't follow the file semantics that Solr expects.

There's quite a bit of background here: https://issues.apache.org/jira/browse/SOLR-8335

Best,
Erick

On Sun, Aug 26, 2018 at 6:47 PM zhenyuan wei wrote:
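The semantic gap Erick refers to can be illustrated: on HDFS the lock is just a marker file created exclusively, so its existence *is* the lock, whereas a native OS lock (flock/fcntl) vanishes when the process dies. A hypothetical sketch emulating the existence-based scheme on a local filesystem:

```python
import os

def acquire_existence_lock(path):
    """Emulate existence-based locking: the lock is a file created with
    O_EXCL; whoever creates it holds the lock. Returns True on success.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

# A kill -9 never runs release code, so the marker file outlives the
# process, and the restarted server is blocked by its own stale lock.
# A kernel-level flock/fcntl lock would instead disappear with the
# process - which is why local-disk Solr doesn't hit this after a crash.
```

This is exactly the failure mode in the thread: the second acquire fails until someone removes the file by hand.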
An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?
Hi all, I found an exception when running Solr on HDFS. The details: Solr was running on HDFS with document updates going on continuously; then the Solr JVM was killed with kill -9 (or the Linux OS was rebooted or shut down), and everything was restarted. The exception appears as follows:

2018-08-26 22:23:12.529 ERROR (coreContainerWorkExecutor-2-thread-1-processing-n:cluster-node001:8983_solr) [ ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on startup
org.apache.solr.common.SolrException: Unable to create core [collection002_shard56_replica_n110]
    at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1061)
    at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
    at java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: Index dir 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core 'collection002_shard56_replica_n110' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: hdfs
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1009)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:864)
    at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040)
    ... 7 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Index dir 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core 'collection002_shard56_replica_n110' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: hdfs
    at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:746)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:955)
    ... 9 more

In fact, when I print out the exception stack at the HDFS API level, it reports:

Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /solr/collection002/core_node17/data/index/write.lock for client 192.168.0.12 already exists
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:623)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
    at sun.reflect.GeneratedConstructorAccessor140.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1839)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
    at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
    at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.
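After an unclean shutdown, the HdfsLockFactory's write.lock files are left behind and block core loading, as in the stack trace above. Several people in this thread report clearing them from a startup script; below is a minimal sketch of such a cleanup, assuming the Solr data lives under /solr on HDFS (the path and the listing format are assumptions, not details from the thread):

```shell
#!/bin/sh
# DANGER: run this only while every Solr JVM that uses these indexes is
# stopped; removing a lock that is legitimately held can corrupt an index.

# Extract write.lock paths from an `hdfs dfs -ls -R` listing
# (the path is the last field of each line).
stale_locks() {
    awk '$NF ~ /\/write\.lock$/ { print $NF }'
}

SOLR_HDFS_ROOT=${SOLR_HDFS_ROOT:-/solr}   # adjust to your HDFS data dir

# List everything under the Solr root and delete the stale lock files.
hdfs dfs -ls -R "$SOLR_HDFS_ROOT" | stale_locks | while read -r lock; do
    echo "removing stale lock: $lock"
    hdfs dfs -rm "$lock"
done
```

Running something like this before `bin/solr start` avoids the manual lock-file deletion described elsewhere in this thread.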
Re: Running Solr on HDFS - Disk space
The only options would be to configure Solr with a replication factor of 1, or HDFS with no replication. I would go for the middle and configure both to use a factor of 2. This way a single failure in HDFS or Solr is not a problem, while with the 1/3 or 3/1 option a single server error would bring the collection down. Setting the HDFS replication factor is a bit tricky, as in some places Solr takes the default replication factor set on HDFS and sometimes a default from the client side. HDFS allows you to set a replication factor for every file individually. regards, Hendrik On 07.06.2018 15:30, Shawn Heisey wrote: On 6/7/2018 6:41 AM, Greenhorn Techie wrote: As HDFS has got its own replication mechanism, with a HDFS replication factor of 3, and then SolrCloud replication factor of 3, does that mean each document will probably have around 9 copies replicated underneath of HDFS? If so, is there a way to configure HDFS or Solr such that only three copies are maintained overall? Yes, that is exactly what happens. SolrCloud replication assumes that each of its replicas is a completely independent index. I am not aware of anything in Solr's HDFS support that can use one HDFS index directory for multiple replicas. At the most basic level, a Solr index is a Lucene index. Lucene goes to great lengths to make sure that an index *CANNOT* be used in more than one place. Perhaps somebody who is more familiar with HDFSDirectoryFactory can offer you a solution. But as far as I know, there isn't one. Thanks, Shawn
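Hendrik's arithmetic and the suggested 2/2 setup can be sketched as follows; the collection name, port, and HDFS path below are illustrative, not taken from the thread:

```shell
# Physical copies of each document = SolrCloud replicationFactor x HDFS
# replication, because every SolrCloud replica is an independent index
# directory on HDFS.
total_copies() { echo $(( $1 * $2 )); }

total_copies 3 3   # the 3x3 case discussed in the thread -> 9
total_copies 2 2   # the suggested 2/2 middle ground      -> 4

# Applying the 2/2 setup (commands shown for reference only):
# 2 SolrCloud replicas per shard:
#   curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2"
# Re-replicate the already-written index files at HDFS factor 2 (-w waits
# for completion). HDFS replication is per file, so this can differ from
# the cluster default; the factor Solr's HDFS client requests for new
# files still has to be set on the client side (dfs.replication=2).
#   hdfs dfs -setrep -w 2 /solr/mycoll
```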
Re: Running Solr on HDFS - Disk space
On 6/7/2018 6:41 AM, Greenhorn Techie wrote: As HDFS has got its own replication mechanism, with a HDFS replication factor of 3, and then SolrCloud replication factor of 3, does that mean each document will probably have around 9 copies replicated underneath of HDFS? If so, is there a way to configure HDFS or Solr such that only three copies are maintained overall? Yes, that is exactly what happens. SolrCloud replication assumes that each of its replicas is a completely independent index. I am not aware of anything in Solr's HDFS support that can use one HDFS index directory for multiple replicas. At the most basic level, a Solr index is a Lucene index. Lucene goes to great lengths to make sure that an index *CANNOT* be used in more than one place. Perhaps somebody who is more familiar with HDFSDirectoryFactory can offer you a solution. But as far as I know, there isn't one. Thanks, Shawn
Running Solr on HDFS - Disk space
Hi, As HDFS has got its own replication mechanism, with a HDFS replication factor of 3, and then SolrCloud replication factor of 3, does that mean each document will probably have around 9 copies replicated underneath of HDFS? If so, is there a way to configure HDFS or Solr such that only three copies are maintained overall? Thanks
Re: Solr on HDFS vs local storage - Benchmarking
bq: We also had an HDFS setup already so it looked like a good option to not lose data. Earlier we had a few cases where we lost the machines so HDFS looked safer for that. right, that's one of the places where using HDFS to back Solr makes a lot of sense. The other approach is to just have replicas for each shard distributed across different physical machines. But whatever works is fine. And there are a bunch of parameters you can tune both on HDFS and for local file systems, so "it's more an art than a science". bq: Frequent adds with commits, which is likely not good in general anyway, does look quite a bit slower than local storage so far. I think you can go a long way towards fixing this by doing some autowarming. I wouldn't want to open a new searcher every second and do much autowarming over HDFS, but if you can stand less frequent commits (say every minute?) you might be able to smooth out the performance. Best, Erick On Wed, Nov 22, 2017 at 11:31 AM, Hendrik Haddorp wrote: > We actually use no auto warming. Our collections are pretty small and the > query performance is not really a problem so far. We are using lots of > collections and most Solr caches seem to be per core and not global so we > also have a problem with caching. I have to test the HDFS cache some more as > that should work across collections. > > We also had an HDFS setup already so it looked like a good option to not > lose data. Earlier we had a few cases where we lost the machines so HDFS > looked safer for that. > > I would expect that the HDFS performance is also quite good if you have lots > of document adds and not so frequent commits. Frequent adds with commits, > which is likely not good in general anyway, does look quite a bit slower > than local storage so far. As we didn't see that in our earlier tests, which > were more query-focused, I said it largely depends on what you are doing. 
> > Hendrik > > On 22.11.2017 18:41, Erick Erickson wrote: >> >> In my experience, for relatively static indexes the performance is >> roughly similar. Once the data is read from whatever data source it's >> in memory, where the data came from is (largely) secondary in >> importance. >> >> In cases where there's a lot of I/O I expect HDFS to be slower, this >> fits Hendrik's observation: "We now had a pattern with lots of small >> updates and commits and that seems to be quite a bit slower". He's >> merging segments and (presumably) autowarming frequently, implying >> lots of I/O and HDFS adds an extra layer. >> >> Personally I'd use whichever is most convenient and see if the >> performance was "good enough". I wouldn't recommend _installing_ HDFS >> just to use it with Solr, why add another complication? If you need >> the redundancy add replicas. If you already have the HDFS >> infrastructure in place and using HDFS is easier than local storage, >> feel free >> >> Best, >> Erick >> >> >> On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie >> wrote: >>> >>> Hendrik, >>> >>> Thanks for your response. >>> >>> Regarding "But this seems to greatly depend on how your setup looks like >>> and what actions you perform." May I know what factors influence >>> this and what considerations should be taken into account? >>> >>> Thanks >>> >>> On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp >>> wrote: >>> >>>> We did some testing and the performance was strangely even better with >>>> HDFS than with the local file system. But this seems to greatly >>>> depend on how your setup looks like and what actions you perform. We now >>>> had a pattern with lots of small updates and commits and that seems to be >>>> quite a bit slower. We are about to do performance testing on that now. >>>> >>>> The reason we switched to HDFS was largely connected to us using Docker >>>> and Marathon/Mesos. 
With HDFS the data is in a shared file system and >>>> thus it is possible to move the replica to a different instance on a >>>> different host. >>>> >>>> regards, >>>> Hendrik >>>> >>>> On 22.11.2017 14:59, Greenhorn Techie wrote: >>>>> >>>>> Hi, >>>>> >>>>> Good Afternoon!! >>>>> >>>>> While the discussion around issues related to "Solr on HDFS" is live, I >>>>> would like to understand if anyone has done any performance >>>>> benchmarking >>>>> for both Solr indexing and search between HDFS vs local file system. >>>>> >>>>> Also, from experience, what would the community folks suggest? Solr on >>>>> local file system or Solr on HDFS? Has anyone done a comparative study >>>>> of >>>>> these choices? >>>>> >>>>> Thanks >>>>> >>>> >
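Erick's suggestion (open searchers less frequently, add some autowarming) maps to solrconfig.xml roughly as below. The one-minute interval and the autowarm count are illustrative values, not settings from the thread; autoCommit/autoSoftCommit belong in the <updateHandler> section and filterCache in the <query> section:

```xml
<!-- Hard-commit often for durability, but without opening a searcher -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Open a new searcher only once a minute, per Erick's "say every minute?" -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>

<!-- Warm each new searcher from the hottest entries of the old one -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>
```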
Re: Solr on HDFS vs local storage - Benchmarking
We actually use no auto warming. Our collections are pretty small and the query performance is not really a problem so far. We are using lots of collections and most Solr caches seem to be per core and not global so we also have a problem with caching. I have to test the HDFS cache some more as that should work across collections. We also had an HDFS setup already so it looked like a good option to not lose data. Earlier we had a few cases where we lost the machines so HDFS looked safer for that. I would expect that the HDFS performance is also quite good if you have lots of document adds and not so frequent commits. Frequent adds with commits, which is likely not good in general anyway, does look quite a bit slower than local storage so far. As we didn't see that in our earlier tests, which were more query-focused, I said it largely depends on what you are doing. Hendrik On 22.11.2017 18:41, Erick Erickson wrote: In my experience, for relatively static indexes the performance is roughly similar. Once the data is read from whatever data source it's in memory, where the data came from is (largely) secondary in importance. In cases where there's a lot of I/O I expect HDFS to be slower, this fits Hendrik's observation: "We now had a pattern with lots of small updates and commits and that seems to be quite a bit slower". He's merging segments and (presumably) autowarming frequently, implying lots of I/O and HDFS adds an extra layer. Personally I'd use whichever is most convenient and see if the performance was "good enough". I wouldn't recommend _installing_ HDFS just to use it with Solr, why add another complication? If you need the redundancy add replicas. If you already have the HDFS infrastructure in place and using HDFS is easier than local storage, feel free. Best, Erick On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie wrote: Hendrik, Thanks for your response. Regarding "But this seems to greatly depend on how your setup looks like and what actions you perform." 
May I know what factors influence this and what considerations should be taken into account? Thanks On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp wrote: We did some testing and the performance was strangely even better with HDFS than with the local file system. But this seems to greatly depend on how your setup looks like and what actions you perform. We now had a pattern with lots of small updates and commits and that seems to be quite a bit slower. We are about to do performance testing on that now. The reason we switched to HDFS was largely connected to us using Docker and Marathon/Mesos. With HDFS the data is in a shared file system and thus it is possible to move the replica to a different instance on a different host. regards, Hendrik On 22.11.2017 14:59, Greenhorn Techie wrote: Hi, Good Afternoon!! While the discussion around issues related to "Solr on HDFS" is live, I would like to understand if anyone has done any performance benchmarking for both Solr indexing and search between HDFS vs local file system. Also, from experience, what would the community folks suggest? Solr on local file system or Solr on HDFS? Has anyone done a comparative study of these choices? Thanks
Re: Solr on HDFS vs local storage - Benchmarking
In my experience, for relatively static indexes the performance is roughly similar. Once the data is read from whatever data source it's in memory, where the data came from is (largely) secondary in importance. In cases where there's a lot of I/O I expect HDFS to be slower, this fits Hendrik's observation: "We now had a pattern with lots of small updates and commits and that seems to be quite a bit slower". He's merging segments and (presumably) autowarming frequently, implying lots of I/O and HDFS adds an extra layer. Personally I'd use whichever is most convenient and see if the performance was "good enough". I wouldn't recommend _installing_ HDFS just to use it with Solr, why add another complication? If you need the redundancy add replicas. If you already have the HDFS infrastructure in place and using HDFS is easier than local storage, feel free. Best, Erick On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie wrote: > Hendrik, > > Thanks for your response. > > Regarding "But this seems to greatly depend on how your setup looks like > and what actions you perform." May I know what factors influence > this and what considerations should be taken into account? > > Thanks > > On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp > wrote: > >> We did some testing and the performance was strangely even better with >> HDFS than with the local file system. But this seems to greatly >> depend on how your setup looks like and what actions you perform. We now >> had a pattern with lots of small updates and commits and that seems to be >> quite a bit slower. We are about to do performance testing on that now. >> >> The reason we switched to HDFS was largely connected to us using Docker >> and Marathon/Mesos. With HDFS the data is in a shared file system and >> thus it is possible to move the replica to a different instance on a >> different host. >> >> regards, >> Hendrik >> >> On 22.11.2017 14:59, Greenhorn Techie wrote: >> > Hi, >> > >> > Good Afternoon!! 
>> > >> > While the discussion around issues related to "Solr on HDFS" is live, I >> > would like to understand if anyone has done any performance benchmarking >> > for both Solr indexing and search between HDFS vs local file system. >> > >> > Also, from experience, what would the community folks suggest? Solr on >> > local file system or Solr on HDFS? Has anyone done a comparative study of >> > these choices? >> > >> > Thanks >> > >> >>
Re: Solr on HDFS vs local storage - Benchmarking
Hendrik, Thanks for your response. Regarding "But this seems to greatly depend on how your setup looks like and what actions you perform." May I know what factors influence this and what considerations should be taken into account? Thanks On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp wrote: > We did some testing and the performance was strangely even better with > HDFS than with the local file system. But this seems to greatly > depend on how your setup looks like and what actions you perform. We now > had a pattern with lots of small updates and commits and that seems to be > quite a bit slower. We are about to do performance testing on that now. > > The reason we switched to HDFS was largely connected to us using Docker > and Marathon/Mesos. With HDFS the data is in a shared file system and > thus it is possible to move the replica to a different instance on a > different host. > > regards, > Hendrik > > On 22.11.2017 14:59, Greenhorn Techie wrote: > > Hi, > > > > Good Afternoon!! > > > > While the discussion around issues related to "Solr on HDFS" is live, I > > would like to understand if anyone has done any performance benchmarking > > for both Solr indexing and search between HDFS vs local file system. > > > > Also, from experience, what would the community folks suggest? Solr on > > local file system or Solr on HDFS? Has anyone done a comparative study of > > these choices? > > > > Thanks > > > >
Re: Solr on HDFS vs local storage - Benchmarking
We did some testing and the performance was strangely even better with HDFS than with the local file system. But this seems to greatly depend on how your setup looks like and what actions you perform. We now had a pattern with lots of small updates and commits and that seems to be quite a bit slower. We are about to do performance testing on that now. The reason we switched to HDFS was largely connected to us using Docker and Marathon/Mesos. With HDFS the data is in a shared file system and thus it is possible to move the replica to a different instance on a different host. regards, Hendrik On 22.11.2017 14:59, Greenhorn Techie wrote: Hi, Good Afternoon!! While the discussion around issues related to "Solr on HDFS" is live, I would like to understand if anyone has done any performance benchmarking for both Solr indexing and search between HDFS vs local file system. Also, from experience, what would the community folks suggest? Solr on local file system or Solr on HDFS? Has anyone done a comparative study of these choices? Thanks
Solr on HDFS vs local storage - Benchmarking
Hi, Good Afternoon!! While the discussion around issues related to "Solr on HDFS" is live, I would like to understand if anyone has done any performance benchmarking for both Solr indexing and search between HDFS vs local file system. Also, from experience, what would the community folks suggest? Solr on local file system or Solr on HDFS? Has anyone done a comparative study of these choices? Thanks
Re: Solr on HDFS: AutoAddReplica does not add a replica
I'm also not really an HDFS expert but I believe it is slightly different: The HDFS data is replicated, let's say 3 times, between the HDFS data nodes but for an HDFS client it looks like one directory and it is hidden that the data is replicated. Every client should see the same data. Just like every client should see the same data in ZooKeeper (every ZK node also has a full replica). So with 2 replicas there should only be two disjoint data sets. Thus it should not matter which solr node claims the replica and then continues where things were left. Solr should only be concerned about the replication between the solr replicas but not about the replication between the HDFS data nodes, just as it does not have to deal with the replication between the ZK nodes. Anyhow, for now I would be happy if my patch for SOLR-10092 could get included soon as the auto add replica feature does not work without that at all for me :-) On 22.02.2017 16:15, Erick Erickson wrote: bq: in the non-HDFS case that sounds logical but in the HDFS case all the index data is in the shared HDFS file system That's not really the point, and it's not quite true. The Solr index is unique _per replica_. So replica1 points to an HDFS directory (that's triply replicated to be sure). replica2 points to a totally different set of index files. So with the default replication of 3 your two replicas will have 6 copies of the index that are totally disjoint in two sets of three. From Solr's point of view, the fact that HDFS replicates the data doesn't really alter much. AutoAddReplica will indeed be able to re-use the HDFS data if a Solr node goes away. But that doesn't change the replication issue I described. At least that's my understanding, I admit I'm not an HDFS guy and it may be out of date. Erick On Tue, Feb 21, 2017 at 10:30 PM, Hendrik Haddorp wrote: Hi Erick, in the non-HDFS case that sounds logical but in the HDFS case all the index data is in the shared HDFS file system. 
Even the transaction logs should be in there. So the node that once had the replica should not really have more information than any other node, especially if legacyCloud is set to false so that ZooKeeper holds the truth. regards, Hendrik On 22.02.2017 02:28, Erick Erickson wrote: Hendrik: bq: Not really sure why one replica needs to be up though. I didn't write the code so I'm guessing a bit, but consider the situation where you have no replicas for a shard up and add a new one. Eventually it could become the leader but there would have been no chance for it to check if its version of the index was up to date. But since it would be the leader, when other replicas for that shard _do_ come on line they'd replicate the index down from the newly added replica, possibly using very old data. FWIW, Erick On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp wrote: Hi, I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago. I was now able to get this feature working with a very small code change. After a few seconds Solr reassigns the replica to a different Solr instance as long as one replica is still up. Not really sure why one replica needs to be up though. I added the patch based on Solr 6.3 to the bug report. Would be great if it could be merged soon. regards, Hendrik On 19.01.2017 17:08, Hendrik Haddorp wrote: HDFS is like a shared filesystem so every Solr Cloud instance can access the data using the same path or URL. 
The clusterstate.json looks like this:

"shards":{"shard1":{
  "range":"8000-7fff",
  "state":"active",
  "replicas":{
    "core_node1":{
      "core":"test1.collection-0_shard1_replica1",
      "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
      "base_url":"http://slave3:9000/solr",
      "node_name":"slave3:9000_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
    "core_node2":{
      "core":"test1.collection-0_shard1_replica2",
      "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
      "base_url":"http://slave2:9000/solr",
      "node_name":"slave2:9000_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
      "leader":"true"},
    "core_node3":{
      "core":"test1.collection-0_shard1_replica3",
      "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
      "base_url":"http://slave4:9005/solr",
      "node_name":"slave4:9005_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored in ZK, pretty much the same as for non-HDFS setups. Just as the data is not stored locally but on the network and as the path d
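For reference, this per-replica state can also be pulled from a live cluster through the Collections API; a sketch, using the host and collection name from the example above:

```shell
# CLUSTERSTATUS returns the same shard/replica/dataDir information
# that the clusterstate.json excerpt above shows.
curl "http://slave3:9000/solr/admin/collections?action=CLUSTERSTATUS&collection=test1.collection-0"
```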
Re: Solr on HDFS: AutoAddReplica does not add a replica
bq: in the non-HDFS case that sounds logical but in the HDFS case all the index data is in the shared HDFS file system That's not really the point, and it's not quite true. The Solr index is unique _per replica_. So replica1 points to an HDFS directory (that's triply replicated to be sure). replica2 points to a totally different set of index files. So with the default replication of 3 your two replicas will have 6 copies of the index that are totally disjoint in two sets of three. From Solr's point of view, the fact that HDFS replicates the data doesn't really alter much. AutoAddReplica will indeed be able to re-use the HDFS data if a Solr node goes away. But that doesn't change the replication issue I described. At least that's my understanding, I admit I'm not an HDFS guy and it may be out of date. Erick On Tue, Feb 21, 2017 at 10:30 PM, Hendrik Haddorp wrote: > Hi Erick, > > in the non-HDFS case that sounds logical but in the HDFS case all the index > data is in the shared HDFS file system. Even the transaction logs should be > in there. So the node that once had the replica should not really have more > information than any other node, especially if legacyCloud is set to false > so that ZooKeeper holds the truth. > > regards, > Hendrik > > On 22.02.2017 02:28, Erick Erickson wrote: >> >> Hendrik: >> >> bq: Not really sure why one replica needs to be up though. >> >> I didn't write the code so I'm guessing a bit, but consider the >> situation where you have no replicas for a shard up and add a new one. >> Eventually it could become the leader but there would have been no >> chance for it to check if its version of the index was up to date. >> But since it would be the leader, when other replicas for that shard >> _do_ come on line they'd replicate the index down from the newly added >> replica, possibly using very old data. 
>> >> FWIW, >> Erick >> >> On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp >> wrote: >>> >>> Hi, >>> >>> I had opened SOLR-10092 >>> (https://issues.apache.org/jira/browse/SOLR-10092) >>> for this a while ago. I was now able to gt this feature working with a >>> very >>> small code change. After a few seconds Solr reassigns the replica to a >>> different Solr instance as long as one replica is still up. Not really >>> sure >>> why one replica needs to be up though. I added the patch based on Solr >>> 6.3 >>> to the bug report. Would be great if it could be merged soon. >>> >>> regards, >>> Hendrik >>> >>> On 19.01.2017 17:08, Hendrik Haddorp wrote: HDFS is like a shared filesystem so every Solr Cloud instance can access the data using the same path or URL. The clusterstate.json looks like this: "shards":{"shard1":{ "range":"8000-7fff", "state":"active", "replicas":{ "core_node1":{ "core":"test1.collection-0_shard1_replica1", "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/", "base_url":"http://slave3:9000/solr";, "node_name":"slave3:9000_solr", "state":"active", "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"}, "core_node2":{ "core":"test1.collection-0_shard1_replica2", "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/", "base_url":"http://slave2:9000/solr";, "node_name":"slave2:9000_solr", "state":"active", "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog", "leader":"true"}, "core_node3":{ "core":"test1.collection-0_shard1_replica3", "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/", "base_url":"http://slave4:9005/solr";, "node_name":"slave4:9005_solr", "state":"active", "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog" So every replica is always assigned to one node and this is being stored in ZK, pretty much the same as for none HDFS setups. 
Just as the data is not stored locally but on the network and as the path does not contain any node information you can of course easily take over the work to a different Solr node. You should just need to update the owner of the replica in ZK and you should basically be done, I assume. That's why the documentation states that an advantage of using HDFS is that a failing node can be replaced by a different one. The Overseer just has to move the ownership of the replica, which seems like what the code is trying to do. There just seems to be a bug in the code so that the core does not get created on the target node. Each data directory also contains a lock file. The
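The "move the ownership of the replica" step Hendrik describes exists as a Collections API action (MOVEREPLICA) in Solr releases newer than the 6.3 used in this thread; on a shared file system such as HDFS the target node can take over the existing index directory instead of re-replicating it. A sketch with a small URL-building helper, where the host, collection, and node names are illustrative examples:

```shell
# Build a MOVEREPLICA request URL.
# $1 collection, $2 shard, $3 replica (core_node name), $4 target node
movereplica_url() {
    echo "http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=$1&shard=$2&replica=$3&targetNode=$4"
}

movereplica_url test1.collection-0 shard1 core_node3 slave5:9000_solr
# prints the request URL; to execute: curl "$(movereplica_url ...)"
```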
Re: Solr on HDFS: AutoAddReplica does not add a replica
Hi Erick, in the non-HDFS case that sounds logical but in the HDFS case all the index data is in the shared HDFS file system. Even the transaction logs should be in there. So the node that once had the replica should not really have more information than any other node, especially if legacyCloud is set to false so that ZooKeeper holds the truth. regards, Hendrik On 22.02.2017 02:28, Erick Erickson wrote: Hendrik: bq: Not really sure why one replica needs to be up though. I didn't write the code so I'm guessing a bit, but consider the situation where you have no replicas for a shard up and add a new one. Eventually it could become the leader but there would have been no chance for it to check if its version of the index was up to date. But since it would be the leader, when other replicas for that shard _do_ come on line they'd replicate the index down from the newly added replica, possibly using very old data. FWIW, Erick On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp wrote: Hi, I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago. I was now able to get this feature working with a very small code change. After a few seconds Solr reassigns the replica to a different Solr instance as long as one replica is still up. Not really sure why one replica needs to be up though. I added the patch based on Solr 6.3 to the bug report. Would be great if it could be merged soon. regards, Hendrik On 19.01.2017 17:08, Hendrik Haddorp wrote: HDFS is like a shared filesystem so every Solr Cloud instance can access the data using the same path or URL. 
The clusterstate.json looks like this: "shards":{"shard1":{ "range":"8000-7fff", "state":"active", "replicas":{ "core_node1":{ "core":"test1.collection-0_shard1_replica1", "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/", "base_url":"http://slave3:9000/solr";, "node_name":"slave3:9000_solr", "state":"active", "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"}, "core_node2":{ "core":"test1.collection-0_shard1_replica2", "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/", "base_url":"http://slave2:9000/solr";, "node_name":"slave2:9000_solr", "state":"active", "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog", "leader":"true"}, "core_node3":{ "core":"test1.collection-0_shard1_replica3", "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/", "base_url":"http://slave4:9005/solr";, "node_name":"slave4:9005_solr", "state":"active", "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog" So every replica is always assigned to one node and this is being stored in ZK, pretty much the same as for none HDFS setups. Just as the data is not stored locally but on the network and as the path does not contain any node information you can of course easily take over the work to a different Solr node. You should just need to update the owner of the replica in ZK and you should basically be done, I assume. That's why the documentation states that an advantage of using HDFS is that a failing node can be replaced by a different one. The Overseer just has to move the ownership of the replica, which seems like what the code is trying to do. There just seems to be a bug in the code so that the core does not get created on the target node. Each data directory also contains a lock file. The documentation states that one should use the HdfsLockFactory, which unfortunately can easily lead to SOLR-8335, which hopefully will be fixed by SOLR-8169. 
A manual cleanup is however also easily done but seems to require a node restart to take effect. But I'm also only recently playing around with all this ;-) regards, Hendrik On 19.01.2017 16:40, Shawn Heisey wrote: On 1/19/2017 4:09 AM, Hendrik Haddorp wrote: Given that the data is on HDFS it shouldn't matter if any active replica is left as the data does not need to get transferred from another instance but the new core will just take over the existing data. Thus a replication factor of 1 should also work just in that case the shard would be down until the new core is up. Anyhow, it looks like the above call is missing to set the shard id I guess or some code is checking wrongly. I know very little about how SolrCloud interacts with HDFS, so although I'm reasonably certain about what comes below, I could be wrong. I have not ever heard of SolrCloud being able to automatically take over an existing index directory when it creates a replica, or even share index directories unless the admin fools it into doing so without its knowledge. Sharing an index directory for replicas with SolrCloud would NOT work correctly
Re: Solr on HDFS: AutoAddReplica does not add a replica
Hendrik: bq: Not really sure why one replica needs to be up though. I didn't write the code so I'm guessing a bit, but consider the situation where you have no replicas for a shard up and add a new one. Eventually it could become the leader but there would have been no chance for it to check if its version of the index was up to date. But since it would be the leader, when other replicas for that shard _do_ come on line they'd replicate the index down from the newly added replica, possibly using very old data. FWIW, Erick On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp wrote: > Hi, > > I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092) > for this a while ago. I was now able to get this feature working with a very > small code change. After a few seconds Solr reassigns the replica to a > different Solr instance as long as one replica is still up. Not really sure > why one replica needs to be up though. I added the patch based on Solr 6.3 > to the bug report. Would be great if it could be merged soon. > > regards, > Hendrik > > On 19.01.2017 17:08, Hendrik Haddorp wrote: >> >> HDFS is like a shared filesystem so every Solr Cloud instance can access >> the data using the same path or URL. 
The clusterstate.json looks like this:
>>
>> "shards":{"shard1":{
>>   "range":"8000-7fff",
>>   "state":"active",
>>   "replicas":{
>>     "core_node1":{
>>       "core":"test1.collection-0_shard1_replica1",
>>       "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
>>       "base_url":"http://slave3:9000/solr",
>>       "node_name":"slave3:9000_solr",
>>       "state":"active",
>>       "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
>>     "core_node2":{
>>       "core":"test1.collection-0_shard1_replica2",
>>       "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
>>       "base_url":"http://slave2:9000/solr",
>>       "node_name":"slave2:9000_solr",
>>       "state":"active",
>>       "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
>>       "leader":"true"},
>>     "core_node3":{
>>       "core":"test1.collection-0_shard1_replica3",
>>       "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
>>       "base_url":"http://slave4:9005/solr",
>>       "node_name":"slave4:9005_solr",
>>       "state":"active",
>>       "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"
>>
>> So every replica is always assigned to one node and this is being stored in ZK, pretty much the same as for non-HDFS setups. Since the data is not stored locally but on the network, and the path does not contain any node information, a different Solr node can easily take over the work. You should just need to update the owner of the replica in ZK and you should basically be done, I assume. That's why the documentation states that an advantage of using HDFS is that a failing node can be replaced by a different one. The Overseer just has to move the ownership of the replica, which seems like what the code is trying to do. There just seems to be a bug in the code so that the core does not get created on the target node.
>>
>> Each data directory also contains a lock file.
The documentation states >> that one should use the HdfsLockFactory, which unfortunately can easily lead >> to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup >> is however also easily done but seems to require a node restart to take >> effect. But I'm also only recently playing around with all this ;-) >> >> regards, >> Hendrik >> >> On 19.01.2017 16:40, Shawn Heisey wrote: >>> >>> On 1/19/2017 4:09 AM, Hendrik Haddorp wrote: Given that the data is on HDFS it shouldn't matter if any active replica is left as the data does not need to get transferred from another instance but the new core will just take over the existing data. Thus a replication factor of 1 should also work just in that case the shard would be down until the new core is up. Anyhow, it looks like the above call is missing to set the shard id I guess or some code is checking wrongly. >>> >>> I know very little about how SolrCloud interacts with HDFS, so although >>> I'm reasonably certain about what comes below, I could be wrong. >>> >>> I have not ever heard of SolrCloud being able to automatically take over >>> an existing index directory when it creates a replica, or even share >>> index directories unless the admin fools it into doing so without its >>> knowledge. Sharing an index directory for replicas with SolrCloud would >>> NOT work correctly. Solr must be able to update all replicas >>> independently, which means that each of them will lock its index >>> directory and write to it. >>> >>> It is my understanding (from reading messages on mailing lists) that when >>> using HDFS, Solr replicas are all separate and consume additional disk >>> space, just like on a regular filesystem.
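Erick's concern above can be sketched in miniature: if the only live replica for a shard is a newly added one with an old copy of the index, it wins the leader election, and replicas that come back later sync *down* to its stale version. The following Python simulation is purely illustrative (replica names, version numbers, and helper functions are invented, not Solr's actual election code):

```python
# Hypothetical sketch of why electing a freshly added replica as leader
# can publish stale data. Not Solr's real election logic.

def elect_leader(replicas):
    """Pick any live replica; with no other replica up, there is nothing
    to compare index versions against."""
    live = [r for r in replicas if r["live"]]
    return live[0]["name"] if live else None

def sync_from_leader(replicas, leader):
    """Followers replicate the leader's index, whatever its version."""
    leader_version = next(r for r in replicas if r["name"] == leader)["version"]
    for r in replicas:
        if r["live"] and r["name"] != leader:
            r["version"] = leader_version
    return replicas

# Shard where all original replicas are down; a new replica is added
# holding an old copy of the index (version 3 vs. the true latest, 7).
replicas = [
    {"name": "core_node1", "version": 7, "live": False},
    {"name": "core_node_new", "version": 3, "live": True},
]
leader = elect_leader(replicas)     # the only live replica wins
replicas[0]["live"] = True          # the original replica comes back...
sync_from_leader(replicas, leader)  # ...and syncs DOWN to version 3
```

This is why requiring at least one existing replica to be up before reassigning makes sense: it gives the cluster something to verify the index version against.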
Re: Solr on HDFS: AutoAddReplica does not add a replica
Hi, I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago. I was now able to get this feature working with a very small code change. After a few seconds Solr reassigns the replica to a different Solr instance as long as one replica is still up. Not really sure why one replica needs to be up though. I added the patch based on Solr 6.3 to the bug report. Would be great if it could be merged soon. regards, Hendrik On 19.01.2017 17:08, Hendrik Haddorp wrote: HDFS is like a shared filesystem so every Solr Cloud instance can access the data using the same path or URL. The clusterstate.json looks like this:

"shards":{"shard1":{
  "range":"8000-7fff",
  "state":"active",
  "replicas":{
    "core_node1":{
      "core":"test1.collection-0_shard1_replica1",
      "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
      "base_url":"http://slave3:9000/solr",
      "node_name":"slave3:9000_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
    "core_node2":{
      "core":"test1.collection-0_shard1_replica2",
      "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
      "base_url":"http://slave2:9000/solr",
      "node_name":"slave2:9000_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
      "leader":"true"},
    "core_node3":{
      "core":"test1.collection-0_shard1_replica3",
      "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
      "base_url":"http://slave4:9005/solr",
      "node_name":"slave4:9005_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored in ZK, pretty much the same as for non-HDFS setups. Since the data is not stored locally but on the network, and the path does not contain any node information, a different Solr node can easily take over the work.
You should just need to update the owner of the replica in ZK and you should basically be done, I assume. That's why the documentation states that an advantage of using HDFS is that a failing node can be replaced by a different one. The Overseer just has to move the ownership of the replica, which seems like what the code is trying to do. There just seems to be a bug in the code so that the core does not get created on the target node. Each data directory also contains a lock file. The documentation states that one should use the HdfsLockFactory, which unfortunately can easily lead to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup is however also easily done but seems to require a node restart to take effect. But I'm also only recently playing around with all this ;-) regards, Hendrik On 19.01.2017 16:40, Shawn Heisey wrote: On 1/19/2017 4:09 AM, Hendrik Haddorp wrote: Given that the data is on HDFS it shouldn't matter if any active replica is left as the data does not need to get transferred from another instance but the new core will just take over the existing data. Thus a replication factor of 1 should also work just in that case the shard would be down until the new core is up. Anyhow, it looks like the above call is missing to set the shard id I guess or some code is checking wrongly. I know very little about how SolrCloud interacts with HDFS, so although I'm reasonably certain about what comes below, I could be wrong. I have not ever heard of SolrCloud being able to automatically take over an existing index directory when it creates a replica, or even share index directories unless the admin fools it into doing so without its knowledge. Sharing an index directory for replicas with SolrCloud would NOT work correctly. Solr must be able to update all replicas independently, which means that each of them will lock its index directory and write to it. 
It is my understanding (from reading messages on mailing lists) that when using HDFS, Solr replicas are all separate and consume additional disk space, just like on a regular filesystem. I found the code that generates the "No shard id" exception, but my knowledge of how the zookeeper code in Solr works is not deep enough to understand what it means or how to fix it. Thanks, Shawn
Re: Solr on HDFS: AutoAddReplica does not add a replica
HDFS is like a shared filesystem so every Solr Cloud instance can access the data using the same path or URL. The clusterstate.json looks like this:

"shards":{"shard1":{
  "range":"8000-7fff",
  "state":"active",
  "replicas":{
    "core_node1":{
      "core":"test1.collection-0_shard1_replica1",
      "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
      "base_url":"http://slave3:9000/solr",
      "node_name":"slave3:9000_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
    "core_node2":{
      "core":"test1.collection-0_shard1_replica2",
      "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
      "base_url":"http://slave2:9000/solr",
      "node_name":"slave2:9000_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
      "leader":"true"},
    "core_node3":{
      "core":"test1.collection-0_shard1_replica3",
      "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
      "base_url":"http://slave4:9005/solr",
      "node_name":"slave4:9005_solr",
      "state":"active",
      "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored in ZK, pretty much the same as for non-HDFS setups. Since the data is not stored locally but on the network, and the path does not contain any node information, a different Solr node can easily take over the work. You should just need to update the owner of the replica in ZK and you should basically be done, I assume. That's why the documentation states that an advantage of using HDFS is that a failing node can be replaced by a different one. The Overseer just has to move the ownership of the replica, which seems like what the code is trying to do. There just seems to be a bug in the code so that the core does not get created on the target node. Each data directory also contains a lock file.
The documentation states that one should use the HdfsLockFactory, which unfortunately can easily lead to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup is however also easily done but seems to require a node restart to take effect. But I'm also only recently playing around with all this ;-) regards, Hendrik On 19.01.2017 16:40, Shawn Heisey wrote: On 1/19/2017 4:09 AM, Hendrik Haddorp wrote: Given that the data is on HDFS it shouldn't matter if any active replica is left as the data does not need to get transferred from another instance but the new core will just take over the existing data. Thus a replication factor of 1 should also work just in that case the shard would be down until the new core is up. Anyhow, it looks like the above call is missing to set the shard id I guess or some code is checking wrongly. I know very little about how SolrCloud interacts with HDFS, so although I'm reasonably certain about what comes below, I could be wrong. I have not ever heard of SolrCloud being able to automatically take over an existing index directory when it creates a replica, or even share index directories unless the admin fools it into doing so without its knowledge. Sharing an index directory for replicas with SolrCloud would NOT work correctly. Solr must be able to update all replicas independently, which means that each of them will lock its index directory and write to it. It is my understanding (from reading messages on mailing lists) that when using HDFS, Solr replicas are all separate and consume additional disk space, just like on a regular filesystem. I found the code that generates the "No shard id" exception, but my knowledge of how the zookeeper code in Solr works is not deep enough to understand what it means or how to fix it. Thanks, Shawn
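Hendrik's suggestion — that moving a replica to another node should only require updating its owner in the stored cluster state — can be sketched with the clusterstate entries above. The helper below is hypothetical (not the actual Overseer code); the point it demonstrates is that only `node_name` and `base_url` change, while `dataDir` keeps pointing at the same HDFS path, so no index data needs to be copied:

```python
# Hypothetical sketch: "moving" a replica on shared storage is just a
# metadata change. The HDFS dataDir is node-independent and stays fixed.

def move_replica(replica, new_host):
    """Reassign a replica to new_host, keeping its HDFS dataDir."""
    moved = dict(replica)
    moved["node_name"] = f"{new_host}_solr"
    moved["base_url"] = f"http://{new_host}/solr"
    return moved

# Entry shaped like core_node1 from the clusterstate.json above.
core_node1 = {
    "core": "test1.collection-0_shard1_replica1",
    "dataDir": "hdfs://master:8000/test1.collection-0/core_node1/data/",
    "base_url": "http://slave3:9000/solr",
    "node_name": "slave3:9000_solr",
}
# "slave5:9000" is an invented replacement host for illustration.
moved = move_replica(core_node1, "slave5:9000")
```

The dataDir being unchanged is exactly why the documentation can promise that a failed node is replaceable by a different one without re-replicating the index.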
Re: Solr on HDFS: AutoAddReplica does not add a replica
On 1/19/2017 4:09 AM, Hendrik Haddorp wrote: > Given that the data is on HDFS it shouldn't matter if any active > replica is left as the data does not need to get transferred from > another instance but the new core will just take over the existing > data. Thus a replication factor of 1 should also work just in that > case the shard would be down until the new core is up. Anyhow, it > looks like the above call is missing to set the shard id I guess or > some code is checking wrongly. I know very little about how SolrCloud interacts with HDFS, so although I'm reasonably certain about what comes below, I could be wrong. I have not ever heard of SolrCloud being able to automatically take over an existing index directory when it creates a replica, or even share index directories unless the admin fools it into doing so without its knowledge. Sharing an index directory for replicas with SolrCloud would NOT work correctly. Solr must be able to update all replicas independently, which means that each of them will lock its index directory and write to it. It is my understanding (from reading messages on mailing lists) that when using HDFS, Solr replicas are all separate and consume additional disk space, just like on a regular filesystem. I found the code that generates the "No shard id" exception, but my knowledge of how the zookeeper code in Solr works is not deep enough to understand what it means or how to fix it. Thanks, Shawn
Re: Solr on HDFS: AutoAddReplica does not add a replica
Hi, I'm seeing the same issue on Solr 6.3 using HDFS and a replication factor of 3, even though I believe a replication factor of 1 should work the same. When I stop a Solr instance this is detected and Solr actually wants to create a replica on a different instance. The command for that does however fail:

o.a.s.c.OverseerAutoReplicaFailoverThread Exception trying to create new replica on http://...:9000/solr:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://...:9000/solr: Error CREATEing SolrCore 'test2.collection-09_shard1_replica1': Unable to create core [test2.collection-09_shard1_replica1]
Caused by: No shard id for CoreDescriptor[name=test2.collection-09_shard1_replica1;instanceDir=/var/opt/solr/test2.collection-09_shard1_replica1]
  at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:593)
  at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
  at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:251)
  at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
  at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.createSolrCore(OverseerAutoReplicaFailoverThread.java:456)
  at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.lambda$addReplica$0(OverseerAutoReplicaFailoverThread.java:251)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Given that the data is on HDFS it shouldn't matter if any active replica is left as the data does not need to get transferred from another instance but the new core will just take over the existing data.
Thus a replication factor of 1 should also work; just in that case the shard would be down until the new core is up. Anyhow, it looks like the above call fails to set the shard id, I guess, or some code is checking wrongly. On 14.01.2017 02:44, Shawn Heisey wrote: On 1/13/2017 5:46 PM, Chetas Joshi wrote: One of the things I have observed is: if I use the collection API to create a replica for that shard, it does not complain about the config which has been set to ReplicationFactor=1. If replication factor was the issue as suggested by Shawn, shouldn't it complain? The replicationFactor value is used by exactly two things: initial collection creation, and autoAddReplicas. It will not affect ANY other command or operation, including ADDREPLICA. You can create MORE replicas than replicationFactor indicates, and there will be no error messages or warnings. In order to have a replica automatically added, your replicationFactor must be at least two, and the number of active replicas in the cloud for a shard must be less than that number. If that's the case and the expiration times have been reached without recovery, then Solr will automatically add replicas until there are at least as many replicas operational as specified in replicationFactor. I would also like to mention that I experience some instance dirs getting deleted and also found this open bug (https://issues.apache.org/jira/browse/SOLR-8905) The description on that issue is incomprehensible. I can't make any sense out of it. It mentions the core.properties file, but the error message shown doesn't talk about the properties file at all. The error and issue description seem to have nothing at all to do with the code lines that were quoted. Also, it was reported on version 4.10.3 ... but this is going to be significantly different from current 6.x versions, and the 4.x versions will NOT be updated with bugfixes. Thanks, Shawn
Re: Solr on HDFS: AutoAddReplica does not add a replica
On 1/13/2017 5:46 PM, Chetas Joshi wrote: > One of the things I have observed is: if I use the collection API to > create a replica for that shard, it does not complain about the config > which has been set to ReplicationFactor=1. If replication factor was > the issue as suggested by Shawn, shouldn't it complain? The replicationFactor value is used by exactly two things: initial collection creation, and autoAddReplicas. It will not affect ANY other command or operation, including ADDREPLICA. You can create MORE replicas than replicationFactor indicates, and there will be no error messages or warnings. In order to have a replica automatically added, your replicationFactor must be at least two, and the number of active replicas in the cloud for a shard must be less than that number. If that's the case and the expiration times have been reached without recovery, then Solr will automatically add replicas until there are at least as many replicas operational as specified in replicationFactor. > I would also like to mention that I experience some instance dirs > getting deleted and also found this open bug > (https://issues.apache.org/jira/browse/SOLR-8905) The description on that issue is incomprehensible. I can't make any sense out of it. It mentions the core.properties file, but the error message shown doesn't talk about the properties file at all. The error and issue description seem to have nothing at all to do with the code lines that were quoted. Also, it was reported on version 4.10.3 ... but this is going to be significantly different from current 6.x versions, and the 4.x versions will NOT be updated with bugfixes. Thanks, Shawn
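Shawn's rule for when autoAddReplicas fires can be restated as a small predicate. This is a simplified sketch (the function name is invented, and it ignores the recovery-timeout condition he mentions):

```python
# Simplified restatement of Shawn's autoAddReplicas trigger condition:
# replicas are only auto-added when replicationFactor is at least two
# AND fewer replicas than that are currently active.

def should_auto_add(replication_factor, active_replicas):
    """True if Solr would try to add a replica for this shard."""
    return replication_factor >= 2 and active_replicas < replication_factor

# With replicationFactor=1 (the config in this thread), the condition
# can never be true, so nothing is ever auto-added.
```

This makes the thread's core problem visible at a glance: `should_auto_add(1, 0)` is false, so a down shard with replicationFactor=1 stays down.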
Re: Solr on HDFS: AutoAddReplica does not add a replica
Erick, I have not changed any config. I have autoAddReplica = true for individual collection config as well as the overall cluster config. Still, it does not add a replica when I decommission a node. Adding a replica is the overseer's job. I looked at the logs of the overseer of the SolrCloud but could not find anything there either. I am doing some testing using different configs. I would be happy to share my findings. One of the things I have observed is: if I use the collection API to create a replica for that shard, it does not complain about the config which has been set to ReplicationFactor=1. If replication factor was the issue as suggested by Shawn, shouldn't it complain? I would also like to mention that I experience some instance dirs getting deleted and also found this open bug ( https://issues.apache.org/jira/browse/SOLR-8905) Thanks! On Thu, Jan 12, 2017 at 9:50 AM, Erick Erickson wrote: > Hmmm, have you changed any of the settings for autoAddReplicas? There > are several parameters that govern how long before a replica would be > added. > > But I suggest you use the Cloudera resources for this question, not > only did they write this functionality, but Cloudera support is deeply > embedded in HDFS and I suspect has _by far_ the most experience with > it. > > And that said, anything you find out that would suggest good ways to > clarify the docs would be most welcome! > > Best, > Erick > > On Thu, Jan 12, 2017 at 8:42 AM, Shawn Heisey wrote: > > On 1/11/2017 7:14 PM, Chetas Joshi wrote: > >> This is what I understand about how Solr works on HDFS. Please correct > me > >> if I am wrong. > >> > >> Although solr shard replication Factor = 1, HDFS default replication = > 3. > >> When the node goes down, the solr server running on that node goes down > and > >> hence the instance (core) representing the replica goes down. The data is > >> on HDFS (distributed across all the datanodes of the hadoop cluster > with 3X > >> replication).
This is the reason why I have kept replicationFactor=1. > >> > >> As per the link: > >> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS > >> One benefit to running Solr in HDFS is the ability to automatically add > new > >> replicas when the Overseer notices that a shard has gone down. Because > the > >> "gone" index shards are stored in HDFS, a new core will be created and > the > >> new core will point to the existing indexes in HDFS. > >> > >> This is the expected behavior of Solr overseer which I am not able to > see. > >> After a couple of hours a node was assigned to host the shard but the > >> status of the shard is still "down" and the instance dir is missing on > that > >> node for that particular shard_replica. > > > > As I said before, I know very little about HDFS, so the following could > > be wrong, but it makes sense so I'll say it: > > > > I would imagine that Solr doesn't know or care what your HDFS > > replication is ... the only replicas it knows about are the ones that it > > is managing itself. The autoAddReplicas feature manages *SolrCloud* > > replicas, not HDFS replicas. > > > > I have seen people say that multiple SolrCloud replicas will take up > > additional space in HDFS -- they do not point at the same index files. > > This is because proper Lucene operation requires that it lock an index > > and prevent any other thread/process from writing to the index at the > > same time. When you index, SolrCloud updates all replicas independently > > -- the only time indexes are replicated is when you add a new replica or > > a serious problem has occurred and an index needs to be recovered. > > > > Thanks, > > Shawn > > >
Re: Solr on HDFS: AutoAddReplica does not add a replica
Hmmm, have you changed any of the settings for autoAddReplicas? There are several parameters that govern how long before a replica would be added. But I suggest you use the Cloudera resources for this question, not only did they write this functionality, but Cloudera support is deeply embedded in HDFS and I suspect has _by far_ the most experience with it. And that said, anything you find out that would suggest good ways to clarify the docs would be most welcome! Best, Erick On Thu, Jan 12, 2017 at 8:42 AM, Shawn Heisey wrote: > On 1/11/2017 7:14 PM, Chetas Joshi wrote: >> This is what I understand about how Solr works on HDFS. Please correct me >> if I am wrong. >> >> Although solr shard replication Factor = 1, HDFS default replication = 3. >> When the node goes down, the solr server running on that node goes down and >> hence the instance (core) representing the replica goes down. The data is >> on HDFS (distributed across all the datanodes of the hadoop cluster with 3X >> replication). This is the reason why I have kept replicationFactor=1. >> >> As per the link: >> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS >> One benefit to running Solr in HDFS is the ability to automatically add new >> replicas when the Overseer notices that a shard has gone down. Because the >> "gone" index shards are stored in HDFS, a new core will be created and the >> new core will point to the existing indexes in HDFS. >> >> This is the expected behavior of Solr overseer which I am not able to see. >> After a couple of hours a node was assigned to host the shard but the >> status of the shard is still "down" and the instance dir is missing on that >> node for that particular shard_replica. > > As I said before, I know very little about HDFS, so the following could > be wrong, but it makes sense so I'll say it: > > I would imagine that Solr doesn't know or care what your HDFS > replication is ...
the only replicas it knows about are the ones that it > is managing itself. The autoAddReplicas feature manages *SolrCloud* > replicas, not HDFS replicas. > > I have seen people say that multiple SolrCloud replicas will take up > additional space in HDFS -- they do not point at the same index files. > This is because proper Lucene operation requires that it lock an index > and prevent any other thread/process from writing to the index at the > same time. When you index, SolrCloud updates all replicas independently > -- the only time indexes are replicated is when you add a new replica or > a serious problem has occurred and an index needs to be recovered. > > Thanks, > Shawn >
Re: Solr on HDFS: AutoAddReplica does not add a replica
On 1/11/2017 7:14 PM, Chetas Joshi wrote: > This is what I understand about how Solr works on HDFS. Please correct me > if I am wrong. > > Although solr shard replication Factor = 1, HDFS default replication = 3. > When the node goes down, the solr server running on that node goes down and > hence the instance (core) representing the replica goes down. The data in > on HDFS (distributed across all the datanodes of the hadoop cluster with 3X > replication). This is the reason why I have kept replicationFactor=1. > > As per the link: > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS > One benefit to running Solr in HDFS is the ability to automatically add new > replicas when the Overseer notices that a shard has gone down. Because the > "gone" index shards are stored in HDFS, a new core will be created and the > new core will point to the existing indexes in HDFS. > > This is the expected behavior of Solr overseer which I am not able to see. > After a couple of hours a node was assigned to host the shard but the > status of the shard is still "down" and the instance dir is missing on that > node for that particular shard_replica. As I said before, I know very little about HDFS, so the following could be wrong, but it makes sense so I'll say it: I would imagine that Solr doesn't know or care what your HDFS replication is ... the only replicas it knows about are the ones that it is managing itself. The autoAddReplicas feature manages *SolrCloud* replicas, not HDFS replicas. I have seen people say that multiple SolrCloud replicas will take up additional space in HDFS -- they do not point at the same index files. This is because proper Lucene operation requires that it lock an index and prevent any other thread/process from writing to the index at the same time. 
When you index, SolrCloud updates all replicas independently -- the only time indexes are replicated is when you add a new replica or a serious problem has occurred and an index needs to be recovered. Thanks, Shawn
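Shawn's point that SolrCloud replicas on HDFS do not share index files means storage multiplies twice: once by the number of SolrCloud replicas, and again by HDFS block replication. A back-of-envelope helper (the function and the figures are illustrative, not measurements):

```python
# Storage cost of SolrCloud-on-HDFS: replicas don't share index files,
# and HDFS replicates every block, so the two factors multiply.

def stored_tb(index_tb, solr_replicas, hdfs_replication=3):
    """Total TB actually written to disk for one logical index."""
    return index_tb * solr_replicas * hdfs_replication

# A 10 TB index with a single SolrCloud replica on 3x HDFS already
# occupies 30 TB; three SolrCloud replicas would occupy 90 TB.
single = stored_tb(10.0, 1)   # 30.0
triple = stored_tb(10.0, 3)   # 90.0
```

This is the same arithmetic behind the "3 replicas on top of 3x HDFS replication seems excessive" observation elsewhere in this thread.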
Re: Solr on HDFS: AutoAddReplica does not add a replica
Hi Shawn, This is what I understand about how Solr works on HDFS. Please correct me if I am wrong. Although solr shard replication Factor = 1, HDFS default replication = 3. When the node goes down, the solr server running on that node goes down and hence the instance (core) representing the replica goes down. The data is on HDFS (distributed across all the datanodes of the hadoop cluster with 3X replication). This is the reason why I have kept replicationFactor=1. As per the link: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS One benefit to running Solr in HDFS is the ability to automatically add new replicas when the Overseer notices that a shard has gone down. Because the "gone" index shards are stored in HDFS, a new core will be created and the new core will point to the existing indexes in HDFS. This is the expected behavior of Solr overseer which I am not able to see. After a couple of hours a node was assigned to host the shard but the status of the shard is still "down" and the instance dir is missing on that node for that particular shard_replica. Thanks! On Wed, Jan 11, 2017 at 5:03 PM, Shawn Heisey wrote: > On 1/11/2017 1:47 PM, Chetas Joshi wrote: > > I have deployed a SolrCloud (solr 5.5.0) on hdfs using cloudera 5.4.7. > The > > cloud has 86 nodes. > > > > This is my config for the collection > > > > numShards=80 > > ReplicationFactor=1 > > maxShardsPerNode=1 > > autoAddReplica=true > > > > I recently decommissioned a node to resolve some disk issues. The shard > > that was being hosted on that host is now being shown as "gone" on the > solr > > admin UI. > > > > I got the cluster status using the collection API. It says > > shard: active, replica: down > > > > The overseer does not seem to be creating an extra core even though > > autoAddReplica=true ( > > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS). 
> > > > Is this happening because the overseer sees the shard as active as > > suggested by the cluster status? > > If yes, is "autoAddReplica" not reliable? should I add a replica for this > > shard when such cases arise? > > Your replicationFactor is one. When there's one replica, you have no > redundancy. If that replica goes down, the shard is completely gone. > > As I understand it (I've got no experience with HDFS at all), > autoAddReplicas is designed to automatically add replicas until > replicationFactor is satisfied. As already mentioned, your > replicationFactor is one. This means that it will always be satisfied. > > If autoAddReplicas were to kick in any time a replica went down, then > Solr would be busy adding replicas anytime you restarted a node ... > which would be a very bad idea. > > If your number of replicas is one, and that replica goes down, where > would Solr go to get the data to create another replica? The single > replica is down, so there's nothing to copy from. You might be thinking > "from the leader" ... but a leader is nothing more than a replica that > has been temporarily elected to have an extra job. A replicationFactor > of two doesn't mean a leader and two copies .. it means there are a > total of two replicas, one of which is elected leader. > > If you want autoAddReplicas to work, you're going to need to have a > replicationFactor of at least two, and you're probably going to have to > delete the dead replica before another will be created. > > Thanks, > Shawn > >
Re: Solr on HDFS: AutoAddReplica does not add a replica
On 1/11/2017 1:47 PM, Chetas Joshi wrote: > I have deployed a SolrCloud (solr 5.5.0) on hdfs using cloudera 5.4.7. The > cloud has 86 nodes. > > This is my config for the collection > > numShards=80 > ReplicationFactor=1 > maxShardsPerNode=1 > autoAddReplica=true > > I recently decommissioned a node to resolve some disk issues. The shard > that was being hosted on that host is now being shown as "gone" on the solr > admin UI. > > I got the cluster status using the collection API. It says > shard: active, replica: down > > The overseer does not seem to be creating an extra core even though > autoAddReplica=true ( > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS). > > Is this happening because the overseer sees the shard as active as > suggested by the cluster status? > If yes, is "autoAddReplica" not reliable? should I add a replica for this > shard when such cases arise? Your replicationFactor is one. When there's one replica, you have no redundancy. If that replica goes down, the shard is completely gone. As I understand it (I've got no experience with HDFS at all), autoAddReplicas is designed to automatically add replicas until replicationFactor is satisfied. As already mentioned, your replicationFactor is one. This means that it will always be satisfied. If autoAddReplicas were to kick in any time a replica went down, then Solr would be busy adding replicas anytime you restarted a node ... which would be a very bad idea. If your number of replicas is one, and that replica goes down, where would Solr go to get the data to create another replica? The single replica is down, so there's nothing to copy from. You might be thinking "from the leader" ... but a leader is nothing more than a replica that has been temporarily elected to have an extra job. A replicationFactor of two doesn't mean a leader and two copies ... it means there are a total of two replicas, one of which is elected leader. 
If you want autoAddReplicas to work, you're going to need to have a replicationFactor of at least two, and you're probably going to have to delete the dead replica before another will be created. Thanks, Shawn
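Shawn's suggestion maps to a Collections API CREATE call with replicationFactor=2. Below is a minimal sketch that only builds the request URL; the host, collection name, and shard counts are placeholders, and actually issuing the request requires a running SolrCloud:

```java
// Sketch: build a Collections API CREATE URL with replicationFactor=2,
// so that if one replica dies there is a surviving copy to recover from.
// Placeholder values throughout; not a drop-in command for any real cluster.
public class CreateCollectionUrl {

    public static String createUrl(String solrBase, String name,
                                   int numShards, int replicationFactor) {
        return solrBase + "/admin/collections?action=CREATE"
                + "&name=" + name
                + "&numShards=" + numShards
                + "&replicationFactor=" + replicationFactor
                + "&maxShardsPerNode=2"       // must allow room for the extra replicas
                + "&autoAddReplicas=true";
    }

    public static void main(String[] args) {
        // Prints the URL you would GET against one of the Solr nodes.
        System.out.println(createUrl("http://localhost:8983/solr", "mycollection", 80, 2));
    }
}
```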
Solr on HDFS: AutoAddReplica does not add a replica
Hello,

I have deployed a SolrCloud (Solr 5.5.0) on HDFS using Cloudera 5.4.7. The cloud has 86 nodes. This is my config for the collection:

numShards=80
replicationFactor=1
maxShardsPerNode=1
autoAddReplicas=true

I recently decommissioned a node to resolve some disk issues. The shard that was hosted on that node is now shown as "gone" on the Solr admin UI. I got the cluster status using the Collections API; it says shard: active, replica: down.

The overseer does not seem to be creating an extra core even though autoAddReplicas=true (https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS).

Is this happening because the overseer sees the shard as active, as suggested by the cluster status? If yes, is autoAddReplicas not reliable? Should I add a replica for this shard myself when such cases arise?

Thanks!
Re: Solr on HDFS: Streaming API performance tuning
I took another look at the stack trace and I'm pretty sure the issue is NULL values in one of the sort fields: the null pointer occurs during the comparison of sort values. See line 85 of:
https://github.com/apache/lucene-solr/blob/branch_5_5/solr/solrj/src/java/org/apache/solr/client/solrj/io/comp/FieldComparator.java

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Dec 19, 2016 at 4:43 PM, Chetas Joshi wrote:
> Hi Joel,
>
> I don't have any Solr documents that have NULL values for the sort fields I use in my queries.
>
> Thanks!
>
> [stack trace and earlier quoted messages snipped]
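The failure mode Joel describes, a comparator hitting a missing sort value, can be reproduced outside Solr. This is not Solr's FieldComparator, just a toy illustration of the NPE and of a nulls-last workaround:

```java
import java.util.Arrays;
import java.util.Comparator;

public class NullSortDemo {
    // Naive comparator, like the 5.x behavior Joel describes:
    // throws NullPointerException as soon as a sort value is null.
    static final Comparator<Long> NAIVE = (a, b) -> a.compareTo(b);

    // Null-safe variant: documents with no value for the sort field sort last.
    static final Comparator<Long> NULLS_LAST = Comparator.nullsLast(Comparator.naturalOrder());

    // Returns true if sorting with the naive comparator blows up on these values.
    public static boolean naiveThrows(Long[] vals) {
        try {
            Arrays.sort(vals.clone(), NAIVE);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static Long[] sortNullsLast(Long[] vals) {
        Long[] copy = vals.clone();
        Arrays.sort(copy, NULLS_LAST);
        return copy;
    }

    public static void main(String[] args) {
        Long[] vals = {3L, null, 1L};  // one document is missing the sort field
        System.out.println(naiveThrows(vals));                     // true
        System.out.println(Arrays.toString(sortNullsLast(vals)));  // [1, 3, null]
    }
}
```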
Re: Solr on HDFS: Streaming API performance tuning
Hi Joel,

I don't have any Solr documents that have NULL values for the sort fields I use in my queries.

Thanks!

On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein wrote:
> Ok, based on the stack trace I suspect one of your sort fields has NULL values, which in the 5x branch could produce null pointers if a segment had no values for a sort field. This is also fixed in the Solr 6x branch.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> [stack trace and earlier quoted messages snipped]
Re: Solr on HDFS: Streaming API performance tuning
Ok, based on the stack trace I suspect one of your sort fields has NULL values, which in the 5x branch could produce null pointers if a segment had no values for a sort field. This is also fixed in the Solr 6x branch.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi wrote:
> Here is the stack trace.
>
> [stack trace and earlier quoted messages snipped]
Re: Solr on HDFS: Streaming API performance tuning
Here is the stack trace.

java.lang.NullPointerException
    at org.apache.solr.client.solrj.io.comp.FieldComparator$2.compare(FieldComparator.java:85)
    at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:92)
    at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:30)
    at org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:45)
    at org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:33)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:396)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:381)
    at java.util.TreeMap.put(TreeMap.java:560)
    at java.util.TreeSet.add(TreeSet.java:255)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:366)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)
    at *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.scala:101)
    at java.lang.Thread.run(Thread.java:745)

16/11/17 13:04:31 *ERROR* SolrStreamResultIterator: missing exponent number: char=A,position=106596
BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'

org.noggit.JSONParser$ParseException: missing exponent number: char=A,position=106596
BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
    at org.noggit.JSONParser.err(JSONParser.java:356)
    at org.noggit.JSONParser.readExp(JSONParser.java:513)
    at org.noggit.JSONParser.readNumber(JSONParser.java:419)
    at org.noggit.JSONParser.next(JSONParser.java:845)
    at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
    at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
    at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
    at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
    at org.apache.solr.client.solrj.io.stream.JSONTupleStream.next(JSONTupleStream.java:84)
    at org.apache.solr.client.solrj.io.stream.SolrStream.read(SolrStream.java:147)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.next(CloudSolrStream.java:413)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:365)
    at org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)

Thanks!

On Fri, Dec 16, 2016 at 11:45 PM, Reth RM wrote:
> If you could provide the JSON parse exception stack trace, it might help to predict the issue there.
>
> [earlier quoted messages snipped]
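As an aside, the "missing exponent number" ParseException above is the parser hitting a number token like `6EA`: after the exponent marker E it expects an optional sign and at least one digit, not a letter. A simplified check of that rule (a hand-rolled regex covering integers only, not noggit's actual code):

```java
public class ExponentCheck {
    // Simplified JSON-style number rule: optional minus, digits, then an
    // optional exponent part that must contain at least one digit.
    // (Real JSON also allows a fraction part; omitted here for brevity.)
    public static boolean validNumber(String s) {
        return s.matches("-?\\d+([eE][+-]?\\d+)?");
    }

    public static void main(String[] args) {
        System.out.println(validNumber("1477189323")); // true
        System.out.println(validNumber("6E2"));        // true
        System.out.println(validNumber("6EA"));        // false: 'A' where exponent digits expected
    }
}
```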
Re: Solr on HDFS: Streaming API performance tuning
If you could provide the JSON parse exception stack trace, it might help to predict the issue there.

On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi wrote:
> Hi Joel,
>
> The only non-alphanumeric characters I have in my data are '+' and '/'. I don't have any backslashes.
>
> [rest of message and earlier quoted messages snipped]
Re: Solr on HDFS: Streaming API performance tuning
Hi Joel,

The only non-alphanumeric characters I have in my data are '+' and '/'; I don't have any backslashes.

If special characters were the issue, I would get the JSON parsing exceptions every time, irrespective of the index size and of the available memory on the machine. That is not the case here: the streaming API successfully returns all the documents when the index is small enough to fit in the available memory. That's why I am confused.

Thanks!

On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein wrote:
> The Streaming API may have been throwing exceptions because the JSON special characters were not escaped. This was fixed in Solr 6.0.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> [original question snipped]
Re: Solr on HDFS: Streaming API performance tuning
The Streaming API may have been throwing exceptions because the JSON special characters were not escaped. This was fixed in Solr 6.0.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi wrote:
> Hello,
>
> I am running Solr 5.5.0. It is a SolrCloud of 50 nodes ...
>
> [rest of original question snipped]
Solr on HDFS: Streaming API performance tuning
Hello,

I am running Solr 5.5.0. It is a SolrCloud of 50 nodes and I have the following config for all the collections:

maxShardsPerNode: 1
replicationFactor: 1

I was using the Streaming API to get results back from Solr. It worked fine for a while, until the index size grew beyond 40 GB per shard (i.e. per node); then it started throwing JSON parsing exceptions while reading the TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the same boxes the Solr shards run on. Spark jobs also use a lot of disk cache, so the free disk cache on the boxes varies a lot depending on what else is running.

Due to this issue I moved to the cursor approach, and it works fine, but as we all know it is much slower than the streaming approach.

Currently the index size per shard is 80 GB (the machine has 512 GB of RAM, used by different services/programs: heap/off-heap and the disk cache requirements).

When I have enough RAM available on the machine (more than 80 GB, so that all the index data can fit in memory), the streaming API succeeds without running into any exceptions.

Questions:
How different is the index data caching mechanism (for HDFS) for the Streaming API from the cursorMark approach?
Why does the cursor work every time, while streaming works only when there is a lot of free disk cache?

Thank you.
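The cursor approach mentioned above pages through a sorted result set by handing the last sort position back to the server on each request. A toy in-memory sketch of that contract (not Solr's cursorMark implementation, which encodes the last document's sort values as an opaque string):

```java
import java.util.ArrayList;
import java.util.List;

public class CursorDemo {
    // Toy "index": document ids already sorted by the sort field.
    static final long[] IDS = {2, 3, 5, 7, 11, 13};

    // One page of up to `rows` documents strictly after the given mark.
    // In real cursorMark paging, the mark is opaque; here it is just the last id seen.
    public static List<Long> page(long afterMark, int rows) {
        List<Long> out = new ArrayList<>();
        for (long id : IDS) {
            if (id > afterMark && out.size() < rows) out.add(id);
        }
        return out;
    }

    // Deep-paging loop: keep requesting pages until the cursor stops yielding results.
    public static List<Long> fetchAll(int rows) {
        List<Long> all = new ArrayList<>();
        long mark = Long.MIN_VALUE;           // analogous to cursorMark=*
        while (true) {
            List<Long> p = page(mark, rows);
            if (p.isEmpty()) break;           // cursor stopped advancing -> done
            all.addAll(p);
            mark = p.get(p.size() - 1);       // next request resumes after the last sort value
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(4)); // [2, 3, 5, 7, 11, 13]
    }
}
```

Each page is an independent request, which is part of why the cursor is slower than a continuously pulled TupleStream but also less sensitive to how much of the index is cached at once.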
Re: Solr on HDFS: increase in query time with increase in data
On 12/16/2016 11:58 AM, Chetas Joshi wrote:
> How different is the index data caching mechanism for the Streaming API from the cursor approach?

Solr and Lucene do not handle that caching. Systems external to Solr (like the OS, or HDFS) handle the caching. The cache effectiveness will be a combination of the cache size, overall data size, and the data access patterns of the application.

I do not know enough to tell you how the cursorMark feature and the streaming API work when they access the index data. I would imagine them to be pretty similar, but cannot be sure about that.

Thanks,
Shawn
Re: Solr on HDFS: increase in query time with increase in data
Thank you everyone. I would add nodes to the SolrCloud and split the shards. Shawn, Thank you for explaining why putting index data on local file system could be a better idea than using HDFS. I need to find out how HDFS caches the index files in a resource constrained environment. I would also like to add that when I try the Streaming API instead of using the cursor approach, it starts running into JSON parsing exceptions when my nodes (running Solr shards) don't have enough RAM to fit the entire index into memory. FYI: I have other services (Yarn, Spark) deployed on the same boxes as well. Spark jobs also use a lot of disk cache. When I have enough RAM (more than 70 GB so that all the index data could fit in memory), the streaming API succeeds without running into any exceptions. How different the index data caching mechanism is for the Streaming API from the cursor approach? Thanks! On Fri, Dec 16, 2016 at 6:52 AM, Shawn Heisey wrote: > On 12/14/2016 11:58 AM, Chetas Joshi wrote: > > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have > > the following config. > > maxShardsperNode: 1 > > replicationFactor: 1 > > > > I have been ingesting data into Solr for the last 3 months. With increase > > in data, I am observing increase in the query time. Currently the size of > > my indices is 70 GB per shard (i.e. per node). > > Query times will increase as the index size increases, but significant > jumps in the query time may be an indication of a performance problem. > Performance problems are usually caused by insufficient resources, > memory in particular. > > With HDFS, I am honestly not sure *where* the cache memory is needed. I > would assume that it's needed on the HDFS hosts, that a lot of spare > memory on the Solr (HDFS client) hosts probably won't make much > difference. I could be wrong -- I have no idea what kind of caching > HDFS does. If the HDFS client can cache data, then you probably would > want extra memory on the Solr machines. 
> > > I am using cursor approach (/export handler) using SolrJ client to get > back > > results from Solr. All the fields I am querying on and all the fields > that > > I get back from Solr are indexed and have docValues enabled as well. What > > could be the reason behind increase in query time? > > If actual disk access is required to satisfy a query, Solr is going to > be slow. Caching is absolutely required for good performance. If your > query times are really long but used to be short, chances are that your > index size has exceeded your system's ability to cache it effectively. > > One thing to keep in mind: Gigabit Ethernet is comparable in speed to > the sustained transfer rate of a single modern SATA magnetic disk, so if > the data has to traverse a gigabit network, it probably will be nearly > as slow as it would be if it were coming from a single disk. Having a > 10gig network for your storage is probably a good idea ... but current > fast memory chips can leave 10gig in the dust, so if the data can come > from cache and the chips are new enough, then it can be faster than > network storage. > > Because the network can be a potential bottleneck, I strongly recommend > putting index data on local disks. If you have enough memory, the disk > doesn't even need to be super-fast. > > > Has this got something to do with the OS disk cache that is used for > > loading the Solr indices? When a query is fired, will Solr wait for all > > (70GB) of disk cache being available so that it can load the index file? > > Caching the files on the disk is not handled by Solr, so Solr won't wait > for the entire index to be cached unless the underlying storage waits > for some reason. The caching is usually handled by the OS. For HDFS, > it might be handled by a combination of the OS and Hadoop, but I don't > know enough about HDFS to comment. Solr makes a request for the parts > of the index files that it needs to satisfy the request. 
If the > underlying system is capable of caching the data, if that feature is > enabled, and if there's memory available for that purpose, then it gets > cached. > > Thanks, > Shawn > >
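For contrast with the cursor approach discussed above, the Streaming API is driven by a streaming expression posted to the /stream handler. A minimal sketch, assuming a hypothetical host `localhost:8983`, collection `collection1`, and field `id`; the script only assembles and prints the request, and the commented curl line shows the actual call:

```shell
# Hypothetical names: adjust host, collection, and fields to your cluster.
SOLR="http://localhost:8983/solr"
COLLECTION="collection1"
# A streaming expression that pulls the full sorted result set through /export:
EXPR='search(collection1, q="*:*", fl="id", sort="id asc", qt="/export")'
# Actual invocation (requires a running SolrCloud):
#   curl --data-urlencode "expr=${EXPR}" "${SOLR}/${COLLECTION}/stream"
echo "POST ${SOLR}/${COLLECTION}/stream expr=${EXPR}"
```

Like /export, the underlying search here reads docValues, so the memory-pressure behavior described above applies to both paths.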
Re: Solr on HDFS: increase in query time with increase in data
On 12/14/2016 11:58 AM, Chetas Joshi wrote: > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have > the following config. > maxShardsperNode: 1 > replicationFactor: 1 > > I have been ingesting data into Solr for the last 3 months. With increase > in data, I am observing increase in the query time. Currently the size of > my indices is 70 GB per shard (i.e. per node). Query times will increase as the index size increases, but significant jumps in the query time may be an indication of a performance problem. Performance problems are usually caused by insufficient resources, memory in particular. With HDFS, I am honestly not sure *where* the cache memory is needed. I would assume that it's needed on the HDFS hosts, that a lot of spare memory on the Solr (HDFS client) hosts probably won't make much difference. I could be wrong -- I have no idea what kind of caching HDFS does. If the HDFS client can cache data, then you probably would want extra memory on the Solr machines. > I am using cursor approach (/export handler) using SolrJ client to get back > results from Solr. All the fields I am querying on and all the fields that > I get back from Solr are indexed and have docValues enabled as well. What > could be the reason behind increase in query time? If actual disk access is required to satisfy a query, Solr is going to be slow. Caching is absolutely required for good performance. If your query times are really long but used to be short, chances are that your index size has exceeded your system's ability to cache it effectively. One thing to keep in mind: Gigabit Ethernet is comparable in speed to the sustained transfer rate of a single modern SATA magnetic disk, so if the data has to traverse a gigabit network, it probably will be nearly as slow as it would be if it were coming from a single disk. Having a 10gig network for your storage is probably a good idea ... 
but current fast memory chips can leave 10gig in the dust, so if the data can come from cache and the chips are new enough, then it can be faster than network storage. Because the network can be a potential bottleneck, I strongly recommend putting index data on local disks. If you have enough memory, the disk doesn't even need to be super-fast. > Has this got something to do with the OS disk cache that is used for > loading the Solr indices? When a query is fired, will Solr wait for all > (70GB) of disk cache being available so that it can load the index file? Caching the files on the disk is not handled by Solr, so Solr won't wait for the entire index to be cached unless the underlying storage waits for some reason. The caching is usually handled by the OS. For HDFS, it might be handled by a combination of the OS and Hadoop, but I don't know enough about HDFS to comment. Solr makes a request for the parts of the index files that it needs to satisfy the request. If the underlying system is capable of caching the data, if that feature is enabled, and if there's memory available for that purpose, then it gets cached. Thanks, Shawn
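Shawn's gigabit-versus-disk comparison can be sanity-checked with quick arithmetic (a rough sketch; real-world throughput varies with protocol overhead):

```shell
# Gigabit Ethernet moves 1000 megabits per second; 8 bits per byte.
GBE_MBIT=1000
MB_PER_S=$((GBE_MBIT / 8))
echo "1 GbE ~ ${MB_PER_S} MB/s"
# A single modern SATA magnetic disk sustains very roughly 100-150 MB/s,
# so one gigabit link and one local spinning disk are in the same ballpark,
# while data served from RAM cache is orders of magnitude faster.
```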
Re: Solr on HDFS: increase in query time with increase in data
I think 70 GB is too large for a shard. How much memory does the system have? In case Solr does not have sufficient memory to load the indexes, it will use only the amount of memory defined in your Solr caches. Although you are on HDFS, Solr performance will be really bad if it has to do disk I/O at query time. The best option for you is to shard it into at least 8-10 nodes and create appropriate replicas according to your read traffic.

Regards,
Piyush

On Fri, Dec 16, 2016 at 12:15 PM, Reth RM wrote: > I think the shard index size is huge and should be split. > > On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi > wrote: > > > Hi everyone, > > > > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have > > the following config. > > maxShardsperNode: 1 > > replicationFactor: 1 > > > > I have been ingesting data into Solr for the last 3 months. With increase > > in data, I am observing increase in the query time. Currently the size of > > my indices is 70 GB per shard (i.e. per node). > > > > I am using cursor approach (/export handler) using SolrJ client to get > back > > results from Solr. All the fields I am querying on and all the fields > that > > I get back from Solr are indexed and have docValues enabled as well. What > > could be the reason behind increase in query time? > > > > Has this got something to do with the OS disk cache that is used for > > loading the Solr indices? When a query is fired, will Solr wait for all > > (70GB) of disk cache being available so that it can load the index file? > > > > Thanks! > > >
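Splitting an oversized shard is done through the Collections API SPLITSHARD action (the same Solr 4.3+ feature referenced elsewhere in this thread). A sketch with hypothetical host, collection, and shard names; the script only assembles and prints the URL, and the commented line shows the real call:

```shell
# Hypothetical names; adjust to your cluster.
SOLR="http://localhost:8983/solr"
COLLECTION="collection1"
SHARD="shard1"
URL="${SOLR}/admin/collections?action=SPLITSHARD&collection=${COLLECTION}&shard=${SHARD}"
# Actual invocation (long-running on a large shard; the parent shard's data
# is rewritten into two sub-shards, so watch free disk space):
#   curl "$URL"
echo "$URL"
```

After the split completes and the sub-shards become active, the parent shard can be deleted with the DELETESHARD action.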
Re: Solr on HDFS: increase in query time with increase in data
I think the shard index size is huge and should be split.

On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi wrote: > Hi everyone, > > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have > the following config. > maxShardsperNode: 1 > replicationFactor: 1 > > I have been ingesting data into Solr for the last 3 months. With increase > in data, I am observing increase in the query time. Currently the size of > my indices is 70 GB per shard (i.e. per node). > > I am using cursor approach (/export handler) using SolrJ client to get back > results from Solr. All the fields I am querying on and all the fields that > I get back from Solr are indexed and have docValues enabled as well. What > could be the reason behind increase in query time? > > Has this got something to do with the OS disk cache that is used for > loading the Solr indices? When a query is fired, will Solr wait for all > (70GB) of disk cache being available so that it can load the index file? > > Thanks! >
Solr on HDFS: increase in query time with increase in data
Hi everyone,

I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have the following config.
maxShardsPerNode: 1
replicationFactor: 1

I have been ingesting data into Solr for the last 3 months. With the increase in data, I am observing an increase in query time. Currently the size of my indices is 70 GB per shard (i.e. per node).

I am using the cursor approach (/export handler) via the SolrJ client to get back results from Solr. All the fields I am querying on and all the fields that I get back from Solr are indexed and have docValues enabled as well. What could be the reason behind the increase in query time?

Has this got something to do with the OS disk cache that is used for loading the Solr indices? When a query is fired, will Solr wait for all (70 GB) of disk cache to be available so that it can load the index file?

Thanks!
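For reference, the /export handler mentioned in the question can be queried directly over HTTP as well as through SolrJ. A minimal sketch with hypothetical host, collection, and field names (every field in `fl` and `sort` must have docValues); the script only assembles and prints the URL, and the commented line shows the real call:

```shell
# Hypothetical names; adjust to your cluster and schema.
SOLR="http://localhost:8983/solr"
COLLECTION="collection1"
QUERY='q=*:*&sort=id+asc&fl=id'
URL="${SOLR}/${COLLECTION}/export?${QUERY}"
# Actual invocation (streams the entire sorted result set as JSON):
#   curl "$URL"
echo "$URL"
```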
Re: Solr on HDFS: adding a shard replica
The core_node name is largely irrelevant; you should have more descriptive names in the state.json file, like collection1_shard1_replica1. You happen to see 19 because you have only one replica per shard.

Exactly how are you creating the replica? What version of Solr? If you're using the "core admin" UI, it's tricky to get right. I'd strongly recommend using the Collections API ADDREPLICA command, see: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica

Best,
Erick

On Tue, Sep 13, 2016 at 7:11 PM, Chetas Joshi wrote: > Is this happening because I have set replicationFactor=1? > So even if I manually add replica for the shard that's down, it will just > create a dataDir but would not copy any of the data into the dataDir? > > On Tue, Sep 13, 2016 at 6:07 PM, Chetas Joshi > wrote: > >> Hi, >> >> I just started experimenting with solr cloud. >> >> I have a solr cloud of 20 nodes. I have one collection with 18 shards >> running on 18 different nodes with replication factor=1. >> >> When one of my shards goes down, I create a replica using the Solr UI. On >> HDFS I see a core getting added. But the data (index table and tlog) >> information does not get copied over to that directory. For example, on >> HDFS I have >> >> /solr/collection/core_node_1/data/index >> /solr/collection/core_node_1/data/tlog >> >> when I create a replica of a shard, it creates >> >> /solr/collection/core_node_19/data/index >> /solr/collection/core_node_19/data/tlog >> >> (core_node_19 as I already have 18 shards for the collection). The issue >> is both my folders core_node_19/data/index and core_node_19/data/tlog are >> empty. Data does not get copied over from core_node_1/data/index and >> core_node_1/data/tlog. >> >> I need to remove core_node_1 and just keep core_node_19 (the replica). Why >> the data is not getting copied over? Do I need to manually move all the >> data from one folder to the other? >> >> Thank you, >> Chetas. >> >>
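The Collections API ADDREPLICA call recommended above looks like this. A sketch with hypothetical host, collection, and shard names; the script only assembles and prints the URL, and the commented line shows the real call:

```shell
# Hypothetical names; adjust to your cluster. The "node" parameter is
# optional -- Solr picks a node if it is omitted.
SOLR="http://localhost:8983/solr"
COLLECTION="collection1"
SHARD="shard1"
URL="${SOLR}/admin/collections?action=ADDREPLICA&collection=${COLLECTION}&shard=${SHARD}"
# Actual invocation (the new replica then recovers its index from the
# shard leader):
#   curl "$URL"
echo "$URL"
```

Note that with replicationFactor=1 and the shard's only core down, there is no live leader for the new replica to recover from, which matches the empty data directories described in this thread.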
Re: Solr on HDFS: adding a shard replica
Is this happening because I have set replicationFactor=1? So even if I manually add replica for the shard that's down, it will just create a dataDir but would not copy any of the data into the dataDir? On Tue, Sep 13, 2016 at 6:07 PM, Chetas Joshi wrote: > Hi, > > I just started experimenting with solr cloud. > > I have a solr cloud of 20 nodes. I have one collection with 18 shards > running on 18 different nodes with replication factor=1. > > When one of my shards goes down, I create a replica using the Solr UI. On > HDFS I see a core getting added. But the data (index table and tlog) > information does not get copied over to that directory. For example, on > HDFS I have > > /solr/collection/core_node_1/data/index > /solr/collection/core_node_1/data/tlog > > when I create a replica of a shard, it creates > > /solr/collection/core_node_19/data/index > /solr/collection/core_node_19/data/tlog > > (core_node_19 as I already have 18 shards for the collection). The issue > is both my folders core_node_19/data/index and core_node_19/data/tlog are > empty. Data does not get copied over from core_node_1/data/index and > core_node_1/data/tlog. > > I need to remove core_node_1 and just keep core_node_19 (the replica). Why > the data is not getting copied over? Do I need to manually move all the > data from one folder to the other? > > Thank you, > Chetas. > >
Solr on HDFS: adding a shard replica
Hi, I just started experimenting with solr cloud. I have a solr cloud of 20 nodes. I have one collection with 18 shards running on 18 different nodes with replication factor=1. When one of my shards goes down, I create a replica using the Solr UI. On HDFS I see a core getting added. But the data (index table and tlog) information does not get copied over to that directory. For example, on HDFS I have /solr/collection/core_node_1/data/index /solr/collection/core_node_1/data/tlog when I create a replica of a shard, it creates /solr/collection/core_node_19/data/index /solr/collection/core_node_19/data/tlog (core_node_19 as I already have 18 shards for the collection). The issue is both my folders core_node_19/data/index and core_node_19/data/tlog are empty. Data does not get copied over from core_node_1/data/index and core_node_1/data/tlog. I need to remove core_node_1 and just keep core_node_19 (the replica). Why the data is not getting copied over? Do I need to manually move all the data from one folder to the other? Thank you, Chetas.
Re: Solr on HDFS in a Hadoop cluster
Thanks a lot Otis,

While reading the SolrCloud documentation to understand how SolrCloud could run on HDFS, I got confused by leader, replica, "non-replica" shards, core, index, and collections. The documentation first specifies that one cannot add shards, then that one can add replica-only shards, and then the last "Shard Splitting" paragraph states that something changed starting with Solr 4.3. But it doesn't state that splitting shards can end in a new non-replica shard, on a just-added node, thus increasing the amount of storage available to the index / collection. It states that the "split action effectively makes two copies of the data as new shards" instead, which sounds a lot like replica-style shards. So does it?

Could there be some sort of tutorial describing how to add available storage capacity for an index / collection, thus adding a node / shard - core that one can send new documents to be indexed? (Of course, load-balancing would be triggered, so it looks like documents would be added to shards out of a set of nodes.)

Thanks,
Charles VALLEE
Centre de compétence Big data EDF – DSP - CSP IT-O DATACENTER - Expertise en Energie Informatique (EEI) 32 avenue Pablo Picasso 92000 Nanterre charles.val...@edf.fr Tél. : + (0) 1 78 66 69 81
A simple gesture for the environment: print this message only if you need to.
De :otis.gospodne...@gmail.com A : solr-user@lucene.apache.org Date : 06/01/2015 18:55 Objet : Re: Solr on HDFS in a Hadoop cluster Oh, and https://issues.apache.org/jira/browse/SOLR-6743 Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi Charles, > > See http://search-lucene.com/?q=solr+hdfs and > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS > > Otis > -- > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > Solr & Elasticsearch Support * http://sematext.com/ > > > On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE > wrote: > >> I am considering using *Solr* to extend *Hortonworks Data Platform* >> capabilities to search. >> >> - I found tutorials to index documents into a Solr instance from *HDFS*, >> but I guess this solution would require a Solr cluster distinct to the >> Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop >> cluster instead? - *With the index stored in HDFS?* >> >> - Where would the processing take place (could it be handed down to >> Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to >> integrate with *Yarn*? >> >> - What about *SolrCloud*: what does it bring regarding Hadoop based >> use-cases? Does it stand for a Solr-only cluster? >> >> - Well, if that could lead to something working with a roles-based >> authorization-compliant *Banana*, it would be Christmass again! >> >> Thanks a lot for any help! >> >> Charles >> Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à l'intention exclusive des destinataires et les informations qui y figurent sont strictement confidentielles. Toute utilisation de ce Message non conforme à sa destination, toute diffusion ou toute publication totale ou partielle, est interdite sauf autorisation expresse. 
Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si vous avez reçu ce Message par erreur, merci de le supprimer de votre système, ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support que ce soit. Nous vous remercions également d'en avertir immédiatement l'expéditeur par retour du message. Il est impossible de garantir que les communications par messagerie électronique arrivent en temps utile, sont sécurisées ou dénuées de toute erreur ou virus. This message and any attachments (the 'Message') are intended solely for the addressees. The information contained in this Message is confidential. Any use of information contained in this Message not in accord with its purpose, any dissemination or disclosure, either whole or partial, is prohibited except formal approval. If you are not the addressee, you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete it and all copies from your system and notify the sender immediately by return message. E-mail communication cannot be guaranteed to be timely secure, error or virus-free.
Re: Solr on HDFS in a Hadoop cluster
Oh, and https://issues.apache.org/jira/browse/SOLR-6743 Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi Charles, > > See http://search-lucene.com/?q=solr+hdfs and > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS > > Otis > -- > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > Solr & Elasticsearch Support * http://sematext.com/ > > > On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE > wrote: > >> I am considering using *Solr* to extend *Hortonworks Data Platform* >> capabilities to search. >> >> - I found tutorials to index documents into a Solr instance from *HDFS*, >> but I guess this solution would require a Solr cluster distinct to the >> Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop >> cluster instead? - *With the index stored in HDFS?* >> >> - Where would the processing take place (could it be handed down to >> Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to >> integrate with *Yarn*? >> >> - What about *SolrCloud*: what does it bring regarding Hadoop based >> use-cases? Does it stand for a Solr-only cluster? >> >> - Well, if that could lead to something working with a roles-based >> authorization-compliant *Banana*, it would be Christmass again! >> >> Thanks a lot for any help! >> >> Charles
Re: Solr on HDFS in a Hadoop cluster
Hi Charles, See http://search-lucene.com/?q=solr+hdfs and https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE wrote: > I am considering using *Solr* to extend *Hortonworks Data Platform* > capabilities to search. > > - I found tutorials to index documents into a Solr instance from *HDFS*, > but I guess this solution would require a Solr cluster distinct to the > Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop > cluster instead? - *With the index stored in HDFS?* > > - Where would the processing take place (could it be handed down to > Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to > integrate with *Yarn*? > > - What about *SolrCloud*: what does it bring regarding Hadoop based > use-cases? Does it stand for a Solr-only cluster? > > - Well, if that could lead to something working with a roles-based > authorization-compliant *Banana*, it would be Christmass again! > > Thanks a lot for any help! > > Charles
Solr on HDFS in a Hadoop cluster
I am considering using Solr to extend Hortonworks Data Platform capabilities to search.

- I found tutorials to index documents into a Solr instance from HDFS, but I guess this solution would require a Solr cluster distinct from the Hadoop cluster. Is it possible to have Solr integrated into the Hadoop cluster instead - with the index stored in HDFS?

- Where would the processing take place (could it be handed down to Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to integrate with Yarn?

- What about SolrCloud: what does it bring regarding Hadoop-based use-cases? Does it stand for a Solr-only cluster?

- Well, if that could lead to something working with a roles-based authorization-compliant Banana, it would be Christmas again!

Thanks a lot for any help!

Charles
Re: SOLR on hdfs
Hi all,

I am new to Solr and HDFS. I am trying to index text content extracted from binary files like PDF, MS Office, etc., which are stored on HDFS (single node). So far I have Solr running on HDFS and have created the core, but I couldn't send the files to Solr for indexing. Can someone please help me do that?

Thanks

-- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-on-hdfs-tp4045128p4146049.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR on hdfs
Hi Joseph, I believe Nutch can index into Solr/SolrCloud just fine. Sounds like that is the approach you should take. Otis -- Solr & ElasticSearch Support http://sematext.com/ On Thu, Mar 7, 2013 at 12:10 AM, Joseph Lim wrote: > Hi Amit, > > Currently I am designing a Learning Management System where it is based on > Hadoop and hbase . Right now I want to integrate nutch with solr in it as > part of crawler module, so that users will only be able to search relevant > documents from specific source. And since crawling and indexing takes so > much of the time, (might be 5 to 6 hours ~ 5gb) so hope that if there is > anything happen to the server, there will be replicates to back it up. > > I just saw what solrcloud can do but will need to check out if nutch is > able to work with it. Not knowing of other constraints I will encounter, so > was asking if I can just output the solr dir into a hdfs in the first > place. > > Cheers. > > On Thursday, March 7, 2013, Amit Nithian wrote: > > > Joseph, > > > > Doing what Otis said will do literally what you want which is copying the > > index to HDFS. It's no different than copying it to a different machine > > which btw is what Solr's master/slave replication scheme does. > > Alternatively, I think people are starting to setup new Solr instances > with > > SolrCloud which doesn't have the concept of master/slave but rather a > > series of nodes with the option of having replicas (what I believe to be > > backup nodes) so that you have the redundancy you want. > > > > Honestly HDFS in the way that you are looking for is probably no > different > > than storing your solr index in a RAIDed storage format but I don't > > pretend to know much about RAID arrays. > > > > What exactly are you trying to achieve from a systems perspective? Why do > > you want Hadoop in the mix here and how does copying the index to HDFS > help > > you? 
If SolrCloud seems complicated try just setting up a simple > > master/slave replication scheme for that's really easy. > > > > Cheers > > Amit > > > > > > On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim wrote: > > > > > Hi Amit, > > > > > > so you mean that if I just want to get redundancy for solr in hdfs, the > > > only best way to do it is to as per what Otis suggested using the > > following > > > command > > > > > > hadoop fs -copyFromLocal URI > > > > > > Ok let me try out solrcloud as I will need to make sure it works well > > with > > > nutch too.. > > > > > > Thanks for the help.. > > > > > > > > > On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian > wrote: > > > > > > > Why wouldn't SolrCloud help you here? You can setup shards and > replicas > > > etc > > > > to have redundancy b/c HDFS isn't designed to serve real time queries > > as > > > > far as I understand. If you are using HDFS as a backup mechanism to > me > > > > you'd be better served having multiple slaves tethered to a master > (in > > a > > > > non-cloud environment) or setup SolrCloud either option would give > you > > > more > > > > redundancy than copying an index to HDFS. > > > > > > > > - Amit > > > > > > > > > > > > On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim > wrote: > > > > > > > > > Hi Upayavira, > > > > > > > > > > sure, let me explain. I am setting up Nutch and SOLR in hadoop > > > > environment. > > > > > Since I am using hdfs, in the event if there is any crashes to the > > > > > localhost(running solr), i will still have the shards of data being > > > > stored > > > > > in hdfs. > > > > > > > > > > Thanks you so much =) > > > > > > > > > > On Thu, Mar 7, 2013 at 1:19 AM, Upayavira wrote: > > > > > > > > > > > What are you actually trying to achieve? If you can share what > you > > > are > > > > > > trying to achieve maybe folks can help you find the right way to > do > > > it. 
> > > > > > > > > > > > Upayavira > > > > > > > > > > > > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote: > > > > > > > Hello Otis , > > > > > > > > > > > > > > Is there any configuration where it will index into hdfs > instead? > > > > > > > > > > > > > > I tried crawlzilla and lily but I hope to update specific > > package > > > > such > > > > > > > as > > > > > > > Hadoop only or nutch only when there are updates. > > > > > > > > > > > > > > That's y would prefer to install separately . > > > > > > > > > > > > > > Thanks so much. Looking forward for your reply. > > > > > > > > > > > > > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote: > > > > > > > > > > > > > > > Hello Joseph, > > > > > > > > > > > > > > > > You can certainly put them there, as in: > > > > > > > > hadoop fs -copyFromLocal URI > > > > > > > > > > > > > > > > But searching such an index will be slow. > > > > > > > > See also: http://katta.sourceforge.net/ > > > > > > > > > > > > > > > > Otis > > > > > > > > -- > > > > > > > > Solr & ElasticSearch Support > > > > > > > > http://sematext.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar
Re: SOLR on hdfs
Hi Amit,

Currently I am designing a Learning Management System which is based on Hadoop and HBase. Right now I want to integrate Nutch with Solr in it as part of the crawler module, so that users will only be able to search relevant documents from a specific source. And since crawling and indexing take so much time (maybe 5 to 6 hours for ~5 GB), I hope that if anything happens to the server, there will be replicas to back it up.

I just saw what SolrCloud can do but will need to check whether Nutch is able to work with it. Not knowing what other constraints I will encounter, I was asking if I can just output the Solr dir into HDFS in the first place.

Cheers.

On Thursday, March 7, 2013, Amit Nithian wrote: > Joseph, > > Doing what Otis said will do literally what you want which is copying the > index to HDFS. It's no different than copying it to a different machine > which btw is what Solr's master/slave replication scheme does. > Alternatively, I think people are starting to setup new Solr instances with > SolrCloud which doesn't have the concept of master/slave but rather a > series of nodes with the option of having replicas (what I believe to be > backup nodes) so that you have the redundancy you want. > > Honestly HDFS in the way that you are looking for is probably no different > than storing your solr index in a RAIDed storage format but I don't > pretend to know much about RAID arrays. > > What exactly are you trying to achieve from a systems perspective? Why do > you want Hadoop in the mix here and how does copying the index to HDFS help > you? If SolrCloud seems complicated try just setting up a simple > master/slave replication scheme for that's really easy. 
> > Cheers > Amit > > > On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim wrote: > > > Hi Amit, > > > > so you mean that if I just want to get redundancy for solr in hdfs, the > > only best way to do it is to as per what Otis suggested using the > following > > command > > > > hadoop fs -copyFromLocal URI > > > > Ok let me try out solrcloud as I will need to make sure it works well > with > > nutch too.. > > > > Thanks for the help.. > > > > > > On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian wrote: > > > > > Why wouldn't SolrCloud help you here? You can setup shards and replicas > > etc > > > to have redundancy b/c HDFS isn't designed to serve real time queries > as > > > far as I understand. If you are using HDFS as a backup mechanism to me > > > you'd be better served having multiple slaves tethered to a master (in > a > > > non-cloud environment) or setup SolrCloud either option would give you > > more > > > redundancy than copying an index to HDFS. > > > > > > - Amit > > > > > > > > > On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim wrote: > > > > > > > Hi Upayavira, > > > > > > > > sure, let me explain. I am setting up Nutch and SOLR in hadoop > > > environment. > > > > Since I am using hdfs, in the event if there is any crashes to the > > > > localhost(running solr), i will still have the shards of data being > > > stored > > > > in hdfs. > > > > > > > > Thanks you so much =) > > > > > > > > On Thu, Mar 7, 2013 at 1:19 AM, Upayavira wrote: > > > > > > > > > What are you actually trying to achieve? If you can share what you > > are > > > > > trying to achieve maybe folks can help you find the right way to do > > it. > > > > > > > > > > Upayavira > > > > > > > > > > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote: > > > > > > Hello Otis , > > > > > > > > > > > > Is there any configuration where it will index into hdfs instead? 
> > > > > > > > > > > > I tried crawlzilla and lily but I hope to update specific > package > > > such > > > > > > as > > > > > > Hadoop only or nutch only when there are updates. > > > > > > > > > > > > That's y would prefer to install separately . > > > > > > > > > > > > Thanks so much. Looking forward for your reply. > > > > > > > > > > > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote: > > > > > > > > > > > > > Hello Joseph, > > > > > > > > > > > > > > You can certainly put them there, as in: > > > > > > > hadoop fs -copyFromLocal URI > > > > > > > > > > > > > > But searching such an index will be slow. > > > > > > > See also: http://katta.sourceforge.net/ > > > > > > > > > > > > > > Otis > > > > > > > -- > > > > > > > Solr & ElasticSearch Support > > > > > > > http://sematext.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > Would like to know how can i put the indexed solr shards into > > > hdfs? > > > > > > > > > > > > > > > > Thanks.. > > > > > > > > > > > > > > > > Joseph > > > > *Joseph* > > > -- Best Regards, *Joseph*
Re: SOLR on hdfs
Joseph,

Doing what Otis said will do literally what you want, which is copying the index to HDFS. It's no different from copying it to a different machine, which, by the way, is what Solr's master/slave replication scheme does. Alternatively, people are starting to set up new Solr instances with SolrCloud, which doesn't have the concept of master/slave but rather a series of nodes with the option of having replicas (what I believe to be backup nodes), so that you have the redundancy you want.

Honestly, HDFS used the way you are describing is probably no different from storing your Solr index on RAIDed storage, though I don't pretend to know much about RAID arrays.

What exactly are you trying to achieve from a systems perspective? Why do you want Hadoop in the mix here, and how does copying the index to HDFS help you? If SolrCloud seems complicated, try just setting up a simple master/slave replication scheme; that's really easy.

Cheers
Amit

On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim wrote:
> [quoted text hidden]
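For readers finding this in the archive: the master/slave scheme Amit describes is implemented by Solr's ReplicationHandler, where slaves poll a master and pull index changes over HTTP. A minimal sketch of inspecting and triggering a pull from the slave side — the host name and core name here are placeholders, not values from this thread:

```shell
# Ask a slave core for its replication status: generation, index version,
# configured master URL, and whether a fetch is in progress.
curl 'http://slave-host:8983/solr/collection1/replication?command=details&wt=json'

# Force the slave to pull the latest index from its configured master now,
# instead of waiting for the next poll interval.
curl 'http://slave-host:8983/solr/collection1/replication?command=fetchindex'
```

These commands assume a running Solr instance whose solrconfig.xml already defines the /replication handler with a masterUrl; they are not runnable against HDFS.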
Re: SOLR on hdfs
Hi Amit,

So you mean that if I just want redundancy for Solr in HDFS, the best way to do it is what Otis suggested, using the following command:

hadoop fs -copyFromLocal URI

OK, let me try out SolrCloud, as I will need to make sure it works well with Nutch too.

Thanks for the help.

On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian wrote:
> [quoted text hidden]

--
Best Regards,
*Joseph*
Re: SOLR on hdfs
Why wouldn't SolrCloud help you here? You can set up shards, replicas, etc. to get redundancy, because HDFS isn't designed to serve real-time queries as far as I understand. If you are using HDFS as a backup mechanism, to me you'd be better served having multiple slaves tethered to a master (in a non-cloud environment) or setting up SolrCloud; either option would give you more redundancy than copying an index to HDFS.

- Amit

On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim wrote:
> [quoted text hidden]
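For readers finding this in the archive: the SolrCloud setup Amit suggests, shards plus replicas, is created through the Collections API. A sketch, assuming a SolrCloud node is already running on localhost; the collection name is a hypothetical example, not from this thread:

```shell
# Create a collection split into 2 shards, with 2 replicas of each shard,
# so losing any one node leaves a full copy of the index available.
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=crawl_docs&numShards=2&replicationFactor=2'

# Confirm the shard/replica layout that was created.
curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=crawl_docs&wt=json'
```

The trade-off versus copying an index to HDFS is that these replicas are live: they serve queries and stay current as documents are indexed, rather than being a point-in-time backup.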
Re: SOLR on hdfs
Hi Upayavira,

Sure, let me explain. I am setting up Nutch and Solr in a Hadoop environment. Since I am using HDFS, in the event of any crash on the local host running Solr, I will still have the shards of data stored in HDFS.

Thank you so much =)

On Thu, Mar 7, 2013 at 1:19 AM, Upayavira wrote:
> [quoted text hidden]

--
Best Regards,
*Joseph*
Re: SOLR on hdfs
What are you actually trying to achieve? If you can share what you are trying to achieve, maybe folks can help you find the right way to do it.

Upayavira

On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> [quoted text hidden]
Re: SOLR on hdfs
Hello Otis,

Is there any configuration where it will index into HDFS instead?

I tried Crawlzilla and Lily, but I would like to be able to update a specific package, such as Hadoop only or Nutch only, when there are updates. That's why I would prefer to install them separately.

Thanks so much. Looking forward to your reply.

On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> [quoted text hidden]

--
Best Regards,
*Joseph*
Re: SOLR on hdfs
Hello Joseph,

You can certainly put them there, as in:

hadoop fs -copyFromLocal URI

But searching such an index will be slow.
See also: http://katta.sourceforge.net/

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim wrote:
> [quoted text hidden]
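Spelled out with its arguments, `-copyFromLocal` takes a local source path followed by an HDFS destination URI. A sketch of copying a core's index directory into HDFS — both paths here are hypothetical examples, not paths from this thread:

```shell
# Create a destination directory in HDFS for the index copy.
hadoop fs -mkdir -p /backups/solr/collection1

# Copy the core's on-disk index from the local filesystem into HDFS.
# The source should not be receiving writes while the copy runs,
# or the resulting snapshot may be inconsistent.
hadoop fs -copyFromLocal /var/solr/data/collection1/data /backups/solr/collection1/

# Verify the segment files arrived.
hadoop fs -ls /backups/solr/collection1/data
```

This gives a point-in-time copy only; as Otis notes, serving queries directly from such an index would be slow.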
Re: SOLR on hdfs
Hi,
I would like to know how I can put the indexed Solr shards into HDFS.

Thanks.

Joseph

On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" wrote:
> [quoted text hidden]
Re: SOLR on hdfs
Hi Joseph,

What exactly are you looking to do?
See http://incubator.apache.org/blur/

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim wrote:

> Hi, I am running the Hadoop distributed file system; how do I put my output of
> the Solr dir into HDFS automatically?
>
> Thanks so much.
>
> --
> Best Regards,
> *Joseph*