Re: Solr on HDFS

2019-08-02 Thread Kevin Risden
>
> If you think about it, having a shard with 3 replicas on top of a file
> system that does 3x replication seems a little excessive!


https://issues.apache.org/jira/browse/SOLR-6305 should help here. I can
take a look at merging the patch since it looks like it has been helpful to
others.
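For context on what such a change has to plug into: the Hadoop FileSystem client already accepts a per-file replication factor at create time, independent of the cluster-wide dfs.replication default. A minimal, hypothetical sketch of that capability in Java (made-up path; this is not the SOLR-6305 patch itself):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
    public static void main(String[] args) throws Exception {
        // Hypothetical file under a Solr index root on HDFS.
        Path file = new Path("hdfs://hdfs-cluster/solr/example/_0.cfs");
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(file.toUri(), conf);
             // create(path, overwrite, bufferSize, replication, blockSize):
             // the 4th argument sets this file's HDFS replication factor to 1.
             FSDataOutputStream out = fs.create(file, true, 4096, (short) 1, 128L * 1024 * 1024)) {
            out.writeBytes("example payload");
        }
    }
}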


Kevin Risden


On Fri, Aug 2, 2019 at 10:09 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi Kyle - Thank you.
>
> Our current index is split across 3 solr collections; our largest
> collection is 26.8TBytes (80.5TBytes when 3x replicated in HDFS) across
> 100 shards.  There are 40 machines hosting this cluster. We've found
> that when dealing with large collections having no replicas (but lots of
> shards) ends up being more reliable since there is a much smaller
> recovery time.  We keep another 30 day index (1.4TBytes) that does have
> replicas (40 shards, 3 replicas each), and if a node goes down, we
> manually delete lock files and then bring it back up and yes - lots of
> network IO, but it usually recovers OK.
>
> Having a large collection like this with no replicas seems like a recipe
> for disaster.  So, we've been experimenting with the latest version
> (8.2) and our index process to split up the data into many solr
> collections that do have replicas, and then build the list of
> collections to search at query time.  Our searches are date based, so we
> can define what collections we want to query at query time. As a test,
> we ran just two machines, HDFS, and 500 collections. One server ran out
> of memory and crashed.  We had over 1,600 lock files to delete.
>
> If you think about it, having a shard with 3 replicas on top of a file
> system that does 3x replication seems a little excessive! I'd love to
> see Solr take more advantage of a shared FS.  Perhaps an idea is to use
> HDFS but with an NFS gateway.  Seems like that may be slow.
> Architecturally, I love only having one large file system to manage
> instead of lots of individual file systems across many machines.  HDFS
> makes this easy.
>
> -Joe
>
> On 8/2/2019 9:10 AM, lstusr 5u93n4 wrote:
> > Hi Joe,
> >
> > We fought with Solr on HDFS for quite some time, and faced similar issues
> > as you're seeing. (See this thread, for example:
> >
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e
> >   )
> >
> > The Solr lock files on HDFS get deleted if the Solr server gets shut down
> > gracefully, but we couldn't always guarantee that in our environment so
> we
> > ended up writing a custom startup script to search for lock files on HDFS
> > and delete them before solr startup.
> >
> > However, the issue that you mention of the Solr server rebuilding its
> whole
> > index from replicas on startup was enough of a show-stopper for us that
> we
> > switched away from HDFS to local disk. It literally made the difference
> > between 24+ hours of recovery time after an unexpected outage to less
> than
> > a minute...
> >
> > If you do end up finding a solution to this issue, please post it to this
> > mailing list, because there are others out there (like us!) who would
> most
> > definitely make use of it.
> >
> > Thanks
> >
> > Kyle
> >
> > On Fri, 2 Aug 2019 at 08:58, Joe Obernberger <
> joseph.obernber...@gmail.com>
> > wrote:
> >
> >> Thank you.  No, while the cluster is using Cloudera for HDFS, we do not
> >> use Cloudera to manage the solr cluster.  If it is a
> >> configuration/architecture issue, what can I do to fix it?  I'd like a
> >> system where servers can come and go, but the indexes stay available and
> >> recover automatically.  Is that possible with HDFS?
> >> While adding an alias to other collections would be an option, if that
> >> collection is the only collection, or one that is currently needed, in a
> >> live system, we can't bring it down, re-create it, and re-index when
> >> that process may take weeks to do.
> >>
> >> Any ideas?
> >>
> >> -Joe
> >>
> >> On 8/1/2019 6:15 PM, Angie Rabelero wrote:
> >>> I don’t think you’re using Cloudera or Ambari, but Ambari has an option
> >> to delete the locks. This seems more a configuration/architecture issue
> >> than a reliability issue. You may want to spin up an alias while you bring
> >> down, clear locks and directories, recreate and index the affected
> >> collection, while you work your other issues.
> >>> On Aug 1, 2019, at 16:40, Joe 

Re: Solr on HDFS

2019-08-02 Thread Joe Obernberger

Hi Kyle - Thank you.

Our current index is split across 3 solr collections; our largest 
collection is 26.8TBytes (80.5TBytes when 3x replicated in HDFS) across 
100 shards.  There are 40 machines hosting this cluster. We've found 
that when dealing with large collections having no replicas (but lots of 
shards) ends up being more reliable since there is a much smaller 
recovery time.  We keep another 30 day index (1.4TBytes) that does have 
replicas (40 shards, 3 replicas each), and if a node goes down, we 
manually delete lock files and then bring it back up and yes - lots of 
network IO, but it usually recovers OK.


Having a large collection like this with no replicas seems like a recipe 
for disaster.  So, we've been experimenting with the latest version 
(8.2) and our index process to split up the data into many solr 
collections that do have replicas, and then build the list of 
collections to search at query time.  Our searches are date based, so we 
can define what collections we want to query at query time. As a test, 
we ran just two machines, HDFS, and 500 collections. One server ran out 
of memory and crashed.  We had over 1,600 lock files to delete.
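For anyone who wants to try the same approach, a rough SolrJ sketch of querying a date-scoped set of collections in one request; the ZooKeeper address and collection names below are made up, and the comma-separated collection parameter is what spreads the request across collections:

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateScopedQuery {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            SolrQuery q = new SolrQuery("body:example");
            // Fan the request out over only the collections covering the date range.
            q.set("collection", "events_2019_07,events_2019_08");
            QueryResponse rsp = client.query("events_2019_08", q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}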


If you think about it, having a shard with 3 replicas on top of a file 
system that does 3x replication seems a little excessive! I'd love to 
see Solr take more advantage of a shared FS.  Perhaps an idea is to use 
HDFS but with an NFS gateway.  Seems like that may be slow.  
Architecturally, I love only having one large file system to manage 
instead of lots of individual file systems across many machines.  HDFS 
makes this easy.


-Joe

On 8/2/2019 9:10 AM, lstusr 5u93n4 wrote:

Hi Joe,

We fought with Solr on HDFS for quite some time, and faced similar issues
as you're seeing. (See this thread, for example:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e
  )

The Solr lock files on HDFS get deleted if the Solr server gets shut down
gracefully, but we couldn't always guarantee that in our environment so we
ended up writing a custom startup script to search for lock files on HDFS
and delete them before solr startup.

However, the issue that you mention of the Solr server rebuilding its whole
index from replicas on startup was enough of a show-stopper for us that we
switched away from HDFS to local disk. It literally made the difference
between 24+ hours of recovery time after an unexpected outage to less than
a minute...

If you do end up finding a solution to this issue, please post it to this
mailing list, because there are others out there (like us!) who would most
definitely make use of it.

Thanks

Kyle

On Fri, 2 Aug 2019 at 08:58, Joe Obernberger 
wrote:


Thank you.  No, while the cluster is using Cloudera for HDFS, we do not
use Cloudera to manage the solr cluster.  If it is a
configuration/architecture issue, what can I do to fix it?  I'd like a
system where servers can come and go, but the indexes stay available and
recover automatically.  Is that possible with HDFS?
While adding an alias to other collections would be an option, if that
collection is the only collection, or one that is currently needed, in a
live system, we can't bring it down, re-create it, and re-index when
that process may take weeks to do.

Any ideas?

-Joe

On 8/1/2019 6:15 PM, Angie Rabelero wrote:

I don’t think you’re using Cloudera or Ambari, but Ambari has an option

to delete the locks. This seems more a configuration/architecture issue
than a reliability issue. You may want to spin up an alias while you bring
down, clear locks and directories, recreate and index the affected
collection, while you work your other issues.

On Aug 1, 2019, at 16:40, Joe Obernberger 

wrote:

Been using Solr on HDFS for a while now, and I'm seeing an issue with

redundancy/reliability.  If a server goes down, when it comes back up, it
will never recover because of the lock files in HDFS. That solr node needs
to be brought down manually, the lock files deleted, and then brought back
up.  At that point, it appears to copy all the data for its replicas.  If
the index is large, and new data is being indexed, in some cases it will
never recover. The replication retries over and over.

How can we make a reliable Solr Cloud cluster when using HDFS that can

handle servers coming and going?

Thank you!

-Joe






Re: Solr on HDFS

2019-08-02 Thread lstusr 5u93n4
Hi Joe,

We fought with Solr on HDFS for quite some time, and faced similar issues
as you're seeing. (See this thread, for example:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rqk...@mail.gmail.com%3e
 )

The Solr lock files on HDFS get deleted if the Solr server gets shut down
gracefully, but we couldn't always guarantee that in our environment so we
ended up writing a custom startup script to search for lock files on HDFS
and delete them before solr startup.
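For what it's worth, a minimal sketch of that kind of cleanup using the Hadoop FileSystem API (the Solr root path is made up, and something like this should only run while Solr is down, since a live core legitimately holds its lock):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HdfsLockCleaner {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr index root on HDFS.
        Path solrRoot = new Path("hdfs://hdfs-cluster/solr");
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(solrRoot.toUri(), conf)) {
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(solrRoot, true); // recursive
            while (files.hasNext()) {
                LocatedFileStatus status = files.next();
                // The HDFS lock factory leaves a "write.lock" file in each core's index dir.
                if (status.isFile() && "write.lock".equals(status.getPath().getName())) {
                    System.out.println("Removing stale lock: " + status.getPath());
                    fs.delete(status.getPath(), false);
                }
            }
        }
    }
}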

However, the issue that you mention of the Solr server rebuilding its whole
index from replicas on startup was enough of a show-stopper for us that we
switched away from HDFS to local disk. It literally made the difference
between 24+ hours of recovery time after an unexpected outage to less than
a minute...

If you do end up finding a solution to this issue, please post it to this
mailing list, because there are others out there (like us!) who would most
definitely make use of it.

Thanks

Kyle

On Fri, 2 Aug 2019 at 08:58, Joe Obernberger 
wrote:

> Thank you.  No, while the cluster is using Cloudera for HDFS, we do not
> use Cloudera to manage the solr cluster.  If it is a
> configuration/architecture issue, what can I do to fix it?  I'd like a
> system where servers can come and go, but the indexes stay available and
> recover automatically.  Is that possible with HDFS?
> While adding an alias to other collections would be an option, if that
> collection is the only collection, or one that is currently needed, in a
> live system, we can't bring it down, re-create it, and re-index when
> that process may take weeks to do.
>
> Any ideas?
>
> -Joe
>
> On 8/1/2019 6:15 PM, Angie Rabelero wrote:
> > I don’t think you’re using Cloudera or Ambari, but Ambari has an option
> to delete the locks. This seems more a configuration/architecture issue
> than a reliability issue. You may want to spin up an alias while you bring
> down, clear locks and directories, recreate and index the affected
> collection, while you work your other issues.
> >
> > On Aug 1, 2019, at 16:40, Joe Obernberger 
> wrote:
> >
> > Been using Solr on HDFS for a while now, and I'm seeing an issue with
> redundancy/reliability.  If a server goes down, when it comes back up, it
> will never recover because of the lock files in HDFS. That solr node needs
> to be brought down manually, the lock files deleted, and then brought back
> up.  At that point, it appears to copy all the data for its replicas.  If
> the index is large, and new data is being indexed, in some cases it will
> never recover. The replication retries over and over.
> >
> > How can we make a reliable Solr Cloud cluster when using HDFS that can
> handle servers coming and going?
> >
> > Thank you!
> >
> > -Joe
> >
> >
> >
> >
>


Re: Solr on HDFS

2019-08-02 Thread Joe Obernberger
Thank you.  No, while the cluster is using Cloudera for HDFS, we do not 
use Cloudera to manage the solr cluster.  If it is a
configuration/architecture issue, what can I do to fix it?  I'd like a 
system where servers can come and go, but the indexes stay available and 
recover automatically.  Is that possible with HDFS?
While adding an alias to other collections would be an option, if that 
collection is the only collection, or one that is currently needed, in a 
live system, we can't bring it down, re-create it, and re-index when 
that process may take weeks to do.


Any ideas?

-Joe

On 8/1/2019 6:15 PM, Angie Rabelero wrote:

I don’t think you’re using Cloudera or Ambari, but Ambari has an option to
delete the locks. This seems more a configuration/architecture issue than a
reliability issue. You may want to spin up an alias while you bring down, clear
locks and directories, recreate and index the affected collection, while you
work your other issues.

On Aug 1, 2019, at 16:40, Joe Obernberger  wrote:

Been using Solr on HDFS for a while now, and I'm seeing an issue with 
redundancy/reliability.  If a server goes down, when it comes back up, it will 
never recover because of the lock files in HDFS. That solr node needs to be 
brought down manually, the lock files deleted, and then brought back up.  At 
that point, it appears to copy all the data for its replicas.  If the index is 
large, and new data is being indexed, in some cases it will never recover. The 
replication retries over and over.

How can we make a reliable Solr Cloud cluster when using HDFS that can handle 
servers coming and going?

Thank you!

-Joe






Re: Solr on HDFS

2019-08-01 Thread Angie Rabelero
I don’t think you’re using Cloudera or Ambari, but Ambari has an option to
delete the locks. This seems more a configuration/architecture issue than a
reliability issue. You may want to spin up an alias while you bring down, clear
locks and directories, recreate and index the affected collection, while you
work your other issues.

On Aug 1, 2019, at 16:40, Joe Obernberger  wrote:

Been using Solr on HDFS for a while now, and I'm seeing an issue with 
redundancy/reliability.  If a server goes down, when it comes back up, it will 
never recover because of the lock files in HDFS. That solr node needs to be 
brought down manually, the lock files deleted, and then brought back up.  At 
that point, it appears to copy all the data for its replicas.  If the index is 
large, and new data is being indexed, in some cases it will never recover. The 
replication retries over and over.

How can we make a reliable Solr Cloud cluster when using HDFS that can handle 
servers coming and going?

Thank you!

-Joe




Solr on HDFS

2019-08-01 Thread Joe Obernberger
Been using Solr on HDFS for a while now, and I'm seeing an issue with 
redundancy/reliability.  If a server goes down, when it comes back up, 
it will never recover because of the lock files in HDFS. That solr node 
needs to be brought down manually, the lock files deleted, and then 
brought back up.  At that point, it appears to copy all the data for its 
replicas.  If the index is large, and new data is being indexed, in some 
cases it will never recover. The replication retries over and over.


How can we make a reliable Solr Cloud cluster when using HDFS that can 
handle servers coming and going?


Thank you!

-Joe



Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?

2018-08-27 Thread zhenyuan wei
@Shawn Heisey  Yeah, deleting the "write.lock" files manually worked in the end.
@Walter Underwood  Do you have any recent performance evaluation of Solr on HDFS
vs LocalFS?

On Tue, Aug 28, 2018 at 4:10 AM, Shawn Heisey wrote:

> On 8/26/2018 7:47 PM, zhenyuan wei wrote:
> >  I found an exception when running Solr on HDFS. The details:
> > Solr was running on HDFS and document updates were running continuously;
> > then kill -9 the Solr JVM (or reboot/shut down the Linux OS), then restart everything.
>
> If you use "kill -9" to stop a Solr instance, the lockfile will get left
> behind and you may have difficulty starting Solr back up on ANY kind of
> filesystem until you delete the file in each core's data directory.  The
> filename defaults to "write.lock" if you don't change it.
>
> Thanks,
> Shawn
>
>


Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?

2018-08-27 Thread Shawn Heisey

On 8/26/2018 7:47 PM, zhenyuan wei wrote:

 I found an exception when running Solr on HDFS. The details:
Solr was running on HDFS and document updates were running continuously;
then kill -9 the Solr JVM (or reboot/shut down the Linux OS), then restart everything.


If you use "kill -9" to stop a Solr instance, the lockfile will get left 
behind and you may have difficulty starting Solr back up on ANY kind of 
filesystem until you delete the file in each core's data directory.  The 
filename defaults to "write.lock" if you don't change it.


Thanks,
Shawn



Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?

2018-08-27 Thread Walter Underwood
I accidentally put my Solr indexes on NFS once about ten years ago.
It was 100X slower. I would not recommend that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 27, 2018, at 1:39 AM, zhenyuan wei  wrote:
> 
> Thanks for your answer! @Erick Erickson 
> So, it's not recommended to run Solr on NFS (like HDFS) now? Maybe
> because of crash errors or performance problems.
> I had a look at SOLR-8335 and SOLR-8169; there is no good solution for this
> now, and maybe manual removal is the best option?
> 
> 
> On Mon, Aug 27, 2018 at 11:41 AM, Erick Erickson wrote:
> 
>> Because HDFS doesn't follow the file semantics that Solr expects.
>> 
>> There's quite a bit of background here:
>> https://issues.apache.org/jira/browse/SOLR-8335
>> 
>> Best,
>> Erick
>> On Sun, Aug 26, 2018 at 6:47 PM zhenyuan wei  wrote:
>>> 
>>> Hi all,
>>> I found an exception when running Solr on HDFS. The details:
>>> Solr was running on HDFS and document updates were running continuously;
>>> then kill -9 the Solr JVM (or reboot/shut down the Linux OS), then restart everything.
>>> The exception appears like:
>>> 
>>> 2018-08-26 22:23:12.529 ERROR
>>> 
>> (coreContainerWorkExecutor-2-thread-1-processing-n:cluster-node001:8983_solr)
>>> [   ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on
>>> startup
>>> org.apache.solr.common.SolrException: Unable to create core
>>> [collection002_shard56_replica_n110]
>>>at
>>> 
>> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1061)
>>>at
>>> org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640)
>>>at
>>> 
>> com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
>>>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>at
>>> 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
>>>at
>>> 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
>>>at
>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
>>>at java.lang.Thread.run(Thread.java:834)
>>> Caused by: org.apache.solr.common.SolrException: Index dir
>>> 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
>>> 'collection002_shard56_replica_n110' is already locked. The most likely
>>> cause is another Solr server (or another solr core in this server) also
>>> configured to use this directory; other possible causes may be specific
>> to
>>> lockType: hdfs
>>>at org.apache.solr.core.SolrCore.(SolrCore.java:1009)
>>>at org.apache.solr.core.SolrCore.(SolrCore.java:864)
>>>at
>>> 
>> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040)
>>>... 7 more
>>> Caused by: org.apache.lucene.store.LockObtainFailedException: Index dir
>>> 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
>>> 'collection002_shard56_replica_n110' is already locked. The most likely
>>> cause is another Solr server (or another solr core in this server) also
>>> configured to use this directory; other possible causes may be specific
>> to
>>> lockType: hdfs
>>>at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:746)
>>>at org.apache.solr.core.SolrCore.(SolrCore.java:955)
>>>... 9 more
>>> 
>>> 
>>> In fact, if I print out the HDFS API level exception stack, it reports:
>>> 
>>> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException:
>>> /solr/collection002/core_node17/data/index/write.lock for client
>>> 192.168.0.12 already exists
>>>at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
>>>at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
>>>at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
>>>at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:623)
>>>at
>>> 

Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?

2018-08-27 Thread zhenyuan wei
Thanks for your answer! @Erick Erickson 
So, it's not recommended to run Solr on NFS (like HDFS) now? Maybe
because of crash errors or performance problems.
I had a look at SOLR-8335 and SOLR-8169; there is no good solution for this
now, and maybe manual removal is the best option?


On Mon, Aug 27, 2018 at 11:41 AM, Erick Erickson wrote:

> Because HDFS doesn't follow the file semantics that Solr expects.
>
> There's quite a bit of background here:
> https://issues.apache.org/jira/browse/SOLR-8335
>
> Best,
> Erick
> On Sun, Aug 26, 2018 at 6:47 PM zhenyuan wei  wrote:
> >
> > Hi all,
> > I found an exception when running Solr on HDFS. The details:
> > Solr was running on HDFS and document updates were running continuously;
> > then kill -9 the Solr JVM (or reboot/shut down the Linux OS), then restart everything.
> > The exception appears like:
> >
> > 2018-08-26 22:23:12.529 ERROR
> >
> (coreContainerWorkExecutor-2-thread-1-processing-n:cluster-node001:8983_solr)
> > [   ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on
> > startup
> > org.apache.solr.common.SolrException: Unable to create core
> > [collection002_shard56_replica_n110]
> > at
> >
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1061)
> > at
> > org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640)
> > at
> >
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> >
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> > at java.lang.Thread.run(Thread.java:834)
> > Caused by: org.apache.solr.common.SolrException: Index dir
> > 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
> > 'collection002_shard56_replica_n110' is already locked. The most likely
> > cause is another Solr server (or another solr core in this server) also
> > configured to use this directory; other possible causes may be specific
> to
> > lockType: hdfs
> > at org.apache.solr.core.SolrCore.(SolrCore.java:1009)
> > at org.apache.solr.core.SolrCore.(SolrCore.java:864)
> > at
> >
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040)
> > ... 7 more
> > Caused by: org.apache.lucene.store.LockObtainFailedException: Index dir
> > 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
> > 'collection002_shard56_replica_n110' is already locked. The most likely
> > cause is another Solr server (or another solr core in this server) also
> > configured to use this directory; other possible causes may be specific
> to
> > lockType: hdfs
> > at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:746)
> > at org.apache.solr.core.SolrCore.(SolrCore.java:955)
> > ... 9 more
> >
> >
> > In fact, if I print out the HDFS API level exception stack, it reports:
> >
> > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException:
> > /solr/collection002/core_node17/data/index/write.lock for client
> > 192.168.0.12 already exists
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:623)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
> > at
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> >

Re: An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?

2018-08-26 Thread Erick Erickson
Because HDFS doesn't follow the file semantics that Solr expects.

There's quite a bit of background here:
https://issues.apache.org/jira/browse/SOLR-8335

Best,
Erick
On Sun, Aug 26, 2018 at 6:47 PM zhenyuan wei  wrote:
>
> Hi all,
> I found an exception when running Solr on HDFS. The details:
> Solr was running on HDFS and document updates were running continuously;
> then kill -9 the Solr JVM (or reboot/shut down the Linux OS), then restart everything.
> The exception appears like:
>
> 2018-08-26 22:23:12.529 ERROR
> (coreContainerWorkExecutor-2-thread-1-processing-n:cluster-node001:8983_solr)
> [   ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on
> startup
> org.apache.solr.common.SolrException: Unable to create core
> [collection002_shard56_replica_n110]
> at
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1061)
> at
> org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640)
> at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.solr.common.SolrException: Index dir
> 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
> 'collection002_shard56_replica_n110' is already locked. The most likely
> cause is another Solr server (or another solr core in this server) also
> configured to use this directory; other possible causes may be specific to
> lockType: hdfs
> at org.apache.solr.core.SolrCore.(SolrCore.java:1009)
> at org.apache.solr.core.SolrCore.(SolrCore.java:864)
> at
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040)
> ... 7 more
> Caused by: org.apache.lucene.store.LockObtainFailedException: Index dir
> 'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
> 'collection002_shard56_replica_n110' is already locked. The most likely
> cause is another Solr server (or another solr core in this server) also
> configured to use this directory; other possible causes may be specific to
> lockType: hdfs
> at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:746)
> at org.apache.solr.core.SolrCore.(SolrCore.java:955)
> ... 9 more
>
>
> In fact, if I print out the HDFS API level exception stack, it reports:
>
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException:
> /solr/collection002/core_node17/data/index/write.lock for client
> 192.168.0.12 already exists
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:623)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>
> at sun.reflect.GeneratedConstructorAccessor140.newInstance(Unknown
> Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce

An exception when running Solr on HDFS,why a solr server can not recognize the write.lock file is created by itself before?

2018-08-26 Thread zhenyuan wei
Hi all,
I found an exception when running Solr on HDFS. The details:
Solr was running on HDFS and document updates were running continuously;
then kill -9 the Solr JVM (or reboot/shut down the Linux OS), then restart everything.
The exception appears like:

2018-08-26 22:23:12.529 ERROR
(coreContainerWorkExecutor-2-thread-1-processing-n:cluster-node001:8983_solr)
[   ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on
startup
org.apache.solr.common.SolrException: Unable to create core
[collection002_shard56_replica_n110]
at
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1061)
at
org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640)
at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: Index dir
'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
'collection002_shard56_replica_n110' is already locked. The most likely
cause is another Solr server (or another solr core in this server) also
configured to use this directory; other possible causes may be specific to
lockType: hdfs
at org.apache.solr.core.SolrCore.(SolrCore.java:1009)
at org.apache.solr.core.SolrCore.(SolrCore.java:864)
at
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040)
... 7 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Index dir
'hdfs://hdfs-cluster/solr/collection002/core_node113/data/index/' of core
'collection002_shard56_replica_n110' is already locked. The most likely
cause is another Solr server (or another solr core in this server) also
configured to use this directory; other possible causes may be specific to
lockType: hdfs
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:746)
at org.apache.solr.core.SolrCore.(SolrCore.java:955)
... 9 more


In fact, if I print out the HDFS API level exception stack, it reports:

Caused by: org.apache.hadoop.fs.FileAlreadyExistsException:
/solr/collection002/core_node17/data/index/write.lock for client
192.168.0.12 already exists
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2563)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2450)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2334)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:623)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1727)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

at sun.reflect.GeneratedConstructorAccessor140.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1839)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
at
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
at
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.

Re: Running Solr on HDFS - Disk space

2018-06-07 Thread Hendrik Haddorp
The only option should be to configure Solr to just have a replication
factor of 1 or HDFS to have no replication. I would go for the middle
and configure both to use a factor of 2. This way a single failure in
either HDFS or Solr is not a problem, while with a 1/3 or 3/1 setup a
single server failure would bring the collection down.


Setting the HDFS replication factor is a bit tricky, as Solr in some
places takes the default replication factor set on HDFS and sometimes
takes a default from the client side. HDFS allows you to set a
replication factor for every file individually.
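As an illustration of the per-file option, a small sketch that lowers the replication factor of files already sitting in an index directory via the Hadoop FileSystem API; the path and target factor are made up, and files Solr writes later still get whatever default its own HDFS client configuration carries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SetIndexReplication {
    public static void main(String[] args) throws Exception {
        // Hypothetical core index directory on HDFS and target replication factor.
        Path indexDir = new Path("hdfs://hdfs-cluster/solr/collection1/core_node1/data/index");
        short factor = 2;
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(indexDir.toUri(), conf)) {
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(indexDir, true);
            while (files.hasNext()) {
                // setReplication only changes existing files; new files use the client default.
                fs.setReplication(files.next().getPath(), factor);
            }
        }
    }
}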


regards,
Hendrik

On 07.06.2018 15:30, Shawn Heisey wrote:

On 6/7/2018 6:41 AM, Greenhorn Techie wrote:

As HDFS has got its own replication mechanism, with a HDFS replication
factor of 3, and then SolrCloud replication factor of 3, does that mean
each document will probably have around 9 copies replicated 
underneath of
HDFS? If so, is there a way to configure HDFS or Solr such that only 
three

copies are maintained overall?


Yes, that is exactly what happens.

SolrCloud replication assumes that each of its replicas is a 
completely independent index.  I am not aware of anything in Solr's 
HDFS support that can use one HDFS index directory for multiple 
replicas.  At the most basic level, a Solr index is a Lucene index.  
Lucene goes to great lengths to make sure that an index *CANNOT* be 
used in more than one place.


Perhaps somebody who is more familiar with HDFSDirectoryFactory can 
offer you a solution.  But as far as I know, there isn't one.


Thanks,
Shawn





Re: Running Solr on HDFS - Disk space

2018-06-07 Thread Shawn Heisey

On 6/7/2018 6:41 AM, Greenhorn Techie wrote:

As HDFS has got its own replication mechanism, with a HDFS replication
factor of 3, and then SolrCloud replication factor of 3, does that mean
each document will probably have around 9 copies replicated underneath of
HDFS? If so, is there a way to configure HDFS or Solr such that only three
copies are maintained overall?


Yes, that is exactly what happens.

SolrCloud replication assumes that each of its replicas is a completely 
independent index.  I am not aware of anything in Solr's HDFS support 
that can use one HDFS index directory for multiple replicas.  At the 
most basic level, a Solr index is a Lucene index.  Lucene goes to great 
lengths to make sure that an index *CANNOT* be used in more than one place.


Perhaps somebody who is more familiar with HDFSDirectoryFactory can 
offer you a solution.  But as far as I know, there isn't one.


Thanks,
Shawn



Running Solr on HDFS - Disk space

2018-06-07 Thread Greenhorn Techie
Hi,

As HDFS has got its own replication mechanism, with a HDFS replication
factor of 3, and then SolrCloud replication factor of 3, does that mean
each document will probably have around 9 copies replicated underneath of
HDFS? If so, is there a way to configure HDFS or Solr such that only three
copies are maintained overall?

Thanks


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
bq: We also had an HDFS setup already so it looked like a good option
to not lose data. Earlier we had a few cases where we lost the
machines so HDFS looked safer for that.

right, that's one of the places where using HDFS to back Solr makes a
lot of sense. The other approach is to just have replicas for each
shard distributed across different physical machines. But whatever
works is fine.

And there are a bunch of parameters you can tune both on HDFS and for
local file systems so "it's more an art than a science".

bq: Frequent adds with commits, which is likely not good in general
anyway, does look quite a bit slower than local storage so far.

I think you can go a long way towards fixing this by doing some
autowarming. I wouldn't want to open a new searcher every second and
do much autowarming over HDFS, but if you can stand less frequent
commits (say every minute?) you might be able to smooth out the
performance.
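One client-side way to get that "less frequent commits" behaviour is commitWithin instead of explicit commits per batch; a rough SolrJ sketch (the ZooKeeper address, collection name, and fields are made up):

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("body", "example");
            // Ask Solr to fold this update into a commit within the next 60 seconds
            // instead of issuing an explicit commit per batch.
            client.add("events", doc, 60_000);
        }
    }
}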

Best,
Erick

On Wed, Nov 22, 2017 at 11:31 AM, Hendrik Haddorp
 wrote:
> We actually use no auto warming. Our collections are pretty small and the
> query performance is not really a problem so far. We are using lots of
> collections and most Solr caches seem to be per core and not global so we
> also have a problem with caching. I have to test the HDFS cache some more as
> that should work cross collections.
>
> We also had an HDFS setup already so it looked like a good option to not
> lose data. Earlier we had a few cases where we lost the machines so HDFS
> looked safer for that.
>
> I would expect that the HDFS performance is also quite good if you have lots
> of document adds and not so frequent commits. Frequent adds with commits,
> which is likely not good in general anyway, does look quite a bit slower
> than local storage so far. As we didn't see that in our earlier tests, which
> were more query focused, I'd say it largely depends on what you are doing.
>
> Hendrik
>
> On 22.11.2017 18:41, Erick Erickson wrote:
>>
>> In my experience, for relatively static indexes the performance is
>> roughly similar. Once the data is read from whatever data source it's
>> in memory, where the data came from is (largely) secondary in
>> importance.
>>
>> In cases where there's a lot of I/O I expect HDFS to be slower, this
>> fits Hendrik's observation: "We now had a pattern with lots of small
>> updates and commits and that seems to be quite a bit slower". He's
>> merging segments and (presumably) autowarming frequently, implying
>> lots of I/O and HDFS adds an extra layer.
>>
>> Personally I'd use whichever is most convenient and see if the
>> performance was "good enough". I wouldn't recommend _installing_ HDFS
>> just to use it with Solr, why add another complication? If you need
>> the redundancy add replicas. If you already have the HDFS
>> infrastructure in place and using HDFS is easier than local storage,
>> feel free
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
>>  wrote:
>>>
>>> Hendrik,
>>>
>>> Thanks for your response.
>>>
>>> Regarding "But this seems to greatly depend on how your setup looks like
>>> and what actions you perform." May I know what factors influence this
>>> and what considerations need to be taken into account?
>>>
>>> Thanks
>>>
>>> On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
>>> wrote:
>>>
>>>> We did some testing and the performance was strangely even better with
>>>> HDFS than with the local file system. But this seems to greatly
>>>> depend on how your setup looks like and what actions you perform. We now
>>>> had a pattern with lots of small updates and commits and that seems to be
>>>> quite a bit slower. We are about to do performance testing on that now.
>>>>
>>>> The reason we switched to HDFS was largely connected to us using Docker
>>>> and Marathon/Mesos. With HDFS the data is in a shared file system and
>>>> thus it is possible to move the replica to a different instance on a
>>>> different host.
>>>>
>>>> regards,
>>>> Hendrik
>>>>
>>>> On 22.11.2017 14:59, Greenhorn Techie wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Good Afternoon!!
>>>>>
>>>>> While the discussion around issues related to "Solr on HDFS" is live, I
>>>>> would like to understand if anyone has done any performance
>>>>> benchmarking
>>>>> for both Solr indexing and search between HDFS vs local file system.
>>>>>
>>>>> Also, from experience, what would the community folks suggest? Solr on
>>>>> local file system or Solr on HDFS? Has anyone done a comparative study
>>>>> of
>>>>> these choices?
>>>>>
>>>>> Thanks
>>>>>
>>>>
>


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We actually use no auto warming. Our collections are pretty small and 
the query performance is not really a problem so far. We are using lots 
of collections and most Solr caches seem to be per core and not global 
so we also have a problem with caching. I have to test the HDFS cache 
some more as that should work cross collections.


We also had an HDFS setup already so it looked like a good option to not 
lose data. Earlier we had a few cases where we lost the machines so HDFS
looked safer for that.


I would expect that the HDFS performance is also quite good if you have 
lots of document adds and not so frequent commits. Frequent adds with 
commits, which is likely not good in general anyway, does look quite a 
bit slower then local storage so far. As we didn't see that in our 
earlier tests, which were more, query focused, I said it large depends 
on what you are doing.


Hendrik

On 22.11.2017 18:41, Erick Erickson wrote:

In my experience, for relatively static indexes the performance is
roughly similar. Once the data is read from whatever data source it's
in memory, where the data came from is (largely) secondary in
importance.

In cases where there's a lot of I/O I expect HDFS to be slower, this
fits Hendrik's observation: "We now had a pattern with lots of small
updates and commits and that seems to be quite a bit slower". He's
merging segments and (presumably) autowarming frequently, implying
lots of I/O and HDFS adds an extra layer.

Personally I'd use whichever is most convenient and see if the
performance was "good enough". I wouldn't recommend _installing_ HDFS
just to use it with Solr, why add another complication? If you need
the redundancy add replicas. If you already have the HDFS
infrastructure in place and using HDFS is easier than local storage,
feel free

Best,
Erick


On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
 wrote:

Hendrik,

Thanks for your response.

Regarding "But this seems to greatly depend on how your setup looks like
and what actions you perform." May I know what factors influence this
and what considerations need to be taken into account?

Thanks

On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
wrote:


We did some testing and the performance was strangely even better with
HDFS than with the local file system. But this seems to greatly
depend on how your setup looks like and what actions you perform. We now
had a pattern with lots of small updates and commits and that seems to be
quite a bit slower. We are about to do performance testing on that now.

The reason we switched to HDFS was largely connected to us using Docker
and Marathon/Mesos. With HDFS the data is in a shared file system and
thus it is possible to move the replica to a different instance on a
different host.

regards,
Hendrik

On 22.11.2017 14:59, Greenhorn Techie wrote:

Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks







Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
In my experience, for relatively static indexes the performance is
roughly similar. Once the data is read from whatever data source it's
in memory, where the data came from is (largely) secondary in
importance.

In cases where there's a lot of I/O I expect HDFS to be slower, this
fits Hendrik's observation: "We now had a pattern with lots of small
updates and commits and that seems to be quite a bit slower". He's
merging segments and (presumably) autowarming frequently, implying
lots of I/O and HDFS adds an extra layer.

Personally I'd use whichever is most convenient and see if the
performance was "good enough". I wouldn't recommend _installing_ HDFS
just to use it with Solr, why add another complication? If you need
the redundancy add replicas. If you already have the HDFS
infrastructure in place and using HDFS is easier than local storage,
feel free

Best,
Erick


On Wed, Nov 22, 2017 at 8:06 AM, Greenhorn Techie
 wrote:
> Hendrik,
>
> Thanks for your response.
>
> Regarding "But this seems to greatly depend on how your setup looks like
> and what actions you perform." May I know what factors influence this
> and what considerations need to be taken into account?
>
> Thanks
>
> On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
> wrote:
>
>> We did some testing and the performance was strangely even better with
>> HDFS than with the local file system. But this seems to greatly
>> depend on how your setup looks like and what actions you perform. We now
>> had a pattern with lots of small updates and commits and that seems to be
>> quite a bit slower. We are about to do performance testing on that now.
>>
>> The reason we switched to HDFS was largely connected to us using Docker
>> and Marathon/Mesos. With HDFS the data is in a shared file system and
>> thus it is possible to move the replica to a different instance on a
>> different host.
>>
>> regards,
>> Hendrik
>>
>> On 22.11.2017 14:59, Greenhorn Techie wrote:
>> > Hi,
>> >
>> > Good Afternoon!!
>> >
>> > While the discussion around issues related to "Solr on HDFS" is live, I
>> > would like to understand if anyone has done any performance benchmarking
>> > for both Solr indexing and search between HDFS vs local file system.
>> >
>> > Also, from experience, what would the community folks suggest? Solr on
>> > local file system or Solr on HDFS? Has anyone done a comparative study of
>> > these choices?
>> >
>> > Thanks
>> >
>>
>>


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hendrik,

Thanks for your response.

Regarding "But this seems to greatly depend on how your setup looks like
and what actions you perform." May I know what factors influence this
and what considerations need to be taken into account?

Thanks

On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp 
wrote:

> We did some testing and the performance was strangely even better with
> HDFS than with the local file system. But this seems to greatly
> depend on how your setup looks like and what actions you perform. We now
> had a pattern with lots of small updates and commits and that seems to be
> quite a bit slower. We are about to do performance testing on that now.
>
> The reason we switched to HDFS was largely connected to us using Docker
> and Marathon/Mesos. With HDFS the data is in a shared file system and
> thus it is possible to move the replica to a different instance on a
> different host.
>
> regards,
> Hendrik
>
> On 22.11.2017 14:59, Greenhorn Techie wrote:
> > Hi,
> >
> > Good Afternoon!!
> >
> > While the discussion around issues related to "Solr on HDFS" is live, I
> > would like to understand if anyone has done any performance benchmarking
> > for both Solr indexing and search between HDFS vs local file system.
> >
> > Also, from experience, what would the community folks suggest? Solr on
> > local file system or Solr on HDFS? Has anyone done a comparative study of
> > these choices?
> >
> > Thanks
> >
>
>


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We did some testing and the performance was strangely even better with 
HDFS than with the local file system. But this seems to greatly
depend on how your setup looks like and what actions you perform. We now 
had a pattern with lots of small updates and commits and that seems to be
quite a bit slower. We are about to do performance testing on that now.


The reason we switched to HDFS was largely connected to us using Docker 
and Marathon/Mesos. With HDFS the data is in a shared file system and 
thus it is possible to move the replica to a different instance on a
different host.


regards,
Hendrik

On 22.11.2017 14:59, Greenhorn Techie wrote:

Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks





Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-22 Thread Hendrik Haddorp

I'm also not really an HDFS expert but I believe it is slightly different:

The HDFS data is replicated, let's say 3 times, between the HDFS data
nodes but for an HDFS client it looks like one directory and it is 
hidden that the data is replicated. Every client should see the same 
data. Just like every client should see the same data in ZooKeeper 
(every ZK node also has a full replica). So with 2 replicas there should 
only be two disjoint data sets. Thus it should not matter which solr 
node claims the replica and then continues where things were left off. Solr
should only be concerned about the replication between the solr replicas 
but not about the replication between the HDFS data nodes, just as it 
does not have to deal with the replication between the ZK nodes.


Anyhow, for now I would be happy if my patch for SOLR-10092 could get 
included soon, as the auto add replica feature does not work at all
for me without it :-)


On 22.02.2017 16:15, Erick Erickson wrote:

bq: in the non-HDFS case that sounds logical but in the HDFS case all
the index data is in the shared HDFS file system

That's not really the point, and it's not quite true. The Solr index is
unique _per replica_. So replica1 points to an HDFS directory (that's
triply replicated to be sure). replica2 points to a totally different
set of index files. So with the default replication of 3 your two
replicas will have 6 copies of the index that are totally disjoint in
two sets of three. From Solr's point of view, the fact that HDFS
replicates the data doesn't really alter much.

Autoaddreplica will indeed be able to re-use the HDFS data if a
Solr node goes away. But that doesn't change the replication issue I
described.

At least that's my understanding, I admit I'm not an HDFS guy and it
may be out of date.

Erick

On Tue, Feb 21, 2017 at 10:30 PM, Hendrik Haddorp
 wrote:

Hi Erick,

in the non-HDFS case that sounds logical but in the HDFS case all the index
data is in the shared HDFS file system. Even the transaction logs should be
in there. So the node that once had the replica should not really have more
information than any other node, especially if legacyCloud is set to false
so that ZooKeeper is the source of truth.

regards,
Hendrik

On 22.02.2017 02:28, Erick Erickson wrote:

Hendrik:

bq: Not really sure why one replica needs to be up though.

I didn't write the code so I'm guessing a bit, but consider the
situation where you have no replicas for a shard up and add a new one.
Eventually it could become the leader but there would have been no
>> chance for it to check if its version of the index was up to date.
But since it would be the leader, when other replicas for that shard
_do_ come on line they'd replicate the index down from the newly added
replica, possibly using very old data.

FWIW,
Erick

On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp
 wrote:

Hi,

I had opened SOLR-10092
(https://issues.apache.org/jira/browse/SOLR-10092)
>>> for this a while ago. I was now able to get this feature working with a
very
small code change. After a few seconds Solr reassigns the replica to a
different Solr instance as long as one replica is still up. Not really
sure
why one replica needs to be up though. I added the patch based on Solr
6.3
to the bug report. Would be great if it could be merged soon.

regards,
Hendrik

On 19.01.2017 17:08, Hendrik Haddorp wrote:

HDFS is like a shared filesystem so every Solr Cloud instance can access
the data using the same path or URL. The clusterstate.json looks like
this:

"shards":{"shard1":{
  "range":"8000-7fff",
  "state":"active",
  "replicas":{
"core_node1":{
  "core":"test1.collection-0_shard1_replica1",
"dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
  "base_url":"http://slave3:9000/solr";,
  "node_name":"slave3:9000_solr",
  "state":"active",


"ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
"core_node2":{
  "core":"test1.collection-0_shard1_replica2",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
  "base_url":"http://slave2:9000/solr";,
  "node_name":"slave2:9000_solr",
  "state":"active",


"ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
  "leader":"true"},
"core_node3":{
  "core":"test1.collection-0_shard1_replica3",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
  "base_url":"http://slave4:9005/solr";,
  "node_name":"slave4:9005_solr",
  "state":"active",


"ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored
in ZK, pretty much the same as for none HDFS setups. Just as the data is
not
stored locally but on the network and as the path d

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-22 Thread Erick Erickson
bq: in the non-HDFS case that sounds logical but in the HDFS case all
the index data is in the shared HDFS file system

That's not really the point, and it's not quite true. The Solr index is
unique _per replica_. So replica1 points to an HDFS directory (that's
triply replicated to be sure). replica2 points to a totally different
set of index files. So with the default replication of 3 your two
replicas will have 6 copies of the index that are totally disjoint in
two sets of three. From Solr's point of view, the fact that HDFS
replicates the data doesn't really alter much.

Autoaddreplica will indeed try to re-use the HDFS data if a
Solr node goes away. But that doesn't change the replication issue I
described.

At least that's my understanding, I admit I'm not an HDFS guy and it
may be out of date.

Erick
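
A small SolrJ sketch that makes this visible is to walk the cluster state and print
the dataDir each replica points at; on HDFS every replica still has its own directory.
The ZooKeeper address and collection name below are placeholders, and client
construction differs a bit between SolrJ versions:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ListReplicaDataDirs {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
      client.connect();
      DocCollection coll = client.getZkStateReader()
          .getClusterState().getCollection("test1.collection-0");
      for (Slice slice : coll.getSlices()) {
        for (Replica replica : slice.getReplicas()) {
          // Each replica carries its own dataDir; on HDFS these are distinct
          // directories even though the file system itself is shared.
          System.out.println(slice.getName() + " " + replica.getName()
              + " state=" + replica.getState()
              + " dataDir=" + replica.getStr("dataDir"));
        }
      }
    }
  }
}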

On Tue, Feb 21, 2017 at 10:30 PM, Hendrik Haddorp
 wrote:
> Hi Erick,
>
> in the non-HDFS case that sounds logical, but in the HDFS case all the index
> data is in the shared HDFS file system. Even the transaction logs should be
> in there. So the node that once had the replica should not really have more
> information than any other node, especially if legacyCloud is set to false
> so that ZooKeeper is the source of truth.
>
> regards,
> Hendrik
>
> On 22.02.2017 02:28, Erick Erickson wrote:
>>
>> Hendrik:
>>
>> bq: Not really sure why one replica needs to be up though.
>>
>> I didn't write the code so I'm guessing a bit, but consider the
>> situation where you have no replicas for a shard up and add a new one.
>> Eventually it could become the leader but there would have been no
>> chance for it to check if its version of the index was up to date.
>> But since it would be the leader, when other replicas for that shard
>> _do_ come online they'd replicate the index down from the newly added
>> replica, possibly using very old data.
>>
>> FWIW,
>> Erick
>>
>> On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp
>>  wrote:
>>>
>>> Hi,
>>>
>>> I had opened SOLR-10092
>>> (https://issues.apache.org/jira/browse/SOLR-10092)
>>> for this a while ago. I was now able to get this feature working with a
>>> very
>>> small code change. After a few seconds Solr reassigns the replica to a
>>> different Solr instance as long as one replica is still up. Not really
>>> sure
>>> why one replica needs to be up though. I added the patch based on Solr
>>> 6.3
>>> to the bug report. Would be great if it could be merged soon.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 19.01.2017 17:08, Hendrik Haddorp wrote:

 HDFS is like a shared filesystem so every Solr Cloud instance can access
 the data using the same path or URL. The clusterstate.json looks like
 this:

 "shards":{"shard1":{
  "range":"8000-7fff",
  "state":"active",
  "replicas":{
"core_node1":{
  "core":"test1.collection-0_shard1_replica1",
 "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
  "base_url":"http://slave3:9000/solr";,
  "node_name":"slave3:9000_solr",
  "state":"active",


 "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
"core_node2":{
  "core":"test1.collection-0_shard1_replica2",
 "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
  "base_url":"http://slave2:9000/solr";,
  "node_name":"slave2:9000_solr",
  "state":"active",


 "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
  "leader":"true"},
"core_node3":{
  "core":"test1.collection-0_shard1_replica3",
 "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
  "base_url":"http://slave4:9005/solr";,
  "node_name":"slave4:9005_solr",
  "state":"active",


 "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

 So every replica is always assigned to one node and this is being stored
 in ZK, pretty much the same as for non-HDFS setups. Just as the data is
 not
 stored locally but on the network and as the path does not contain any
 node
 information you can of course easily take over the work to a different
 Solr
 node. You should just need to update the owner of the replica in ZK and
 you
 should basically be done, I assume. That's why the documentation states
 that
 an advantage of using HDFS is that a failing node can be replaced by a
 different one. The Overseer just has to move the ownership of the
 replica,
 which seems like what the code is trying to do. There just seems to be a
 bug
 in the code so that the core does not get created on the target node.

 Each data directory also contains a lock file. The

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Hendrik Haddorp

Hi Erick,

in the non-HDFS case that sounds logical, but in the HDFS case all the 
index data is in the shared HDFS file system. Even the transaction logs 
should be in there. So the node that once had the replica should not 
really have more information than any other node, especially if 
legacyCloud is set to false so that ZooKeeper is the source of truth.


regards,
Hendrik

On 22.02.2017 02:28, Erick Erickson wrote:

Hendrik:

bq: Not really sure why one replica needs to be up though.

I didn't write the code so I'm guessing a bit, but consider the
situation where you have no replicas for a shard up and add a new one.
Eventually it could become the leader but there would have been no
chance for it to check if its version of the index was up to date.
But since it would be the leader, when other replicas for that shard
_do_ come online they'd replicate the index down from the newly added
replica, possibly using very old data.

FWIW,
Erick

On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp
 wrote:

Hi,

I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092)
for this a while ago. I was now able to get this feature working with a very
small code change. After a few seconds Solr reassigns the replica to a
different Solr instance as long as one replica is still up. Not really sure
why one replica needs to be up though. I added the patch based on Solr 6.3
to the bug report. Would be great if it could be merged soon.

regards,
Hendrik

On 19.01.2017 17:08, Hendrik Haddorp wrote:

HDFS is like a shared filesystem so every Solr Cloud instance can access
the data using the same path or URL. The clusterstate.json looks like this:

"shards":{"shard1":{
 "range":"8000-7fff",
 "state":"active",
 "replicas":{
   "core_node1":{
 "core":"test1.collection-0_shard1_replica1",
"dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
 "base_url":"http://slave3:9000/solr";,
 "node_name":"slave3:9000_solr",
 "state":"active",

"ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
   "core_node2":{
 "core":"test1.collection-0_shard1_replica2",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
 "base_url":"http://slave2:9000/solr";,
 "node_name":"slave2:9000_solr",
 "state":"active",

"ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
 "leader":"true"},
   "core_node3":{
 "core":"test1.collection-0_shard1_replica3",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
 "base_url":"http://slave4:9005/solr";,
 "node_name":"slave4:9005_solr",
 "state":"active",

"ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored
in ZK, pretty much the same as for non-HDFS setups. Just as the data is not
stored locally but on the network and as the path does not contain any node
information you can of course easily take over the work to a different Solr
node. You should just need to update the owner of the replica in ZK and you
should basically be done, I assume. That's why the documentation states that
an advantage of using HDFS is that a failing node can be replaced by a
different one. The Overseer just has to move the ownership of the replica,
which seems like what the code is trying to do. There just seems to be a bug
in the code so that the core does not get created on the target node.

Each data directory also contains a lock file. The documentation states
that one should use the HdfsLockFactory, which unfortunately can easily lead
to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup
is however also easily done but seems to require a node restart to take
effect. But I'm also only recently playing around with all this ;-)

regards,
Hendrik

On 19.01.2017 16:40, Shawn Heisey wrote:

On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

Given that the data is on HDFS it shouldn't matter if any active
replica is left as the data does not need to get transferred from
another instance but the new core will just take over the existing
data. Thus a replication factor of 1 should also work just in that
case the shard would be down until the new core is up. Anyhow, it
looks like the above call is missing to set the shard id I guess or
some code is checking wrongly.

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Erick Erickson
Hendrik:

bq: Not really sure why one replica needs to be up though.

I didn't write the code so I'm guessing a bit, but consider the
situation where you have no replicas for a shard up and add a new one.
Eventually it could become the leader but there would have been no
chance for it to check if its version of the index was up to date.
But since it would be the leader, when other replicas for that shard
_do_ come online they'd replicate the index down from the newly added
replica, possibly using very old data.

FWIW,
Erick

On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp
 wrote:
> Hi,
>
> I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092)
> for this a while ago. I was now able to get this feature working with a very
> small code change. After a few seconds Solr reassigns the replica to a
> different Solr instance as long as one replica is still up. Not really sure
> why one replica needs to be up though. I added the patch based on Solr 6.3
> to the bug report. Would be great if it could be merged soon.
>
> regards,
> Hendrik
>
> On 19.01.2017 17:08, Hendrik Haddorp wrote:
>>
>> HDFS is like a shared filesystem so every Solr Cloud instance can access
>> the data using the same path or URL. The clusterstate.json looks like this:
>>
>> "shards":{"shard1":{
>> "range":"8000-7fff",
>> "state":"active",
>> "replicas":{
>>   "core_node1":{
>> "core":"test1.collection-0_shard1_replica1",
>> "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
>> "base_url":"http://slave3:9000/solr";,
>> "node_name":"slave3:9000_solr",
>> "state":"active",
>>
>> "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
>>   "core_node2":{
>> "core":"test1.collection-0_shard1_replica2",
>> "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
>> "base_url":"http://slave2:9000/solr";,
>> "node_name":"slave2:9000_solr",
>> "state":"active",
>>
>> "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
>> "leader":"true"},
>>   "core_node3":{
>> "core":"test1.collection-0_shard1_replica3",
>> "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
>> "base_url":"http://slave4:9005/solr";,
>> "node_name":"slave4:9005_solr",
>> "state":"active",
>>
>> "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"
>>
>> So every replica is always assigned to one node and this is being stored
>> in ZK, pretty much the same as for non-HDFS setups. Just as the data is not
>> stored locally but on the network and as the path does not contain any node
>> information you can of course easily take over the work to a different Solr
>> node. You should just need to update the owner of the replica in ZK and you
>> should basically be done, I assume. That's why the documentation states that
>> an advantage of using HDFS is that a failing node can be replaced by a
>> different one. The Overseer just has to move the ownership of the replica,
>> which seems like what the code is trying to do. There just seems to be a bug
>> in the code so that the core does not get created on the target node.
>>
>> Each data directory also contains a lock file. The documentation states
>> that one should use the HdfsLockFactory, which unfortunately can easily lead
>> to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup
>> is however also easily done but seems to require a node restart to take
>> effect. But I'm also only recently playing around with all this ;-)
>>
>> regards,
>> Hendrik
>>
>> On 19.01.2017 16:40, Shawn Heisey wrote:
>>>
>>> On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

 Given that the data is on HDFS it shouldn't matter if any active
 replica is left as the data does not need to get transferred from
 another instance but the new core will just take over the existing
 data. Thus a replication factor of 1 should also work just in that
 case the shard would be down until the new core is up. Anyhow, it
 looks like the above call is missing to set the shard id I guess or
 some code is checking wrongly.
>>>
>>> I know very little about how SolrCloud interacts with HDFS, so although
>>> I'm reasonably certain about what comes below, I could be wrong.
>>>
>>> I have not ever heard of SolrCloud being able to automatically take over
>>> an existing index directory when it creates a replica, or even share
>>> index directories unless the admin fools it into doing so without its
>>> knowledge.  Sharing an index directory for replicas with SolrCloud would
>>> NOT work correctly.  Solr must be able to update all replicas
>>> independently, which means that each of them will lock its index
>>> directory and write to it.
>>>
>>> It is my understandin

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Hendrik Haddorp

Hi,

I had opened SOLR-10092 
(https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago. 
I was now able to get this feature working with a very small code change. 
After a few seconds Solr reassigns the replica to a different Solr 
instance as long as one replica is still up. Not really sure why one 
replica needs to be up though. I added the patch based on Solr 6.3 to 
the bug report. Would be great if it could be merged soon.


regards,
Hendrik

On 19.01.2017 17:08, Hendrik Haddorp wrote:
HDFS is like a shared filesystem so every Solr Cloud instance can 
access the data using the same path or URL. The clusterstate.json 
looks like this:


"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
  "core_node1":{
"core":"test1.collection-0_shard1_replica1",
"dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
"base_url":"http://slave3:9000/solr";,
"node_name":"slave3:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"}, 


  "core_node2":{
"core":"test1.collection-0_shard1_replica2",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
"base_url":"http://slave2:9000/solr";,
"node_name":"slave2:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog", 


"leader":"true"},
  "core_node3":{
"core":"test1.collection-0_shard1_replica3",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
"base_url":"http://slave4:9005/solr";,
"node_name":"slave4:9005_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog" 



So every replica is always assigned to one node and this is being 
stored in ZK, pretty much the same as for non-HDFS setups. Just as 
the data is not stored locally but on the network and as the path does 
not contain any node information you can of course easily take over 
the work to a different Solr node. You should just need to update the 
owner of the replica in ZK and you should basically be done, I assume. 
That's why the documentation states that an advantage of using HDFS is 
that a failing node can be replaced by a different one. The Overseer 
just has to move the ownership of the replica, which seems like what 
the code is trying to do. There just seems to be a bug in the code so 
that the core does not get created on the target node.


Each data directory also contains a lock file. The documentation 
states that one should use the HdfsLockFactory, which unfortunately 
can easily lead to SOLR-8335, which hopefully will be fixed by 
SOLR-8169. A manual cleanup is however also easily done but seems to 
require a node restart to take effect. But I'm also only recently 
playing around with all this ;-)


regards,
Hendrik

On 19.01.2017 16:40, Shawn Heisey wrote:

On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

Given that the data is on HDFS it shouldn't matter if any active
replica is left as the data does not need to get transferred from
another instance but the new core will just take over the existing
data. Thus a replication factor of 1 should also work just in that
case the shard would be down until the new core is up. Anyhow, it
looks like the above call is missing to set the shard id I guess or
some code is checking wrongly.

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

It is my understanding (from reading messages on mailing lists) that
when using HDFS, Solr replicas are all separate and consume additional
disk space, just like on a regular filesystem.

I found the code that generates the "No shard id" exception, but my
knowledge of how the zookeeper code in Solr works is not deep enough to
understand what it means or how to fix it.

Thanks,
Shawn







Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Hendrik Haddorp
HDFS is like a shared filesystem so every Solr Cloud instance can access 
the data using the same path or URL. The clusterstate.json looks like this:


"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
  "core_node1":{
"core":"test1.collection-0_shard1_replica1",
"dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
"base_url":"http://slave3:9000/solr";,
"node_name":"slave3:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
  "core_node2":{
"core":"test1.collection-0_shard1_replica2",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
"base_url":"http://slave2:9000/solr";,
"node_name":"slave2:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
"leader":"true"},
  "core_node3":{
"core":"test1.collection-0_shard1_replica3",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
"base_url":"http://slave4:9005/solr";,
"node_name":"slave4:9005_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored 
in ZK, pretty much the same as for non-HDFS setups. Just as the data is 
not stored locally but on the network and as the path does not contain 
any node information you can of course easily take over the work to a 
different Solr node. You should just need to update the owner of the 
replica in ZK and you should basically be done, I assume. That's why the 
documentation states that an advantage of using HDFS is that a failing 
node can be replaced by a different one. The Overseer just has to move 
the ownership of the replica, which seems like what the code is trying 
to do. There just seems to be a bug in the code so that the core does 
not get created on the target node.


Each data directory also contains a lock file. The documentation states 
that one should use the HdfsLockFactory, which unfortunately can easily 
lead to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual 
cleanup is however also easily done but seems to require a node restart 
to take effect. But I'm also only recently playing around with all this ;-)


regards,
Hendrik
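
For the manual lock cleanup mentioned above, something along these lines with the
Hadoop FileSystem API is one option. The namenode URI and collection path are
placeholders, and it must only be run while the Solr node that owns the index is
down, otherwise a live index can be corrupted:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HdfsLockCleanup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://master:8000"), conf);
    RemoteIterator<LocatedFileStatus> files =
        fs.listFiles(new Path("/test1.collection-0"), true);
    while (files.hasNext()) {
      Path path = files.next().getPath();
      // HdfsLockFactory leaves a write.lock file in each index directory;
      // a stale one blocks the core from loading after an unclean shutdown.
      if ("write.lock".equals(path.getName())) {
        System.out.println("Deleting stale lock: " + path);
        fs.delete(path, false);
      }
    }
    fs.close();
  }
}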

On 19.01.2017 16:40, Shawn Heisey wrote:

On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

Given that the data is on HDFS it shouldn't matter if any active
replica is left as the data does not need to get transferred from
another instance but the new core will just take over the existing
data. Thus a replication factor of 1 should also work just in that
case the shard would be down until the new core is up. Anyhow, it
looks like the above call is missing to set the shard id I guess or
some code is checking wrongly.

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

It is my understanding (from reading messages on mailing lists) that
when using HDFS, Solr replicas are all separate and consume additional
disk space, just like on a regular filesystem.

I found the code that generates the "No shard id" exception, but my
knowledge of how the zookeeper code in Solr works is not deep enough to
understand what it means or how to fix it.

Thanks,
Shawn





Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Shawn Heisey
On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:
> Given that the data is on HDFS it shouldn't matter if any active
> replica is left as the data does not need to get transferred from
> another instance but the new core will just take over the existing
> data. Thus a replication factor of 1 should also work just in that
> case the shard would be down until the new core is up. Anyhow, it
> looks like the above call is missing to set the shard id I guess or
> some code is checking wrongly. 

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

It is my understanding (from reading messages on mailing lists) that
when using HDFS, Solr replicas are all separate and consume additional
disk space, just like on a regular filesystem.

I found the code that generates the "No shard id" exception, but my
knowledge of how the zookeeper code in Solr works is not deep enough to
understand what it means or how to fix it.

Thanks,
Shawn



Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Hendrik Haddorp

Hi,
I'm seeing the same issue on Solr 6.3 using HDFS and a replication 
factor of 3, even though I believe a replication factor of 1 should work 
the same. When I stop a Solr instance this is detected and Solr actually 
wants to create a replica on a different instance. The command for that 
does however fail:


o.a.s.c.OverseerAutoReplicaFailoverThread Exception trying to create new 
replica on 
http://...:9000/solr:org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
Error from server at http://...:9000/solr: Error CREATEing SolrCore 
'test2.collection-09_shard1_replica1': Unable to create core 
[test2.collection-09_shard1_replica1] Caused by: No shard id for 
CoreDescriptor[name=test2.collection-09_shard1_replica1;instanceDir=/var/opt/solr/test2.collection-09_shard1_replica1]
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:593)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:251)
at 
org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.createSolrCore(OverseerAutoReplicaFailoverThread.java:456)
at 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.lambda$addReplica$0(OverseerAutoReplicaFailoverThread.java:251)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Given that the data is on HDFS it shouldn't matter if any active replica 
is left, as the data does not need to get transferred from another 
instance; the new core will just take over the existing data. Thus a 
replication factor of 1 should also work, just in that case the shard 
would be down until the new core is up. Anyhow, it looks like the above 
call fails to set the shard id, I guess, or some code is checking it 
wrongly.
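
As a manual fallback while this is broken, the replica can be added back to the
affected shard through the Collections API; a rough SolrJ sketch (collection and
shard names are placeholders, and the new core may be given a fresh data directory
rather than re-using the old one):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AddReplicaBack {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
      // Places one new replica of shard1 on a node chosen by the overseer.
      CollectionAdminRequest.AddReplica add =
          CollectionAdminRequest.addReplicaToShard("test2.collection-09", "shard1");
      System.out.println(add.process(client));
    }
  }
}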


On 14.01.2017 02:44, Shawn Heisey wrote:

On 1/13/2017 5:46 PM, Chetas Joshi wrote:

One of the things I have observed is: if I use the collection API to
create a replica for that shard, it does not complain about the config
which has been set to ReplicationFactor=1. If replication factor was
the issue as suggested by Shawn, shouldn't it complain?

The replicationFactor value is used by exactly two things:  initial
collection creation, and autoAddReplicas.  It will not affect ANY other
command or operation, including ADDREPLICA.  You can create MORE
replicas than replicationFactor indicates, and there will be no error
messages or warnings.

In order to have a replica automatically added, your replicationFactor
must be at least two, and the number of active replicas in the cloud for
a shard must be less than that number.  If that's the case and the
expiration times have been reached without recovery, then Solr will
automatically add replicas until there are at least as many replicas
operational as specified in replicationFactor.


I would also like to mention that I experience some instance dirs
getting deleted and also found this open bug
(https://issues.apache.org/jira/browse/SOLR-8905)

The description on that issue is incomprehensible.  I can't make any
sense out of it.  It mentions the core.properties file, but the error
message shown doesn't talk about the properties file at all.  The error
and issue description seem to have nothing at all to do with the code
lines that were quoted.  Also, it was reported on version 4.10.3 ... but
this is going to be significantly different from current 6.x versions,
and the 4.x versions will NOT be updated with bugfixes.

Thanks,
Shawn





Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-13 Thread Shawn Heisey
On 1/13/2017 5:46 PM, Chetas Joshi wrote:
> One of the things I have observed is: if I use the collection API to
> create a replica for that shard, it does not complain about the config
> which has been set to ReplicationFactor=1. If replication factor was
> the issue as suggested by Shawn, shouldn't it complain? 

The replicationFactor value is used by exactly two things:  initial
collection creation, and autoAddReplicas.  It will not affect ANY other
command or operation, including ADDREPLICA.  You can create MORE
replicas than replicationFactor indicates, and there will be no error
messages or warnings.

In order to have a replica automatically added, your replicationFactor
must be at least two, and the number of active replicas in the cloud for
a shard must be less than that number.  If that's the case and the
expiration times have been reached without recovery, then Solr will
automatically add replicas until there are at least as many replicas
operational as specified in replicationFactor.
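
In SolrJ terms that means creating the collection with a replicationFactor of at
least 2 and autoAddReplicas enabled; a sketch with placeholder names and sizes
(client construction varies across SolrJ versions):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateHdfsCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
      // replicationFactor must be >= 2 for autoAddReplicas to ever kick in;
      // shard counts and names here are examples only.
      CollectionAdminRequest.Create create = CollectionAdminRequest
          .createCollection("test1.collection-0", "myConfigSet", 80, 2)
          .setMaxShardsPerNode(4)
          .setAutoAddReplicas(true);
      System.out.println(create.process(client));
    }
  }
}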

> I would also like to mention that I experience some instance dirs
> getting deleted and also found this open bug
> (https://issues.apache.org/jira/browse/SOLR-8905) 

The description on that issue is incomprehensible.  I can't make any
sense out of it.  It mentions the core.properties file, but the error
message shown doesn't talk about the properties file at all.  The error
and issue description seem to have nothing at all to do with the code
lines that were quoted.  Also, it was reported on version 4.10.3 ... but
this is going to be significantly different from current 6.x versions,
and the 4.x versions will NOT be updated with bugfixes.

Thanks,
Shawn



Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-13 Thread Chetas Joshi
Erick, I have not changed any config. I have autoaddReplica = true for
individual collection config as well as the overall cluster config. Still,
it does not add a replica when I decommission a node.

Adding a replica is overseer's job. I looked at the logs of the overseer of
the solrCloud but could not find anything there as well.

I am doing some testing using different configs. I would be happy to share
my finding.

One of the things I have observed is: if I use the collection API to create
a replica for that shard, it does not complain about the config which has
been set to ReplicationFactor=1. If replication factor was the issue as
suggested by Shawn, shouldn't it complain?

I would also like to mention that I experience some instance dirs getting
deleted and also found this open bug (
https://issues.apache.org/jira/browse/SOLR-8905)

Thanks!

On Thu, Jan 12, 2017 at 9:50 AM, Erick Erickson 
wrote:

> Hmmm, have you changed any of the settings for autoAddReplicas? There
> are several parameters that govern how long before a replica would be
> added.
>
> But I suggest you use the Cloudera resources for this question, not
> only did they write this functionality, but Cloudera support is deeply
> embedded in HDFS and I suspect has _by far_ the most experience with
> it.
>
> And that said, anything you find out that would suggest good ways to
> clarify the docs would be most welcome!
>
> Best,
> Erick
>
> On Thu, Jan 12, 2017 at 8:42 AM, Shawn Heisey  wrote:
> > On 1/11/2017 7:14 PM, Chetas Joshi wrote:
> >> This is what I understand about how Solr works on HDFS. Please correct
> me
> >> if I am wrong.
> >>
> >> Although solr shard replication Factor = 1, HDFS default replication =
> 3.
> >> When the node goes down, the solr server running on that node goes down
> and
> >> hence the instance (core) representing the replica goes down. The data
> in
> >> on HDFS (distributed across all the datanodes of the hadoop cluster
> with 3X
> >> replication).  This is the reason why I have kept replicationFactor=1.
> >>
> >> As per the link:
> >> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
> >> One benefit to running Solr in HDFS is the ability to automatically add
> new
> >> replicas when the Overseer notices that a shard has gone down. Because
> the
> >> "gone" index shards are stored in HDFS, a new core will be created and
> the
> >> new core will point to the existing indexes in HDFS.
> >>
> >> This is the expected behavior of Solr overseer which I am not able to
> see.
> >> After a couple of hours a node was assigned to host the shard but the
> >> status of the shard is still "down" and the instance dir is missing on
> that
> >> node for that particular shard_replica.
> >
> > As I said before, I know very little about HDFS, so the following could
> > be wrong, but it makes sense so I'll say it:
> >
> > I would imagine that Solr doesn't know or care what your HDFS
> > replication is ... the only replicas it knows about are the ones that it
> > is managing itself.  The autoAddReplicas feature manages *SolrCloud*
> > replicas, not HDFS replicas.
> >
> > I have seen people say that multiple SolrCloud replicas will take up
> > additional space in HDFS -- they do not point at the same index files.
> > This is because proper Lucene operation requires that it lock an index
> > and prevent any other thread/process from writing to the index at the
> > same time.  When you index, SolrCloud updates all replicas independently
> > -- the only time indexes are replicated is when you add a new replica or
> > a serious problem has occurred and an index needs to be recovered.
> >
> > Thanks,
> > Shawn
> >
>


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-12 Thread Erick Erickson
Hmmm, have you changed any of the settings for autoAddReplicas? There
are several parameters that govern how long before a replica would be
added.

But I suggest you use the Cloudera resources for this question, not
only did they write this functionality, but Cloudera support is deeply
embedded in HDFS and I suspect has _by far_ the most experience with
it.

And that said, anything you find out that would suggest good ways to
clarify the docs would be most welcome!

Best,
Erick

On Thu, Jan 12, 2017 at 8:42 AM, Shawn Heisey  wrote:
> On 1/11/2017 7:14 PM, Chetas Joshi wrote:
>> This is what I understand about how Solr works on HDFS. Please correct me
>> if I am wrong.
>>
>> Although solr shard replication Factor = 1, HDFS default replication = 3.
>> When the node goes down, the solr server running on that node goes down and
>> hence the instance (core) representing the replica goes down. The data is
>> on HDFS (distributed across all the datanodes of the hadoop cluster with 3X
>> replication).  This is the reason why I have kept replicationFactor=1.
>>
>> As per the link:
>> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>> One benefit to running Solr in HDFS is the ability to automatically add new
>> replicas when the Overseer notices that a shard has gone down. Because the
>> "gone" index shards are stored in HDFS, a new core will be created and the
>> new core will point to the existing indexes in HDFS.
>>
>> This is the expected behavior of Solr overseer which I am not able to see.
>> After a couple of hours a node was assigned to host the shard but the
>> status of the shard is still "down" and the instance dir is missing on that
>> node for that particular shard_replica.
>
> As I said before, I know very little about HDFS, so the following could
> be wrong, but it makes sense so I'll say it:
>
> I would imagine that Solr doesn't know or care what your HDFS
> replication is ... the only replicas it knows about are the ones that it
> is managing itself.  The autoAddReplicas feature manages *SolrCloud*
> replicas, not HDFS replicas.
>
> I have seen people say that multiple SolrCloud replicas will take up
> additional space in HDFS -- they do not point at the same index files.
> This is because proper Lucene operation requires that it lock an index
> and prevent any other thread/process from writing to the index at the
> same time.  When you index, SolrCloud updates all replicas independently
> -- the only time indexes are replicated is when you add a new replica or
> a serious problem has occurred and an index needs to be recovered.
>
> Thanks,
> Shawn
>


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-12 Thread Shawn Heisey
On 1/11/2017 7:14 PM, Chetas Joshi wrote:
> This is what I understand about how Solr works on HDFS. Please correct me
> if I am wrong.
>
> Although solr shard replication Factor = 1, HDFS default replication = 3.
> When the node goes down, the solr server running on that node goes down and
> hence the instance (core) representing the replica goes down. The data is
> on HDFS (distributed across all the datanodes of the hadoop cluster with 3X
> replication).  This is the reason why I have kept replicationFactor=1.
>
> As per the link:
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
> One benefit to running Solr in HDFS is the ability to automatically add new
> replicas when the Overseer notices that a shard has gone down. Because the
> "gone" index shards are stored in HDFS, a new core will be created and the
> new core will point to the existing indexes in HDFS.
>
> This is the expected behavior of Solr overseer which I am not able to see.
> After a couple of hours a node was assigned to host the shard but the
> status of the shard is still "down" and the instance dir is missing on that
> node for that particular shard_replica.

As I said before, I know very little about HDFS, so the following could
be wrong, but it makes sense so I'll say it:

I would imagine that Solr doesn't know or care what your HDFS
replication is ... the only replicas it knows about are the ones that it
is managing itself.  The autoAddReplicas feature manages *SolrCloud*
replicas, not HDFS replicas.

I have seen people say that multiple SolrCloud replicas will take up
additional space in HDFS -- they do not point at the same index files. 
This is because proper Lucene operation requires that it lock an index
and prevent any other thread/process from writing to the index at the
same time.  When you index, SolrCloud updates all replicas independently
-- the only time indexes are replicated is when you add a new replica or
a serious problem has occurred and an index needs to be recovered.

Thanks,
Shawn
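
A quick way to see that on an HDFS-backed collection is to sum up each replica's
dataDir separately; the totals are per replica, before HDFS multiplies them again
by dfs.replication. URIs and names below are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ReplicaDiskUsage {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
      client.connect();
      DocCollection coll = client.getZkStateReader()
          .getClusterState().getCollection("test1.collection-0");
      for (Slice slice : coll.getSlices()) {
        for (Replica replica : slice.getReplicas()) {
          String dataDir = replica.getStr("dataDir");
          if (dataDir == null) continue; // not an HDFS-backed replica
          // FileSystem.get() caches instances, so this is cheap per replica.
          FileSystem fs = FileSystem.get(URI.create(dataDir), conf);
          long bytes = fs.getContentSummary(new Path(dataDir)).getLength();
          System.out.println(slice.getName() + "/" + replica.getName()
              + " = " + bytes + " bytes");
        }
      }
    }
  }
}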



Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-11 Thread Chetas Joshi
Hi Shawn,

This is what I understand about how Solr works on HDFS. Please correct me
if I am wrong.

Although solr shard replication Factor = 1, HDFS default replication = 3.
When the node goes down, the solr server running on that node goes down and
hence the instance (core) representing the replica goes down. The data is
on HDFS (distributed across all the datanodes of the hadoop cluster with 3X
replication).  This is the reason why I have kept replicationFactor=1.

As per the link:
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
One benefit to running Solr in HDFS is the ability to automatically add new
replicas when the Overseer notices that a shard has gone down. Because the
"gone" index shards are stored in HDFS, a new core will be created and the
new core will point to the existing indexes in HDFS.

This is the expected behavior of Solr overseer which I am not able to see.
After a couple of hours a node was assigned to host the shard but the
status of the shard is still "down" and the instance dir is missing on that
node for that particular shard_replica.

Thanks!

On Wed, Jan 11, 2017 at 5:03 PM, Shawn Heisey  wrote:

> On 1/11/2017 1:47 PM, Chetas Joshi wrote:
> > I have deployed a SolrCloud (solr 5.5.0) on hdfs using cloudera 5.4.7.
> The
> > cloud has 86 nodes.
> >
> > This is my config for the collection
> >
> > numShards=80
> > ReplicationFactor=1
> > maxShardsPerNode=1
> > autoAddReplica=true
> >
> > I recently decommissioned a node to resolve some disk issues. The shard
> > that was being hosted on that host is now being shown as "gone" on the
> solr
> > admin UI.
> >
> > I got the cluster status using the collection API. It says
> > shard: active, replica: down
> >
> > The overseer does not seem to be creating an extra core even though
> > autoAddReplica=true (
> > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS).
> >
> > Is this happening because the overseer sees the shard as active as
> > suggested by the cluster status?
> > If yes, is "autoAddReplica" not reliable? should I add a replica for this
> > shard when such cases arise?
>
> Your replicationFactor is one.  When there's one replica, you have no
> redundancy.  If that replica goes down, the shard is completely gone.
>
> As I understand it (I've got no experience with HDFS at all),
> autoAddReplicas is designed to automatically add replicas until
> replicationFactor is satisfied.  As already mentioned, your
> replicationFactor is one.  This means that it will always be satisfied.
>
> If autoAddReplicas were to kick in any time a replica went down, then
> Solr would be busy adding replicas anytime you restarted a node ...
> which would be a very bad idea.
>
> If your number of replicas is one, and that replica goes down, where
> would Solr go to get the data to create another replica?  The single
> replica is down, so there's nothing to copy from.  You might be thinking
> "from the leader" ... but a leader is nothing more than a replica that
> has been temporarily elected to have an extra job.  A replicationFactor
> of two doesn't mean a leader and two copies .. it means there are a
> total of two replicas, one of which is elected leader.
>
> If you want autoAddReplicas to work, you're going to need to have a
> replicationFactor of at least two, and you're probably going to have to
> delete the dead replica before another will be created.
>
> Thanks,
> Shawn
>
>


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-11 Thread Shawn Heisey
On 1/11/2017 1:47 PM, Chetas Joshi wrote:
> I have deployed a SolrCloud (solr 5.5.0) on hdfs using cloudera 5.4.7. The
> cloud has 86 nodes.
>
> This is my config for the collection
>
> numShards=80
> ReplicationFactor=1
> maxShardsPerNode=1
> autoAddReplica=true
>
> I recently decommissioned a node to resolve some disk issues. The shard
> that was being hosted on that host is now being shown as "gone" on the solr
> admin UI.
>
> I got the cluster status using the collection API. It says
> shard: active, replica: down
>
> The overseer does not seem to be creating an extra core even though
> autoAddReplica=true (
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS).
>
> Is this happening because the overseer sees the shard as active as
> suggested by the cluster status?
> If yes, is "autoAddReplica" not reliable? should I add a replica for this
> shard when such cases arise?

Your replicationFactor is one.  When there's one replica, you have no
redundancy.  If that replica goes down, the shard is completely gone.

As I understand it (I've got no experience with HDFS at all),
autoAddReplicas is designed to automatically add replicas until
replicationFactor is satisfied.  As already mentioned, your
replicationFactor is one.  This means that it will always be satisfied.

If autoAddReplicas were to kick in any time a replica went down, then
Solr would be busy adding replicas anytime you restarted a node ...
which would be a very bad idea.

If your number of replicas is one, and that replica goes down, where
would Solr go to get the data to create another replica?  The single
replica is down, so there's nothing to copy from.  You might be thinking
"from the leader" ... but a leader is nothing more than a replica that
has been temporarily elected to have an extra job.  A replicationFactor
of two doesn't mean a leader and two copies ... it means there are a
total of two replicas, one of which is elected leader.

If you want autoAddReplicas to work, you're going to need to have a
replicationFactor of at least two, and you're probably going to have to
delete the dead replica before another will be created.

Thanks,
Shawn
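
For the cleanup step, dropping the dead replica through the Collections API looks
roughly like this; the replica (core_node) name comes from the cluster state, and
all names here are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class DropDeadReplica {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
      // Remove the down replica so that autoAddReplicas (or a manual
      // ADDREPLICA) can place a fresh one somewhere else.
      CollectionAdminRequest.DeleteReplica delete =
          CollectionAdminRequest.deleteReplica("test1.collection-0", "shard1", "core_node1");
      System.out.println(delete.process(client));
    }
  }
}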



Solr on HDFS: AutoAddReplica does not add a replica

2017-01-11 Thread Chetas Joshi
Hello,

I have deployed a SolrCloud (solr 5.5.0) on hdfs using cloudera 5.4.7. The
cloud has 86 nodes.

This is my config for the collection

numShards=80
ReplicationFactor=1
maxShardsPerNode=1
autoAddReplica=true

I recently decommissioned a node to resolve some disk issues. The shard
that was being hosted on that host is now being shown as "gone" on the solr
admin UI.

I got the cluster status using the collection API. It says
shard: active, replica: down

The overseer does not seem to be creating an extra core even though
autoAddReplica=true (
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS).

Is this happening because the overseer sees the shard as active as
suggested by the cluster status?
If yes, is "autoAddReplica" not reliable? should I add a replica for this
shard when such cases arise?

Thanks!


Re: Solr on HDFS: Streaming API performance tuning

2016-12-19 Thread Joel Bernstein
I took another look at the stack trace and I'm pretty sure the issue is
with NULL values in one of the sort fields. The null pointer is occurring
during the comparison of sort values. See line 85 of:
https://github.com/apache/lucene-solr/blob/branch_5_5/solr/solrj/src/java/org/apache/solr/client/solrj/io/comp/FieldComparator.java

Joel Bernstein
http://joelsolr.blogspot.com/
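
A client-side workaround until the fix is available is to keep documents without
values for the sort fields out of the stream with a filter query. A rough sketch
against a recent SolrJ; the field, collection and ZooKeeper names are made up, and
the CloudSolrStream constructor signature differs slightly between versions:

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamWithoutNulls {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fl", "uuid,timestamp");
    params.set("sort", "timestamp asc,uuid asc");
    // Exclude documents that are missing either sort field.
    params.set("fq", "timestamp:[* TO *] AND uuid:[* TO *]");
    params.set("qt", "/export");

    CloudSolrStream stream = new CloudSolrStream(
        "zk1:2181,zk2:2181,zk3:2181/solr", "test1.collection-0", params);
    StreamContext context = new StreamContext();
    SolrClientCache cache = new SolrClientCache();
    context.setSolrClientCache(cache);
    stream.setStreamContext(context);
    try {
      stream.open();
      for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
        System.out.println(tuple.getString("uuid") + " " + tuple.getLong("timestamp"));
      }
    } finally {
      stream.close();
      cache.close();
    }
  }
}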

On Mon, Dec 19, 2016 at 4:43 PM, Chetas Joshi 
wrote:

> Hi Joel,
>
> I don't have any solr documents that have NULL values for the sort fields I
> use in my queries.
>
> Thanks!
>
> On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein 
> wrote:
>
> > Ok, based on the stack trace I suspect one of your sort fields has NULL
> > values, which in the 5x branch could produce null pointers if a segment
> had
> > no values for a sort field. This is also fixed in the Solr 6x branch.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi 
> > wrote:
> >
> > > Here is the stack trace.
> > >
> > > java.lang.NullPointerException
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.FieldComparator$2.
> > > compare(FieldComparator.java:85)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > > compare(FieldComparator.java:92)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > > compare(FieldComparator.java:30)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> > MultiComp.java:45)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> > MultiComp.java:33)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > > TupleWrapper.compareTo(CloudSolrStream.java:396)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > > TupleWrapper.compareTo(CloudSolrStream.java:381)
> > >
> > > at java.util.TreeMap.put(TreeMap.java:560)
> > >
> > > at java.util.TreeSet.add(TreeSet.java:255)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > > read(CloudSolrStream.java:366)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > > read(CloudSolrStream.java:353)
> > >
> > > at
> > >
> > > *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.
> > > scala:101)
> > >
> > > at java.lang.Thread.run(Thread.java:745)
> > >
> > > 16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent
> > > number:
> > > char=A,position=106596
> > > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> > >
> > > org.noggit.JSONParser$ParseException: missing exponent number:
> > > char=A,position=106596
> > > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> > >
> > > at org.noggit.JSONParser.err(JSONParser.java:356)
> > >
> > > at org.noggit.JSONParser.readExp(JSONParser.java:513)
> > >
> > > at org.noggit.JSONParser.readNumber(JSONParser.java:419)
> > >
> > > at org.noggit.JSONParser.next(JSONParser.java:845)
> > >
> > > at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
> > >
> > > at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
> > >
> > > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
> > >
> > > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> > > next(JSONTupleStream.java:84)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.SolrStream.read(
> > > SolrStream.java:147)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > TupleWrapper.next(
> > > CloudSolrStream.java:413)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > > read(CloudSolrStream.java:365)
> > >
> > > at
> > > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > > read(CloudSolrStream.java:353)
> > >
> > >
> > > Thanks!
> > >
> > > On Fri, Dec 16, 2016 at 11:45 PM, Reth RM 
> wrote:
> > >
> > > > If you could provide the json parse exception stack trace, it might
> > help
> > > to
> > > > predict issue there.
> > > >
> > > >
> > > > On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi <
> chetas.jo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Joel,
> > > > >
> > > > > The only NON alpha-numeric characters I have in my data are '+' and
> > > '/'.
> > > > I
> > > > > don't have any backslashes.
> > > > >
> > > > > If the special characters was the issue, I should get the JSON
> > parsing
> > > > > exceptions every time irrespective of the index size and
> irrespective
> > > of
> > > > > the available memory on the machine. That is not the case here. The
> > > > > streaming API successfully returns all the 

Re: Solr on HDFS: Streaming API performance tuning

2016-12-19 Thread Chetas Joshi
Hi Joel,

I don't have any solr documents that have NULL values for the sort fields I
use in my queries.

Thanks!

On Sun, Dec 18, 2016 at 12:56 PM, Joel Bernstein  wrote:

> Ok, based on the stack trace I suspect one of your sort fields has NULL
> values, which in the 5x branch could produce null pointers if a segment had
> no values for a sort field. This is also fixed in the Solr 6x branch.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi 
> wrote:
>
> > Here is the stack trace.
> >
> > java.lang.NullPointerException
> >
> > at
> > org.apache.solr.client.solrj.io.comp.FieldComparator$2.
> > compare(FieldComparator.java:85)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > compare(FieldComparator.java:92)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.FieldComparator.
> > compare(FieldComparator.java:30)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> MultiComp.java:45)
> >
> > at
> > org.apache.solr.client.solrj.io.comp.MultiComp.compare(
> MultiComp.java:33)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > TupleWrapper.compareTo(CloudSolrStream.java:396)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> > TupleWrapper.compareTo(CloudSolrStream.java:381)
> >
> > at java.util.TreeMap.put(TreeMap.java:560)
> >
> > at java.util.TreeSet.add(TreeSet.java:255)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > read(CloudSolrStream.java:366)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > read(CloudSolrStream.java:353)
> >
> > at
> >
> > *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.
> > scala:101)
> >
> > at java.lang.Thread.run(Thread.java:745)
> >
> > 16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent
> > number:
> > char=A,position=106596
> > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> >
> > org.noggit.JSONParser$ParseException: missing exponent number:
> > char=A,position=106596
> > BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> > AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
> >
> > at org.noggit.JSONParser.err(JSONParser.java:356)
> >
> > at org.noggit.JSONParser.readExp(JSONParser.java:513)
> >
> > at org.noggit.JSONParser.readNumber(JSONParser.java:419)
> >
> > at org.noggit.JSONParser.next(JSONParser.java:845)
> >
> > at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
> >
> > at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
> >
> > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
> >
> > at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> > next(JSONTupleStream.java:84)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.SolrStream.read(
> > SolrStream.java:147)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> TupleWrapper.next(
> > CloudSolrStream.java:413)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> > read(CloudSolrStream.java:365)
> >
> > at
> > org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> > read(CloudSolrStream.java:353)
> >
> >
> > Thanks!
> >
> > On Fri, Dec 16, 2016 at 11:45 PM, Reth RM  wrote:
> >
> > > If you could provide the json parse exception stack trace, it might
> help
> > to
> > > predict issue there.
> > >
> > >
> > > On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > The only NON alpha-numeric characters I have in my data are '+' and
> > '/'.
> > > I
> > > > don't have any backslashes.
> > > >
> > > > If the special characters was the issue, I should get the JSON
> parsing
> > > > exceptions every time irrespective of the index size and irrespective
> > of
> > > > the available memory on the machine. That is not the case here. The
> > > > streaming API successfully returns all the documents when the index
> > size
> > > is
> > > > small and fits in the available memory. That's the reason I am
> > confused.
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> > > > wrote:
> > > >
> > > > > The Streaming API may have been throwing exceptions because the
> JSON
> > > > > special characters were not escaped. This was fixed in Solr 6.0.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <
> > chetas.jo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am running Solr 5.5.0.
> > > > > > It is a solrCloud of 50 nodes a

Re: Solr on HDFS: Streaming API performance tuning

2016-12-18 Thread Joel Bernstein
Ok, based on the stack trace I suspect one of your sort fields has NULL
values, which in the 5x branch could produce null pointers if a segment had
no values for a sort field. This is also fixed in the Solr 6x branch.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, Dec 17, 2016 at 2:44 PM, Chetas Joshi 
wrote:

> Here is the stack trace.
>
> java.lang.NullPointerException
>
> at
> org.apache.solr.client.solrj.io.comp.FieldComparator$2.
> compare(FieldComparator.java:85)
>
> at
> org.apache.solr.client.solrj.io.comp.FieldComparator.
> compare(FieldComparator.java:92)
>
> at
> org.apache.solr.client.solrj.io.comp.FieldComparator.
> compare(FieldComparator.java:30)
>
> at
> org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:45)
>
> at
> org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:33)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> TupleWrapper.compareTo(CloudSolrStream.java:396)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream$
> TupleWrapper.compareTo(CloudSolrStream.java:381)
>
> at java.util.TreeMap.put(TreeMap.java:560)
>
> at java.util.TreeSet.add(TreeSet.java:255)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> read(CloudSolrStream.java:366)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> read(CloudSolrStream.java:353)
>
> at
>
> *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.
> scala:101)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent
> number:
> char=A,position=106596
> BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
>
> org.noggit.JSONParser$ParseException: missing exponent number:
> char=A,position=106596
> BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
> AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
>
> at org.noggit.JSONParser.err(JSONParser.java:356)
>
> at org.noggit.JSONParser.readExp(JSONParser.java:513)
>
> at org.noggit.JSONParser.readNumber(JSONParser.java:419)
>
> at org.noggit.JSONParser.next(JSONParser.java:845)
>
> at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
>
> at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
>
> at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
>
> at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
>
> at
> org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> next(JSONTupleStream.java:84)
>
> at
> org.apache.solr.client.solrj.io.stream.SolrStream.read(
> SolrStream.java:147)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.next(
> CloudSolrStream.java:413)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream._
> read(CloudSolrStream.java:365)
>
> at
> org.apache.solr.client.solrj.io.stream.CloudSolrStream.
> read(CloudSolrStream.java:353)
>
>
> Thanks!
>
> On Fri, Dec 16, 2016 at 11:45 PM, Reth RM  wrote:
>
> > If you could provide the json parse exception stack trace, it might help
> to
> > predict issue there.
> >
> >
> > On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
> > wrote:
> >
> > > Hi Joel,
> > >
> > > The only NON alpha-numeric characters I have in my data are '+' and
> '/'.
> > I
> > > don't have any backslashes.
> > >
> > > If the special characters was the issue, I should get the JSON parsing
> > > exceptions every time irrespective of the index size and irrespective
> of
> > > the available memory on the machine. That is not the case here. The
> > > streaming API successfully returns all the documents when the index
> size
> > is
> > > small and fits in the available memory. That's the reason I am
> confused.
> > >
> > > Thanks!
> > >
> > > On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> > > wrote:
> > >
> > > > The Streaming API may have been throwing exceptions because the JSON
> > > > special characters were not escaped. This was fixed in Solr 6.0.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <
> chetas.jo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am running Solr 5.5.0.
> > > > > It is a solrCloud of 50 nodes and I have the following config for
> all
> > > the
> > > > > collections.
> > > > > maxShardsperNode: 1
> > > > > replicationFactor: 1
> > > > >
> > > > > I was using Streaming API to get back results from Solr. It worked
> > fine
> > > > for
> > > > > a while until the index data size reached beyond 40 GB per shard
> > (i.e.
> > > > per
> > > > > node). It started throwing JSON parsing exceptions while reading
> the
> > > > > TupleStream data. FYI: I have other services (Yarn, Spark) deployed
> > on

Re: Solr on HDFS: Streaming API performance tuning

2016-12-17 Thread Chetas Joshi
Here is the stack trace.

java.lang.NullPointerException

at
org.apache.solr.client.solrj.io.comp.FieldComparator$2.compare(FieldComparator.java:85)

at
org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:92)

at
org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:30)

at
org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:45)

at
org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:33)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:396)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:381)

at java.util.TreeMap.put(TreeMap.java:560)

at java.util.TreeSet.add(TreeSet.java:255)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:366)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)

at

*.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.scala:101)

at java.lang.Thread.run(Thread.java:745)

16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent number:
char=A,position=106596
BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'

org.noggit.JSONParser$ParseException: missing exponent number:
char=A,position=106596
BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'

at org.noggit.JSONParser.err(JSONParser.java:356)

at org.noggit.JSONParser.readExp(JSONParser.java:513)

at org.noggit.JSONParser.readNumber(JSONParser.java:419)

at org.noggit.JSONParser.next(JSONParser.java:845)

at org.noggit.JSONParser.nextEvent(JSONParser.java:951)

at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)

at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)

at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)

at
org.apache.solr.client.solrj.io.stream.JSONTupleStream.next(JSONTupleStream.java:84)

at
org.apache.solr.client.solrj.io.stream.SolrStream.read(SolrStream.java:147)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.next(CloudSolrStream.java:413)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:365)

at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)


Thanks!

On Fri, Dec 16, 2016 at 11:45 PM, Reth RM  wrote:

> If you could provide the json parse exception stack trace, it might help to
> predict issue there.
>
>
> On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
> wrote:
>
> > Hi Joel,
> >
> > The only NON alpha-numeric characters I have in my data are '+' and '/'.
> I
> > don't have any backslashes.
> >
> > If the special characters was the issue, I should get the JSON parsing
> > exceptions every time irrespective of the index size and irrespective of
> > the available memory on the machine. That is not the case here. The
> > streaming API successfully returns all the documents when the index size
> is
> > small and fits in the available memory. That's the reason I am confused.
> >
> > Thanks!
> >
> > On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> > wrote:
> >
> > > The Streaming API may have been throwing exceptions because the JSON
> > > special characters were not escaped. This was fixed in Solr 6.0.
> > >
> > >
> > >
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I am running Solr 5.5.0.
> > > > It is a solrCloud of 50 nodes and I have the following config for all
> > the
> > > > collections.
> > > > maxShardsperNode: 1
> > > > replicationFactor: 1
> > > >
> > > > I was using Streaming API to get back results from Solr. It worked
> fine
> > > for
> > > > a while until the index data size reached beyond 40 GB per shard
> (i.e.
> > > per
> > > > node). It started throwing JSON parsing exceptions while reading the
> > > > TupleStream data. FYI: I have other services (Yarn, Spark) deployed
> on
> > > the
> > > > same boxes on which Solr shards are running. Spark jobs also use a
> lot
> > of
> > > > disk cache. So, the free available disk cache on the boxes vary a
> > > > lot depending upon what else is running on the box.
> > > >
> > > > Due to this issue, I moved to using the cursor approach and it works
> > fine
> > > > but as we all know it is way slower than the streaming approach.
> > > >
> > > > Currently the index size per shard is 80GB (The machine has 512 GB of
> > RAM
> > > > and being used by different services/programs: heap/off-heap and the
> > disk
> > > > cache requirements).
> > > >
> > > > When I have enough RAM (more th

Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Reth RM
If you could provide the JSON parse exception stack trace, it might help to
predict the issue there.


On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
wrote:

> Hi Joel,
>
> The only NON alpha-numeric characters I have in my data are '+' and '/'. I
> don't have any backslashes.
>
> If the special characters was the issue, I should get the JSON parsing
> exceptions every time irrespective of the index size and irrespective of
> the available memory on the machine. That is not the case here. The
> streaming API successfully returns all the documents when the index size is
> small and fits in the available memory. That's the reason I am confused.
>
> Thanks!
>
> On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> wrote:
>
> > The Streaming API may have been throwing exceptions because the JSON
> > special characters were not escaped. This was fixed in Solr 6.0.
> >
> >
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
> > wrote:
> >
> > > Hello,
> > >
> > > I am running Solr 5.5.0.
> > > It is a solrCloud of 50 nodes and I have the following config for all
> the
> > > collections.
> > > maxShardsperNode: 1
> > > replicationFactor: 1
> > >
> > > I was using Streaming API to get back results from Solr. It worked fine
> > for
> > > a while until the index data size reached beyond 40 GB per shard (i.e.
> > per
> > > node). It started throwing JSON parsing exceptions while reading the
> > > TupleStream data. FYI: I have other services (Yarn, Spark) deployed on
> > the
> > > same boxes on which Solr shards are running. Spark jobs also use a lot
> of
> > > disk cache. So, the free available disk cache on the boxes vary a
> > > lot depending upon what else is running on the box.
> > >
> > > Due to this issue, I moved to using the cursor approach and it works
> fine
> > > but as we all know it is way slower than the streaming approach.
> > >
> > > Currently the index size per shard is 80GB (The machine has 512 GB of
> RAM
> > > and being used by different services/programs: heap/off-heap and the
> disk
> > > cache requirements).
> > >
> > > When I have enough RAM (more than 80 GB so that all the index data
> could
> > > fit in memory) available on the machine, the streaming API succeeds
> > without
> > > running into any exceptions.
> > >
> > > Question:
> > > How different the index data caching mechanism (for HDFS) is for the
> > > Streaming API from the cursorMark approach?
> > > Why cursor works every time but streaming works only when there is a
> lot
> > of
> > > free disk cache?
> > >
> > > Thank you.
> > >
> >
>


Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hi Joel,

The only NON alpha-numeric characters I have in my data are '+' and '/'. I
don't have any backslashes.

If the special characters were the issue, I should get the JSON parsing
exceptions every time, irrespective of the index size and of the available
memory on the machine. That is not the case here. The streaming API
successfully returns all the documents when the index size is small and fits
in the available memory. That's the reason I am confused.

Thanks!

On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein  wrote:

> The Streaming API may have been throwing exceptions because the JSON
> special characters were not escaped. This was fixed in Solr 6.0.
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
> wrote:
>
> > Hello,
> >
> > I am running Solr 5.5.0.
> > It is a solrCloud of 50 nodes and I have the following config for all the
> > collections.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I was using Streaming API to get back results from Solr. It worked fine
> for
> > a while until the index data size reached beyond 40 GB per shard (i.e.
> per
> > node). It started throwing JSON parsing exceptions while reading the
> > TupleStream data. FYI: I have other services (Yarn, Spark) deployed on
> the
> > same boxes on which Solr shards are running. Spark jobs also use a lot of
> > disk cache. So, the free available disk cache on the boxes vary a
> > lot depending upon what else is running on the box.
> >
> > Due to this issue, I moved to using the cursor approach and it works fine
> > but as we all know it is way slower than the streaming approach.
> >
> > Currently the index size per shard is 80GB (The machine has 512 GB of RAM
> > and being used by different services/programs: heap/off-heap and the disk
> > cache requirements).
> >
> > When I have enough RAM (more than 80 GB so that all the index data could
> > fit in memory) available on the machine, the streaming API succeeds
> without
> > running into any exceptions.
> >
> > Question:
> > How different the index data caching mechanism (for HDFS) is for the
> > Streaming API from the cursorMark approach?
> > Why cursor works every time but streaming works only when there is a lot
> of
> > free disk cache?
> >
> > Thank you.
> >
>


Re: Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Joel Bernstein
The Streaming API may have been throwing exceptions because the JSON
special characters were not escaped. This was fixed in Solr 6.0.






Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
wrote:

> Hello,
>
> I am running Solr 5.5.0.
> It is a solrCloud of 50 nodes and I have the following config for all the
> collections.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I was using Streaming API to get back results from Solr. It worked fine for
> a while until the index data size reached beyond 40 GB per shard (i.e. per
> node). It started throwing JSON parsing exceptions while reading the
> TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
> same boxes on which Solr shards are running. Spark jobs also use a lot of
> disk cache. So, the free available disk cache on the boxes vary a
> lot depending upon what else is running on the box.
>
> Due to this issue, I moved to using the cursor approach and it works fine
> but as we all know it is way slower than the streaming approach.
>
> Currently the index size per shard is 80GB (The machine has 512 GB of RAM
> and being used by different services/programs: heap/off-heap and the disk
> cache requirements).
>
> When I have enough RAM (more than 80 GB so that all the index data could
> fit in memory) available on the machine, the streaming API succeeds without
> running into any exceptions.
>
> Question:
> How different the index data caching mechanism (for HDFS) is for the
> Streaming API from the cursorMark approach?
> Why cursor works every time but streaming works only when there is a lot of
> free disk cache?
>
> Thank you.
>


Solr on HDFS: Streaming API performance tuning

2016-12-16 Thread Chetas Joshi
Hello,

I am running Solr 5.5.0.
It is a solrCloud of 50 nodes and I have the following config for all the
collections.
maxShardsperNode: 1
replicationFactor: 1

I was using Streaming API to get back results from Solr. It worked fine for
a while until the index data size reached beyond 40 GB per shard (i.e. per
node). It started throwing JSON parsing exceptions while reading the
TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
same boxes on which Solr shards are running. Spark jobs also use a lot of
disk cache. So, the free available disk cache on the boxes varies a lot
depending upon what else is running on the box.

Due to this issue, I moved to using the cursor approach and it works fine
but as we all know it is way slower than the streaming approach.
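
For comparison, this is roughly what the cursorMark flavour of the cursor approach looks
like in SolrJ (a sketch only: zkHost, collection, and field names are placeholders, and it
assumes uuid is the uniqueKey, since cursorMark requires the sort to end on the uniqueKey
as a tie-breaker):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkPaging {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
    client.setDefaultCollection("collection1");

    SolrQuery query = new SolrQuery("*:*");
    query.setRows(10000);
    query.setFields("uuid", "timestamp");
    query.setSort(SolrQuery.SortClause.asc("timestamp"));
    query.addSort(SolrQuery.SortClause.asc("uuid"));   // uniqueKey tie-breaker

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(query);
      for (SolrDocument doc : rsp.getResults()) {
        System.out.println(doc.getFieldValue("uuid"));
      }
      String nextCursorMark = rsp.getNextCursorMark();
      done = cursorMark.equals(nextCursorMark);  // same mark twice means no more results
      cursorMark = nextCursorMark;
    }
    client.close();
  }
}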

Currently the index size per shard is 80GB (The machine has 512 GB of RAM
and being used by different services/programs: heap/off-heap and the disk
cache requirements).

When I have enough RAM (more than 80 GB so that all the index data could
fit in memory) available on the machine, the streaming API succeeds without
running into any exceptions.

Question:
How different is the index data caching mechanism (for HDFS) for the
Streaming API compared to the cursorMark approach?
Why does the cursor approach work every time, while streaming works only when
there is a lot of free disk cache?

Thank you.


Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:58 AM, Chetas Joshi wrote:
> How different the index data caching mechanism is for the Streaming
> API from the cursor approach?

Solr and Lucene do not handle that caching.  Systems external to Solr
(like the OS, or HDFS) handle the caching.  The cache effectiveness will
be a combination of the cache size, overall data size, and the data
access patterns of the application.  I do not know enough to tell you
how the cursorMark feature and the streaming API work when they access
the index data.  I would imagine them to be pretty similar, but cannot
be sure about that.

Thanks,
Shawn



Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Chetas Joshi
Thank you everyone. I would add nodes to the SolrCloud and split the shards.

Shawn,

Thank you for explaining why putting index data on local file system could
be a better idea than using HDFS. I need to find out how HDFS caches the
index files in a resource constrained environment.

I would also like to add that when I try the Streaming API instead of using
the cursor approach, it starts running into JSON parsing exceptions when my
nodes (running Solr shards) don't have enough RAM to fit the entire index
into memory. FYI: I have other services (Yarn, Spark) deployed on the same
boxes as well. Spark jobs also use a lot of disk cache.
When I have enough RAM (more than 70 GB so that all the index data could
fit in memory), the streaming API succeeds without running into any
exceptions. How different is the index data caching mechanism for the
Streaming API compared to the cursor approach?

Thanks!



On Fri, Dec 16, 2016 at 6:52 AM, Shawn Heisey  wrote:

> On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> > the following config.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
>
> Query times will increase as the index size increases, but significant
> jumps in the query time may be an indication of a performance problem.
> Performance problems are usually caused by insufficient resources,
> memory in particular.
>
> With HDFS, I am honestly not sure *where* the cache memory is needed.  I
> would assume that it's needed on the HDFS hosts, that a lot of spare
> memory on the Solr (HDFS client) hosts probably won't make much
> difference.  I could be wrong -- I have no idea what kind of caching
> HDFS does.  If the HDFS client can cache data, then you probably would
> want extra memory on the Solr machines.
>
> > I am using cursor approach (/export handler) using SolrJ client to get
> back
> > results from Solr. All the fields I am querying on and all the fields
> that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
>
> If actual disk access is required to satisfy a query, Solr is going to
> be slow.  Caching is absolutely required for good performance.  If your
> query times are really long but used to be short, chances are that your
> index size has exceeded your system's ability to cache it effectively.
>
> One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
> the sustained transfer rate of a single modern SATA magnetic disk, so if
> the data has to traverse a gigabit network, it probably will be nearly
> as slow as it would be if it were coming from a single disk.  Having a
> 10gig network for your storage is probably a good idea ... but current
> fast memory chips can leave 10gig in the dust, so if the data can come
> from cache and the chips are new enough, then it can be faster than
> network storage.
>
> Because the network can be a potential bottleneck, I strongly recommend
> putting index data on local disks.  If you have enough memory, the disk
> doesn't even need to be super-fast.
>
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
>
> Caching the files on the disk is not handled by Solr, so Solr won't wait
> for the entire index to be cached unless the underlying storage waits
> for some reason.  The caching is usually handled by the OS.  For HDFS,
> it might be handled by a combination of the OS and Hadoop, but I don't
> know enough about HDFS to comment.  Solr makes a request for the parts
> of the index files that it needs to satisfy the request.  If the
> underlying system is capable of caching the data, if that feature is
> enabled, and if there's memory available for that purpose, then it gets
> cached.
>
> Thanks,
> Shawn
>
>


Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).

Query times will increase as the index size increases, but significant
jumps in the query time may be an indication of a performance problem. 
Performance problems are usually caused by insufficient resources,
memory in particular.

With HDFS, I am honestly not sure *where* the cache memory is needed.  I
would assume that it's needed on the HDFS hosts, that a lot of spare
memory on the Solr (HDFS client) hosts probably won't make much
difference.  I could be wrong -- I have no idea what kind of caching
HDFS does.  If the HDFS client can cache data, then you probably would
want extra memory on the Solr machines.

> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?

If actual disk access is required to satisfy a query, Solr is going to
be slow.  Caching is absolutely required for good performance.  If your
query times are really long but used to be short, chances are that your
index size has exceeded your system's ability to cache it effectively.

One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
the sustained transfer rate of a single modern SATA magnetic disk, so if
the data has to traverse a gigabit network, it probably will be nearly
as slow as it would be if it were coming from a single disk.  Having a
10gig network for your storage is probably a good idea ... but current
fast memory chips can leave 10gig in the dust, so if the data can come
from cache and the chips are new enough, then it can be faster than
network storage.

Because the network can be a potential bottleneck, I strongly recommend
putting index data on local disks.  If you have enough memory, the disk
doesn't even need to be super-fast.

> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?

Caching the files on the disk is not handled by Solr, so Solr won't wait
for the entire index to be cached unless the underlying storage waits
for some reason.  The caching is usually handled by the OS.  For HDFS,
it might be handled by a combination of the OS and Hadoop, but I don't
know enough about HDFS to comment.  Solr makes a request for the parts
of the index files that it needs to satisfy the request.  If the
underlying system is capable of caching the data, if that feature is
enabled, and if there's memory available for that purpose, then it gets
cached.

Thanks,
Shawn



Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Piyush Kunal
I think 70GB is too huge for a shard.
How much memory does the system have?
In case Solr does not have sufficient memory to load the indexes, it will
use only the amount of memory defined in your Solr caches.

Although you are on HDFS, Solr performance will be really bad if it has to do
disk IO at query time.

The best option for you is to shard it into at least 8-10 nodes and create
appropriate replicas according to your read traffic.

Regards,
Piyush

On Fri, Dec 16, 2016 at 12:15 PM, Reth RM  wrote:

> I think the shard index size is huge and should be split.
>
> On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi 
> wrote:
>
> > Hi everyone,
> >
> > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> > the following config.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
> >
> > I am using cursor approach (/export handler) using SolrJ client to get
> back
> > results from Solr. All the fields I am querying on and all the fields
> that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
> >
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
> >
> > Thnaks!
> >
>


Re: Solr on HDFS: increase in query time with increase in data

2016-12-15 Thread Reth RM
I think the shard index size is huge and should be split.
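
If the shard is to be split, the Collections API SPLITSHARD action is the usual mechanism.
A bare-bones sketch of invoking it over HTTP follows (host, collection, and shard names are
hypothetical; splitting a large shard can run for a long time, and the call also accepts an
async request id):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SplitShardCall {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr node, collection, and shard.
    String url = "http://solr-host:8983/solr/admin/collections"
        + "?action=SPLITSHARD&collection=collection1&shard=shard1&wt=json";

    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON response from the Collections API
      }
    }
  }
}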

On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi 
wrote:

> Hi everyone,
>
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).
>
> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?
>
> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?
>
> Thnaks!
>


Solr on HDFS: increase in query time with increase in data

2016-12-14 Thread Chetas Joshi
Hi everyone,

I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
the following config.
maxShardsperNode: 1
replicationFactor: 1

I have been ingesting data into Solr for the last 3 months. With increase
in data, I am observing increase in the query time. Currently the size of
my indices is 70 GB per shard (i.e. per node).

I am using cursor approach (/export handler) using SolrJ client to get back
results from Solr. All the fields I am querying on and all the fields that
I get back from Solr are indexed and have docValues enabled as well. What
could be the reason behind increase in query time?

Has this got something to do with the OS disk cache that is used for
loading the Solr indices? When a query is fired, will Solr wait for all
(70GB) of disk cache being available so that it can load the index file?

Thanks!


Re: Solr on HDFS: adding a shard replica

2016-09-14 Thread Erick Erickson
The core_node name is largely irrelevant; you should see more descriptive
names in the state.json file, like collection1_shard1_replica1.
You happen to see 19 because you have only one replica per shard.

Exactly how are you creating the replica? What version of Solr? If
you're using the "core admin" UI, it's tricky to get right. I'd
strongly recommend using the "collections API, ADDREPLICA" command,
see: 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica
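
A rough SolrJ sketch of the ADDREPLICA call (collection, shard, and ZooKeeper addresses are
placeholders; the setter-style request shown here matches the 5.x-era SolrJ API, while newer
releases expose static factory methods instead). Once the new replica is registered it
recovers its index from the shard leader, so the leader has to be up for the data to be copied.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.CollectionAdminResponse;

public class AddReplicaCall {
  public static void main(String[] args) throws Exception {
    // Hypothetical ZooKeeper ensemble, collection, and shard names.
    try (CloudSolrClient client =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
      CollectionAdminRequest.AddReplica addReplica =
          new CollectionAdminRequest.AddReplica();
      addReplica.setCollectionName("collection1");
      addReplica.setShardName("shard1");

      CollectionAdminResponse rsp = addReplica.process(client);
      System.out.println("ADDREPLICA status: " + rsp.getStatus());
    }
  }
}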

Best,
Erick

On Tue, Sep 13, 2016 at 7:11 PM, Chetas Joshi  wrote:
> Is this happening because I have set replicationFactor=1?
> So even if I manually add replica for the shard that's down, it will just
> create a dataDir but would not copy any of the data into the dataDir?
>
> On Tue, Sep 13, 2016 at 6:07 PM, Chetas Joshi 
> wrote:
>
>> Hi,
>>
>> I just started experimenting with solr cloud.
>>
>> I have a solr cloud of 20 nodes. I have one collection with 18 shards
>> running on 18 different nodes with replication factor=1.
>>
>> When one of my shards goes down, I create a replica using the Solr UI. On
>> HDFS I see a core getting added. But the data (index table and tlog)
>> information does not get copied over to that directory. For example, on
>> HDFS I have
>>
>> /solr/collection/core_node_1/data/index
>> /solr/collection/core_node_1/data/tlog
>>
>> when I create a replica of a shard, it creates
>>
>> /solr/collection/core_node_19/data/index
>> /solr/collection/core_node_19/data/tlog
>>
>> (core_node_19 as I already have 18 shards for the collection). The issue
>> is both my folders  core_node_19/data/index and core_node_19/data/tlog are
>> empty. Data does not get copied over from core_node_1/data/index and
>> core_node_1/data/tlog.
>>
>> I need to remove core_node_1 and just keep core_node_19 (the replica). Why
>> the data is not getting copied over? Do I need to manually move all the
>> data from one folder to the other?
>>
>> Thank you,
>> Chetas.
>>
>>


Re: Solr on HDFS: adding a shard replica

2016-09-13 Thread Chetas Joshi
Is this happening because I have set replicationFactor=1?
So even if I manually add replica for the shard that's down, it will just
create a dataDir but would not copy any of the data into the dataDir?

On Tue, Sep 13, 2016 at 6:07 PM, Chetas Joshi 
wrote:

> Hi,
>
> I just started experimenting with solr cloud.
>
> I have a solr cloud of 20 nodes. I have one collection with 18 shards
> running on 18 different nodes with replication factor=1.
>
> When one of my shards goes down, I create a replica using the Solr UI. On
> HDFS I see a core getting added. But the data (index table and tlog)
> information does not get copied over to that directory. For example, on
> HDFS I have
>
> /solr/collection/core_node_1/data/index
> /solr/collection/core_node_1/data/tlog
>
> when I create a replica of a shard, it creates
>
> /solr/collection/core_node_19/data/index
> /solr/collection/core_node_19/data/tlog
>
> (core_node_19 as I already have 18 shards for the collection). The issue
> is both my folders  core_node_19/data/index and core_node_19/data/tlog are
> empty. Data does not get copied over from core_node_1/data/index and
> core_node_1/data/tlog.
>
> I need to remove core_node_1 and just keep core_node_19 (the replica). Why
> the data is not getting copied over? Do I need to manually move all the
> data from one folder to the other?
>
> Thank you,
> Chetas.
>
>


Solr on HDFS: adding a shard replica

2016-09-13 Thread Chetas Joshi
Hi,

I just started experimenting with solr cloud.

I have a solr cloud of 20 nodes. I have one collection with 18 shards
running on 18 different nodes with replication factor=1.

When one of my shards goes down, I create a replica using the Solr UI. On
HDFS I see a core getting added. But the data (index table and tlog)
information does not get copied over to that directory. For example, on
HDFS I have

/solr/collection/core_node_1/data/index
/solr/collection/core_node_1/data/tlog

when I create a replica of a shard, it creates

/solr/collection/core_node_19/data/index
/solr/collection/core_node_19/data/tlog

(core_node_19 as I already have 18 shards for the collection). The issue is
both my folders  core_node_19/data/index and core_node_19/data/tlog are
empty. Data does not get copied over from core_node_1/data/index and
core_node_1/data/tlog.

I need to remove core_node_1 and just keep core_node_19 (the replica). Why
the data is not getting copied over? Do I need to manually move all the
data from one folder to the other?

Thank you,
Chetas.


Re: Solr on HDFS in a Hadoop cluster

2015-01-08 Thread Charles VALLEE
Thanks a lot Otis,

While reading the SolrCloud documentation to understand how SolrCloud 
could run on HDFS, I got confused with leader, replica, "non-replica" 
shards, core, index, and collections.
Once it is specified that one cannot add shards, then that one can add 
replica-only shards, then that last "Shard Splitting" paragraph states 
that something changed starting with Solr 4.3.
But it doesn't states that splitting shards can end in a new non-replica 
shard, in a just added node, thus increasing the amount of storage 
available to the index / collection. It states that "split action 
effectively makes two copies of the data as new shards" instead, which 
tastes a lot like replica style shards.
So does it?
Could there be some sort of tutorial describing how to add available 
storage capacity for index / collection, thus adding a node / shard - core 
that one can send new documents to be indexed? (of course, load-balancing 
would be trigered, so it looks like documents would be added to shards out 
of a set of nodes).
Thanks,



 
Charles VALLEE
Centre de compétence Big data
EDF – DSP - CSP IT-O
DATACENTER - Expertise en Energie Informatique (EEI)
32 avenue Pablo Picasso
92000 Nanterre
 
charles.val...@edf.fr
Tél. : + (0) 1 78 66 69 81

A simple gesture for the environment: only print this message if you really need to.




De :otis.gospodne...@gmail.com
A : solr-user@lucene.apache.org
Date :  06/01/2015 18:55
Objet : Re: Solr on HDFS in a Hadoop cluster



Oh, and https://issues.apache.org/jira/browse/SOLR-6743

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi Charles,
>
> See http://search-lucene.com/?q=solr+hdfs and
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE 
> wrote:
>
>> I am considering using *Solr* to extend *Hortonworks Data Platform*
>> capabilities to search.
>>
>> - I found tutorials to index documents into a Solr instance from 
*HDFS*,
>> but I guess this solution would require a Solr cluster distinct to the
>> Hadoop cluster. Is it possible to have a Solr integrated into the 
Hadoop
>> cluster instead? - *With the index stored in HDFS?*
>>
>> - Where would the processing take place (could it be handed down to
>> Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to
>> integrate with *Yarn*?
>>
>> - What about *SolrCloud*: what does it bring regarding Hadoop based
>> use-cases? Does it stand for a Solr-only cluster?
>>
>> - Well, if that could lead to something working with a roles-based
>> authorization-compliant *Banana*, it would be Christmass again!
>>
>> Thanks a lot for any help!
>>
>> Charles
>>






Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Oh, and https://issues.apache.org/jira/browse/SOLR-6743

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi Charles,
>
> See http://search-lucene.com/?q=solr+hdfs and
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE 
> wrote:
>
>> I am considering using *Solr* to extend *Hortonworks Data Platform*
>> capabilities to search.
>>
>> - I found tutorials to index documents into a Solr instance from *HDFS*,
>> but I guess this solution would require a Solr cluster distinct to the
>> Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
>> cluster instead? - *With the index stored in HDFS?*
>>
>> - Where would the processing take place (could it be handed down to
>> Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to
>> integrate with *Yarn*?
>>
>> - What about *SolrCloud*: what does it bring regarding Hadoop based
>> use-cases? Does it stand for a Solr-only cluster?
>>
>> - Well, if that could lead to something working with a roles-based
>> authorization-compliant *Banana*, it would be Christmass again!
>>
>> Thanks a lot for any help!
>>
>> Charles
>>
>>
>


Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Hi Charles,

See http://search-lucene.com/?q=solr+hdfs and
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE 
wrote:

> I am considering using *Solr* to extend *Hortonworks Data Platform*
> capabilities to search.
>
> - I found tutorials to index documents into a Solr instance from *HDFS*,
> but I guess this solution would require a Solr cluster distinct to the
> Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
> cluster instead? - *With the index stored in HDFS?*
>
> - Where would the processing take place (could it be handed down to
> Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to
> integrate with *Yarn*?
>
> - What about *SolrCloud*: what does it bring regarding Hadoop based
> use-cases? Does it stand for a Solr-only cluster?
>
> - Well, if that could lead to something working with a roles-based
> authorization-compliant *Banana*, it would be Christmass again!
>
> Thanks a lot for any help!
>
> Charles
>
>


Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Charles VALLEE
I am considering using Solr to extend Hortonworks Data Platform 
capabilities to search.

- I found tutorials to index documents into a Solr instance from HDFS, but
I guess this solution would require a Solr cluster distinct from the Hadoop
cluster. Is it possible to have Solr integrated into the Hadoop cluster
instead - with the index stored in HDFS?
- Where would the processing take place (could it be handed down to
Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
integrate with Yarn?
- What about SolrCloud: what does it bring regarding Hadoop-based
use-cases? Does it stand for a Solr-only cluster?
- Well, if that could lead to something working with a roles-based
authorization-compliant Banana, it would be Christmas again!
Thanks a lot for any help!
Charles




Re: SOLR on hdfs

2014-07-08 Thread shlash
Hi all,
I am new to Solr and HDFS. I am trying to index text content
extracted from binary files like PDF, MS Office, etc., which are stored on
HDFS (single node). So far I have Solr running on HDFS and have created the
core, but I couldn't send the files to Solr for indexing.
Can someone please help me do that?

Thanks
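
One common way to push binary documents (PDF, Office files, etc.) into Solr is the extracting
request handler (Solr Cell), assuming it is enabled in solrconfig.xml. A rough SolrJ sketch
with hypothetical core URL, file path, and field names follows; note the file has to be
readable from the local filesystem, so a file that only lives in HDFS would first need to be
pulled down (or streamed) by your own code. The single-argument HttpSolrClient constructor is
the 5.x form.

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexBinaryFile {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr core URL.
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");

    // Send a PDF to the extracting request handler (Tika does the text extraction).
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("/data/docs/report.pdf"), "application/pdf");
    req.setParam("literal.id", "report-1");   // uniqueKey value for this document
    req.setParam("fmap.content", "text");     // map extracted body to a hypothetical "text" field
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    client.request(req);
    client.close();
  }
}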



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-on-hdfs-tp4045128p4146049.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR on hdfs

2013-03-07 Thread Otis Gospodnetic
Hi Joseph,

I believe Nutch can index into Solr/SolrCloud just fine.  Sounds like that
is the approach you should take.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Mar 7, 2013 at 12:10 AM, Joseph Lim  wrote:

> Hi Amit,
>
> Currently I am designing a Learning Management System where it is based on
> Hadoop and hbase . Right now I want to integrate nutch with solr in it as
> part of crawler module, so that users will only be able to search relevant
> documents from specific source. And since crawling and indexing takes so
> much of the time, (might be 5 to 6 hours ~ 5gb) so hope that if there is
> anything happen to the server, there will be replicates to back it up.
>
> I just saw what solrcloud can do but will need to check out if nutch is
> able to work with it. Not knowing of other constraints I will encounter, so
> was asking if I can just output the solr dir into a hdfs in the first
> place.
>
> Cheers.
>
> On Thursday, March 7, 2013, Amit Nithian wrote:
>
> > Joseph,
> >
> > Doing what Otis said will do literally what you want which is copying the
> > index to HDFS. It's no different than copying it to a different machine
> > which btw is what Solr's master/slave replication scheme does.
> > Alternatively, I think people are starting to setup new Solr instances
> with
> > SolrCloud which doesn't have the concept of master/slave but rather a
> > series of nodes with the option of having replicas (what I believe to be
> > backup nodes) so that you have the redundancy you want.
> >
> > Honestly HDFS in the way that you are looking for is probably no
> different
> > than storing  your solr index in a RAIDed storage format but I don't
> > pretend to know much about RAID arrays.
> >
> > What exactly are you trying to achieve from a systems perspective? Why do
> > you want Hadoop in the mix here and how does copying the index to HDFS
> help
> > you? If SolrCloud seems complicated try just setting up a simple
> > master/slave replication scheme for that's really easy.
> >
> > Cheers
> > Amit
> >
> >
> > On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim  wrote:
> >
> > > Hi Amit,
> > >
> > > so you mean that if I just want to get redundancy for solr in hdfs, the
> > > only best way to do it is to as per what Otis suggested using the
> > following
> > > command
> > >
> > > hadoop fs -copyFromLocal  URI
> > >
> > > Ok let me try out solrcloud as I will need to make sure it works well
> > with
> > > nutch too..
> > >
> > > Thanks for the help..
> > >
> > >
> > > On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian 
> wrote:
> > >
> > > > Why wouldn't SolrCloud help you here? You can setup shards and
> replicas
> > > etc
> > > > to have redundancy b/c HDFS isn't designed to serve real time queries
> > as
> > > > far as I understand. If you are using HDFS as a backup mechanism to
> me
> > > > you'd be better served having multiple slaves tethered to a master
> (in
> > a
> > > > non-cloud environment) or setup SolrCloud either option would give
> you
> > > more
> > > > redundancy than copying an index to HDFS.
> > > >
> > > > - Amit
> > > >
> > > >
> > > > On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim 
> wrote:
> > > >
> > > > > Hi Upayavira,
> > > > >
> > > > > sure, let me explain. I am setting up Nutch and SOLR in hadoop
> > > > environment.
> > > > > Since I am using hdfs, in the event if there is any crashes to the
> > > > > localhost(running solr), i will still have the shards of data being
> > > > stored
> > > > > in hdfs.
> > > > >
> > > > > Thanks you so much =)
> > > > >
> > > > > On Thu, Mar 7, 2013 at 1:19 AM, Upayavira  wrote:
> > > > >
> > > > > > What are you actually trying to achieve? If you can share what
> you
> > > are
> > > > > > trying to achieve maybe folks can help you find the right way to
> do
> > > it.
> > > > > >
> > > > > > Upayavira
> > > > > >
> > > > > > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> > > > > > > Hello Otis ,
> > > > > > >
> > > > > > > Is there any configuration where it will index into hdfs
> instead?
> > > > > > >
> > > > > > > I tried crawlzilla and  lily but I hope to update specific
> > package
> > > > such
> > > > > > > as
> > > > > > > Hadoop only or nutch only when there are updates.
> > > > > > >
> > > > > > > That's y would prefer to install separately .
> > > > > > >
> > > > > > > Thanks so much. Looking forward for your reply.
> > > > > > >
> > > > > > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> > > > > > >
> > > > > > > > Hello Joseph,
> > > > > > > >
> > > > > > > > You can certainly put them there, as in:
> > > > > > > >   hadoop fs -copyFromLocal  URI
> > > > > > > >
> > > > > > > > But searching such an index will be slow.
> > > > > > > > See also: http://katta.sourceforge.net/
> > > > > > > >
> > > > > > > > Otis
> > > > > > > > --
> > > > > > > > Solr & ElasticSearch Support
> > > > > > > > http://sematext.com/
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Mar 

Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi Amit,

Currently I am designing a Learning Management System that is based on
Hadoop and HBase. Right now I want to integrate Nutch with Solr in it as
part of the crawler module, so that users will only be able to search relevant
documents from specific sources. And since crawling and indexing take so
much time (maybe 5 to 6 hours for ~5 GB), I hope that if anything happens
to the server, there will be replicas to back it up.

I just saw what SolrCloud can do but will need to check whether Nutch is
able to work with it. Not knowing what other constraints I will encounter, I
was asking if I can just output the Solr dir into HDFS in the first place.
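
For the narrow goal of copying an already-built index directory into HDFS (the
hadoop fs -copyFromLocal suggestion quoted below), here is a small sketch using the
Hadoop FileSystem API; the NameNode URI and paths are purely hypothetical, and this
only stores a copy of the files; it does not make Solr serve queries out of HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyIndexToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode; with core-site.xml on the classpath this can be omitted.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    FileSystem fs = FileSystem.get(conf);
    // Hypothetical paths: a locally built Solr core index and its HDFS destination.
    Path local = new Path("/var/solr/data/collection1/data/index");
    Path remote = new Path("/backups/solr/collection1/index");
    fs.copyFromLocalFile(local, remote);   // same effect as `hadoop fs -copyFromLocal`
    fs.close();
  }
}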

Cheers.

On Thursday, March 7, 2013, Amit Nithian wrote:

> Joseph,
>
> Doing what Otis said will do literally what you want which is copying the
> index to HDFS. It's no different than copying it to a different machine
> which btw is what Solr's master/slave replication scheme does.
> Alternatively, I think people are starting to setup new Solr instances with
> SolrCloud which doesn't have the concept of master/slave but rather a
> series of nodes with the option of having replicas (what I believe to be
> backup nodes) so that you have the redundancy you want.
>
> Honestly HDFS in the way that you are looking for is probably no different
> than storing  your solr index in a RAIDed storage format but I don't
> pretend to know much about RAID arrays.
>
> What exactly are you trying to achieve from a systems perspective? Why do
> you want Hadoop in the mix here and how does copying the index to HDFS help
> you? If SolrCloud seems complicated try just setting up a simple
> master/slave replication scheme for that's really easy.
>
> Cheers
> Amit
>
>
> On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim  wrote:
>
> > Hi Amit,
> >
> > so you mean that if I just want to get redundancy for solr in hdfs, the
> > only best way to do it is to as per what Otis suggested using the
> following
> > command
> >
> > hadoop fs -copyFromLocal  URI
> >
> > Ok let me try out solrcloud as I will need to make sure it works well
> with
> > nutch too..
> >
> > Thanks for the help..
> >
> >
> > On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian  wrote:
> >
> > > Why wouldn't SolrCloud help you here? You can setup shards and replicas
> > etc
> > > to have redundancy b/c HDFS isn't designed to serve real time queries
> as
> > > far as I understand. If you are using HDFS as a backup mechanism to me
> > > you'd be better served having multiple slaves tethered to a master (in
> a
> > > non-cloud environment) or setup SolrCloud either option would give you
> > more
> > > redundancy than copying an index to HDFS.
> > >
> > > - Amit
> > >
> > >
> > > On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim  wrote:
> > >
> > > > Hi Upayavira,
> > > >
> > > > sure, let me explain. I am setting up Nutch and SOLR in hadoop
> > > environment.
> > > > Since I am using hdfs, in the event if there is any crashes to the
> > > > localhost(running solr), i will still have the shards of data being
> > > stored
> > > > in hdfs.
> > > >
> > > > Thanks you so much =)
> > > >
> > > > On Thu, Mar 7, 2013 at 1:19 AM, Upayavira  wrote:
> > > >
> > > > > What are you actually trying to achieve? If you can share what you
> > are
> > > > > trying to achieve maybe folks can help you find the right way to do
> > it.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> > > > > > Hello Otis ,
> > > > > >
> > > > > > Is there any configuration where it will index into hdfs instead?
> > > > > >
> > > > > > I tried crawlzilla and  lily but I hope to update specific
> package
> > > such
> > > > > > as
> > > > > > Hadoop only or nutch only when there are updates.
> > > > > >
> > > > > > That's y would prefer to install separately .
> > > > > >
> > > > > > Thanks so much. Looking forward for your reply.
> > > > > >
> > > > > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> > > > > >
> > > > > > > Hello Joseph,
> > > > > > >
> > > > > > > You can certainly put them there, as in:
> > > > > > >   hadoop fs -copyFromLocal  URI
> > > > > > >
> > > > > > > But searching such an index will be slow.
> > > > > > > See also: http://katta.sourceforge.net/
> > > > > > >
> > > > > > > Otis
> > > > > > > --
> > > > > > > Solr & ElasticSearch Support
> > > > > > > http://sematext.com/
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim  > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > Would like to know how can i put the indexed solr shards into
> > > hdfs?
> > > > > > > >
> > > > > > > > Thanks..
> > > > > > > >
> > > > > > > > Joseph
> > > > *Joseph*
> >
>


-- 
Best Regards,
*Joseph*


Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Joseph,

Doing what Otis said will do literally what you want which is copying the
index to HDFS. It's no different than copying it to a different machine
which btw is what Solr's master/slave replication scheme does.
Alternatively, I think people are starting to setup new Solr instances with
SolrCloud which doesn't have the concept of master/slave but rather a
series of nodes with the option of having replicas (what I believe to be
backup nodes) so that you have the redundancy you want.

Honestly, HDFS used in the way you are looking for is probably no different
from storing your Solr index on RAIDed storage, but I don't pretend to know
much about RAID arrays.

What exactly are you trying to achieve from a systems perspective? Why do
you want Hadoop in the mix here, and how does copying the index to HDFS help
you? If SolrCloud seems complicated, try just setting up a simple
master/slave replication scheme, as that's really easy.

Cheers
Amit


On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim  wrote:

> Hi Amit,
>
> so you mean that if I just want to get redundancy for solr in hdfs, the
> only best way to do it is to as per what Otis suggested using the following
> command
>
> hadoop fs -copyFromLocal  URI
>
> Ok let me try out solrcloud as I will need to make sure it works well with
> nutch too..
>
> Thanks for the help..
>
>
> On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian  wrote:
>
> > Why wouldn't SolrCloud help you here? You can setup shards and replicas
> etc
> > to have redundancy b/c HDFS isn't designed to serve real time queries as
> > far as I understand. If you are using HDFS as a backup mechanism to me
> > you'd be better served having multiple slaves tethered to a master (in a
> > non-cloud environment) or setup SolrCloud either option would give you
> more
> > redundancy than copying an index to HDFS.
> >
> > - Amit
> >
> >
> > On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim  wrote:
> >
> > > Hi Upayavira,
> > >
> > > sure, let me explain. I am setting up Nutch and SOLR in hadoop
> > environment.
> > > Since I am using hdfs, in the event if there is any crashes to the
> > > localhost(running solr), i will still have the shards of data being
> > stored
> > > in hdfs.
> > >
> > > Thanks you so much =)
> > >
> > > On Thu, Mar 7, 2013 at 1:19 AM, Upayavira  wrote:
> > >
> > > > What are you actually trying to achieve? If you can share what you
> are
> > > > trying to achieve maybe folks can help you find the right way to do
> it.
> > > >
> > > > Upayavira
> > > >
> > > > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> > > > > Hello Otis ,
> > > > >
> > > > > Is there any configuration where it will index into hdfs instead?
> > > > >
> > > > > I tried crawlzilla and  lily but I hope to update specific package
> > such
> > > > > as
> > > > > Hadoop only or nutch only when there are updates.
> > > > >
> > > > > That's y would prefer to install separately .
> > > > >
> > > > > Thanks so much. Looking forward for your reply.
> > > > >
> > > > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> > > > >
> > > > > > Hello Joseph,
> > > > > >
> > > > > > You can certainly put them there, as in:
> > > > > >   hadoop fs -copyFromLocal  URI
> > > > > >
> > > > > > But searching such an index will be slow.
> > > > > > See also: http://katta.sourceforge.net/
> > > > > >
> > > > > > Otis
> > > > > > --
> > > > > > Solr & ElasticSearch Support
> > > > > > http://sematext.com/
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim  > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > Would like to know how can i put the indexed solr shards into
> > hdfs?
> > > > > > >
> > > > > > > Thanks..
> > > > > > >
> > > > > > > Joseph
> > > > > > > On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" <
> > > > otis.gospodne...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Joseph,
> > > > > > > >
> > > > > > > > What exactly are you looking to to?
> > > > > > > > See http://incubator.apache.org/blur/
> > > > > > > >
> > > > > > > > Otis
> > > > > > > > --
> > > > > > > > Solr & ElasticSearch Support
> > > > > > > > http://sematext.com/
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim <
> ysli...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi I am running hadoop distributed file system, how do I
> put
> > my
> > > > > > output
> > > > > > > of
> > > > > > > > > the solr dir into hdfs automatically?
> > > > > > > > >
> > > > > > > > > Thanks so much..
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards,
> > > > > > > > > *Joseph*
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > > *Joseph*
> > > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > *Joseph*
> > >
> >
>
>
>
> --
> Best Regards,
> *Joseph*
>


Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi Amit,

so you mean that if I just want redundancy for Solr on HDFS, the best way
to do it is what Otis suggested, using the following command:

hadoop fs -copyFromLocal  URI

OK, let me try out SolrCloud, as I will need to make sure it works well with
Nutch too.

Thanks for the help.


On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian  wrote:

> Why wouldn't SolrCloud help you here? You can setup shards and replicas etc
> to have redundancy b/c HDFS isn't designed to serve real time queries as
> far as I understand. If you are using HDFS as a backup mechanism to me
> you'd be better served having multiple slaves tethered to a master (in a
> non-cloud environment) or setup SolrCloud either option would give you more
> redundancy than copying an index to HDFS.
>
> - Amit
>
>
> On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim  wrote:
>
> > Hi Upayavira,
> >
> > sure, let me explain. I am setting up Nutch and SOLR in hadoop
> environment.
> > Since I am using hdfs, in the event if there is any crashes to the
> > localhost(running solr), i will still have the shards of data being
> stored
> > in hdfs.
> >
> > Thanks you so much =)
> >
> > On Thu, Mar 7, 2013 at 1:19 AM, Upayavira  wrote:
> >
> > > What are you actually trying to achieve? If you can share what you are
> > > trying to achieve maybe folks can help you find the right way to do it.
> > >
> > > Upayavira
> > >
> > > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> > > > Hello Otis ,
> > > >
> > > > Is there any configuration where it will index into hdfs instead?
> > > >
> > > > I tried crawlzilla and  lily but I hope to update specific package
> such
> > > > as
> > > > Hadoop only or nutch only when there are updates.
> > > >
> > > > That's y would prefer to install separately .
> > > >
> > > > Thanks so much. Looking forward for your reply.
> > > >
> > > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> > > >
> > > > > Hello Joseph,
> > > > >
> > > > > You can certainly put them there, as in:
> > > > >   hadoop fs -copyFromLocal  URI
> > > > >
> > > > > But searching such an index will be slow.
> > > > > See also: http://katta.sourceforge.net/
> > > > >
> > > > > Otis
> > > > > --
> > > > > Solr & ElasticSearch Support
> > > > > http://sematext.com/
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim  > > >
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > > Would like to know how can i put the indexed solr shards into
> hdfs?
> > > > > >
> > > > > > Thanks..
> > > > > >
> > > > > > Joseph
> > > > > > On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" <
> > > otis.gospodne...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Joseph,
> > > > > > >
> > > > > > > What exactly are you looking to to?
> > > > > > > See http://incubator.apache.org/blur/
> > > > > > >
> > > > > > > Otis
> > > > > > > --
> > > > > > > Solr & ElasticSearch Support
> > > > > > > http://sematext.com/
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim  > > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi I am running hadoop distributed file system, how do I put
> my
> > > > > output
> > > > > > of
> > > > > > > > the solr dir into hdfs automatically?
> > > > > > > >
> > > > > > > > Thanks so much..
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards,
> > > > > > > > *Joseph*
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > > *Joseph*
> > >
> >
> >
> >
> > --
> > Best Regards,
> > *Joseph*
> >
>



-- 
Best Regards,
*Joseph*


Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Why wouldn't SolrCloud help you here? You can set up shards and replicas,
etc., to get redundancy, because HDFS isn't designed to serve real-time
queries as far as I understand. If you are using HDFS as a backup mechanism,
then to me you'd be better served by having multiple slaves tethered to a
master (in a non-cloud environment) or by setting up SolrCloud; either
option would give you more redundancy than copying an index to HDFS.

- Amit


On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim  wrote:

> Hi Upayavira,
>
> sure, let me explain. I am setting up Nutch and SOLR in hadoop environment.
> Since I am using hdfs, in the event if there is any crashes to the
> localhost(running solr), i will still have the shards of data being stored
> in hdfs.
>
> Thanks you so much =)
>
> On Thu, Mar 7, 2013 at 1:19 AM, Upayavira  wrote:
>
> > What are you actually trying to achieve? If you can share what you are
> > trying to achieve maybe folks can help you find the right way to do it.
> >
> > Upayavira
> >
> > On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> > > Hello Otis ,
> > >
> > > Is there any configuration where it will index into hdfs instead?
> > >
> > > I tried crawlzilla and  lily but I hope to update specific package such
> > > as
> > > Hadoop only or nutch only when there are updates.
> > >
> > > That's y would prefer to install separately .
> > >
> > > Thanks so much. Looking forward for your reply.
> > >
> > > On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> > >
> > > > Hello Joseph,
> > > >
> > > > You can certainly put them there, as in:
> > > >   hadoop fs -copyFromLocal  URI
> > > >
> > > > But searching such an index will be slow.
> > > > See also: http://katta.sourceforge.net/
> > > >
> > > > Otis
> > > > --
> > > > Solr & ElasticSearch Support
> > > > http://sematext.com/
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim  > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > Would like to know how can i put the indexed solr shards into hdfs?
> > > > >
> > > > > Thanks..
> > > > >
> > > > > Joseph
> > > > > On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" <
> > otis.gospodne...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Joseph,
> > > > > >
> > > > > > What exactly are you looking to to?
> > > > > > See http://incubator.apache.org/blur/
> > > > > >
> > > > > > Otis
> > > > > > --
> > > > > > Solr & ElasticSearch Support
> > > > > > http://sematext.com/
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim  > >
> > > > wrote:
> > > > > >
> > > > > > > Hi I am running hadoop distributed file system, how do I put my
> > > > output
> > > > > of
> > > > > > > the solr dir into hdfs automatically?
> > > > > > >
> > > > > > > Thanks so much..
> > > > > > >
> > > > > > > --
> > > > > > > Best Regards,
> > > > > > > *Joseph*
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards,
> > > *Joseph*
> >
>
>
>
> --
> Best Regards,
> *Joseph*
>


Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi Upayavira,

Sure, let me explain. I am setting up Nutch and Solr in a Hadoop environment.
Since I am using HDFS, in the event of any crash on the localhost (running
Solr), I will still have the shards of data stored in HDFS.

Thank you so much =)

On Thu, Mar 7, 2013 at 1:19 AM, Upayavira  wrote:

> What are you actually trying to achieve? If you can share what you are
> trying to achieve maybe folks can help you find the right way to do it.
>
> Upayavira
>
> On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> > Hello Otis ,
> >
> > Is there any configuration where it will index into hdfs instead?
> >
> > I tried crawlzilla and  lily but I hope to update specific package such
> > as
> > Hadoop only or nutch only when there are updates.
> >
> > That's y would prefer to install separately .
> >
> > Thanks so much. Looking forward for your reply.
> >
> > On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> >
> > > Hello Joseph,
> > >
> > > You can certainly put them there, as in:
> > >   hadoop fs -copyFromLocal  URI
> > >
> > > But searching such an index will be slow.
> > > See also: http://katta.sourceforge.net/
> > >
> > > Otis
> > > --
> > > Solr & ElasticSearch Support
> > > http://sematext.com/
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim  >
> > > wrote:
> > >
> > > > Hi,
> > > > Would like to know how can i put the indexed solr shards into hdfs?
> > > >
> > > > Thanks..
> > > >
> > > > Joseph
> > > > On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" <
> otis.gospodne...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Joseph,
> > > > >
> > > > > What exactly are you looking to to?
> > > > > See http://incubator.apache.org/blur/
> > > > >
> > > > > Otis
> > > > > --
> > > > > Solr & ElasticSearch Support
> > > > > http://sematext.com/
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim  >
> > > wrote:
> > > > >
> > > > > > Hi I am running hadoop distributed file system, how do I put my
> > > output
> > > > of
> > > > > > the solr dir into hdfs automatically?
> > > > > >
> > > > > > Thanks so much..
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > *Joseph*
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best Regards,
> > *Joseph*
>



-- 
Best Regards,
*Joseph*


Re: SOLR on hdfs

2013-03-06 Thread Upayavira
What are you actually trying to achieve? If you can share your goal, maybe
folks can help you find the right way to do it.

Upayavira

On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
> Hello Otis ,
> 
> Is there any configuration where it will index into hdfs instead?
> 
> I tried crawlzilla and  lily but I hope to update specific package such
> as
> Hadoop only or nutch only when there are updates.
> 
> That's y would prefer to install separately .
> 
> Thanks so much. Looking forward for your reply.
> 
> On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
> 
> > Hello Joseph,
> >
> > You can certainly put them there, as in:
> >   hadoop fs -copyFromLocal  URI
> >
> > But searching such an index will be slow.
> > See also: http://katta.sourceforge.net/
> >
> > Otis
> > --
> > Solr & ElasticSearch Support
> > http://sematext.com/
> >
> >
> >
> >
> >
> > On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim >
> > wrote:
> >
> > > Hi,
> > > Would like to know how can i put the indexed solr shards into hdfs?
> > >
> > > Thanks..
> > >
> > > Joseph
> > > On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" 
> > > 
> > >
> > > wrote:
> > >
> > > > Hi Joseph,
> > > >
> > > > What exactly are you looking to to?
> > > > See http://incubator.apache.org/blur/
> > > >
> > > > Otis
> > > > --
> > > > Solr & ElasticSearch Support
> > > > http://sematext.com/
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim 
> > > > >
> > wrote:
> > > >
> > > > > Hi I am running hadoop distributed file system, how do I put my
> > output
> > > of
> > > > > the solr dir into hdfs automatically?
> > > > >
> > > > > Thanks so much..
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > > *Joseph*
> > > > >
> > > >
> > >
> >
> 
> 
> -- 
> Best Regards,
> *Joseph*


Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hello Otis,

Is there any configuration where it will index into HDFS instead?

I tried Crawlzilla and Lily, but I would like to be able to update a specific
package, such as Hadoop only or Nutch only, when there are updates.

That's why I would prefer to install them separately.

Thanks so much. Looking forward to your reply.

On Wednesday, March 6, 2013, Otis Gospodnetic wrote:

> Hello Joseph,
>
> You can certainly put them there, as in:
>   hadoop fs -copyFromLocal  URI
>
> But searching such an index will be slow.
> See also: http://katta.sourceforge.net/
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim >
> wrote:
>
> > Hi,
> > Would like to know how can i put the indexed solr shards into hdfs?
> >
> > Thanks..
> >
> > Joseph
> > On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" 
> > 
> >
> > wrote:
> >
> > > Hi Joseph,
> > >
> > > What exactly are you looking to to?
> > > See http://incubator.apache.org/blur/
> > >
> > > Otis
> > > --
> > > Solr & ElasticSearch Support
> > > http://sematext.com/
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim 
> > > >
> wrote:
> > >
> > > > Hi I am running hadoop distributed file system, how do I put my
> output
> > of
> > > > the solr dir into hdfs automatically?
> > > >
> > > > Thanks so much..
> > > >
> > > > --
> > > > Best Regards,
> > > > *Joseph*
> > > >
> > >
> >
>


-- 
Best Regards,
*Joseph*


Re: SOLR on hdfs

2013-03-06 Thread Otis Gospodnetic
Hello Joseph,

You can certainly put them there, as in:
  hadoop fs -copyFromLocal  URI
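
To make that concrete, a copy of one core's on-disk index might look like
the lines below. The local path, the namenode host, and the target directory
are placeholders, not paths from this thread:

  # Hypothetical: copy a core's index directory into HDFS as a backup,
  # then list the target directory to confirm the files arrived.
  hadoop fs -copyFromLocal /var/solr/data/collection1/data/index hdfs://namenode:8020/solr-backups/index-20130306
  hadoop fs -ls hdfs://namenode:8020/solr-backups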

But searching such an index will be slow.
See also: http://katta.sourceforge.net/

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim  wrote:

> Hi,
> Would like to know how can i put the indexed solr shards into hdfs?
>
> Thanks..
>
> Joseph
> On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" 
> wrote:
>
> > Hi Joseph,
> >
> > What exactly are you looking to to?
> > See http://incubator.apache.org/blur/
> >
> > Otis
> > --
> > Solr & ElasticSearch Support
> > http://sematext.com/
> >
> >
> >
> >
> >
> > On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim  wrote:
> >
> > > Hi I am running hadoop distributed file system, how do I put my output
> of
> > > the solr dir into hdfs automatically?
> > >
> > > Thanks so much..
> > >
> > > --
> > > Best Regards,
> > > *Joseph*
> > >
> >
>


Re: SOLR on hdfs

2013-03-06 Thread Joseph Lim
Hi,
I would like to know: how can I put the indexed Solr shards into HDFS?

Thanks..

Joseph
On Mar 6, 2013 7:28 PM, "Otis Gospodnetic" 
wrote:

> Hi Joseph,
>
> What exactly are you looking to to?
> See http://incubator.apache.org/blur/
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim  wrote:
>
> > Hi I am running hadoop distributed file system, how do I put my output of
> > the solr dir into hdfs automatically?
> >
> > Thanks so much..
> >
> > --
> > Best Regards,
> > *Joseph*
> >
>


Re: SOLR on hdfs

2013-03-06 Thread Otis Gospodnetic
Hi Joseph,

What exactly are you looking to do?
See http://incubator.apache.org/blur/

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim  wrote:

> Hi I am running hadoop distributed file system, how do I put my output of
> the solr dir into hdfs automatically?
>
> Thanks so much..
>
> --
> Best Regards,
> *Joseph*
>