Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-21 Thread Furkan KAMACI
All in all, is there anything we can say, before measuring performance,
about storing the stored values of documents in HBase instead of Lucene?
I mean something like:

* I will need to communicate with HBase, and this will add more latency
than Lucene
* I will lose some built-in functionality that integrates Lucene and Solr
* I will lose some good things such as Lucene's in-memory caching
* and so on...

(These are not necessarily true; I wrote them only as examples.)

Any ideas?



Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-17 Thread adfel70
Is there any rule of thumb for the maximum size of a document when storing it
in Solr?



Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-16 Thread Furkan KAMACI
Hi Otis and Jack;

I have done some research on highlights and debugged the code. I see that
highlights are query dependent and are not stored. Why does Solr use Lucene
for storing text, i.e. the content of a web page? Is there any comparison
of storing text in HBase or another database versus Lucene?

Also, has anybody on the Solr user list used something other than
Lucene to store the text of documents?

2013/4/11 Otis Gospodnetic otis.gospodne...@gmail.com

 Source code is your best bet.  The wiki has info about how to use
 highlighting, but not how it is implemented.  But you don't need to
 understand the implementation details to understand that highlights are
 dynamic, computed specifically for each query for each matching document,
 so you cannot store them anywhere ahead of time.

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-16 Thread Otis Gospodnetic
People do sometimes use other data stores to retrieve data; e.g. Mongo
is popular for that.  Like I hinted in another email, I wouldn't
necessarily recommend this for common cases.  Don't do it unless you
really know you need it.  Otherwise, just store in Solr.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-16 Thread Furkan KAMACI
Thanks again for your answer. If I find any documents with such comparisons,
I would like to read them.

By the way, is the advantage of using Lucene over anything else
something like this:

Lucene is natively supported by Solr, and if I use anything else I
may face compatibility problems or communication issues?


Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-16 Thread Otis Gospodnetic
Use Solr.  It's pretty clear you don't yet have any problems that
would make you think about alternatives.  Using Solr to store and not
just index will make your life simpler (and your app simpler and
likely faster).

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-11 Thread Furkan KAMACI
Actually, I am not planning to store documents in Solr. I want to store just
the highlights (snippets) in HBase and retrieve them from HBase when
needed.
What do you think about separating just the highlights from Solr and storing
them in HBase with SolrCloud? Also, I would welcome an explanation of at which
stage and how highlights are generated in Solr.


2013/4/9 Otis Gospodnetic otis.gospodne...@gmail.com

 You may also be interested in looking at things like solrbase (on GitHub).

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:
  Hi;

  First of all, I should mention that I am new to Solr and am doing research
  on it. What I am trying to do is crawl some websites with Nutch
  and then index them with Solr (Nutch 2.1, Solr-SolrCloud 4.2).

  I wonder about something. I have a cloud of machines that crawl websites
  and store those documents. Then I send those documents to SolrCloud. Solr
  indexes them, generates the indexes, and saves them. I know
  from information retrieval theory that it *may* not be efficient to store
  indexes in a NoSQL database (they are something like linked lists, and if
  you store them in that kind of database you *may* end up with a sparse
  representation. By the way, there may be some solutions for this; if you
  can explain them, I would welcome it.)

  However, Solr stores some document content too (e.g. for highlights), so
  some of my documents will be duplicated. Given that I will have many
  documents, those duplicated documents may cause a problem for me. So is
  there any way to avoid storing the documents in Solr and instead point to
  them in HBase (where I save my crawled documents), or, instead of pointing,
  to store them directly in HBase (is that efficient or not)?



Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-11 Thread Otis Gospodnetic
Hi,

You can't store highlights ahead of time because they are query
dependent.  You could store documents in HBase and use Solr just for
indexing.  Is that what you want to do?  If so, a custom
SearchComponent executed after QueryComponent could fetch data from
an external store like HBase.  I'm not sure if I'd recommend that.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
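
The two-step flow described above (query the index first, then fetch the document body from an external store) can be sketched roughly as follows. This does not use Solr's or HBase's APIs: `ExternalStore`, `IndexOnlySearcher`, and `search_and_fetch` are hypothetical in-memory stand-ins, assumed here only to show the shape of the approach. In Solr itself the second step would live in a custom SearchComponent that runs after QueryComponent.

```python
class ExternalStore:
    """In-memory stand-in for an external document store such as HBase."""

    def __init__(self):
        self._rows = {}

    def put(self, doc_id, content):
        self._rows[doc_id] = content

    def get(self, doc_id):
        return self._rows.get(doc_id)


class IndexOnlySearcher:
    """Toy inverted index that keeps only document ids, never the text."""

    def __init__(self):
        self._postings = {}

    def add(self, doc_id, text):
        # Index each distinct term; the text itself is not retained.
        for term in set(text.lower().split()):
            self._postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))


def search_and_fetch(searcher, store, term):
    """Step 1: get matching ids from the index. Step 2: fetch bodies externally."""
    return [
        {"id": doc_id, "content": store.get(doc_id)}
        for doc_id in searcher.search(term)
    ]


store = ExternalStore()
searcher = IndexOnlySearcher()
for doc_id, text in [
    ("page1", "Solr indexes crawled pages"),
    ("page2", "HBase stores crawled pages"),
]:
    store.put(doc_id, text)
    searcher.add(doc_id, text)

results = search_and_fetch(searcher, store, "hbase")
```

Every hit costs an extra lookup in the external store, which is one reason the simpler option of storing the fields in Solr is usually suggested first.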





Re: Pointing to HBase for Documents or Directly Saving Documents at HBase

2013-04-11 Thread Furkan KAMACI
Hi Otis;

It seems that I should read more about highlights. Is there anywhere that
explains in detail how highlights are generated in Solr?

2013/4/11 Otis Gospodnetic otis.gospodne...@gmail.com

 Hi,

 You can't store highlights ahead of time because they are query
 dependent.  You could store documents in HBase and use Solr just for
 indexing.  Is that what you want to do?  If so, a custom
 SearchComponent executed after QueryComponent could fetch data from
 an external store like HBase.  I'm not sure if I'd recommend that.

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:
  Actually, I am not planning to store documents in Solr. I want to store just
  the highlights (snippets) in HBase and retrieve them from HBase when needed.
  What do you think about separating just the highlights from Solr and storing
  them in HBase in a SolrCloud setup? By the way, I would welcome an
  explanation of at which stage, and how, highlights are generated in Solr.
 
 



Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

2013-04-11 Thread Otis Gospodnetic
Source code is your best bet.  Wiki has info about how to use it, but
not how highlighting is implemented.  But you don't need to understand
the implementation details to understand that they are dynamic,
computed specifically for each query for each matching document, so
you cannot store them anywhere ahead of time.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
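
A minimal sketch of why highlights cannot be computed ahead of time, assuming a toy highlighter rather than Solr's actual implementation (the function name `highlight`, the `window` parameter, and the `<em>` markup are illustrative assumptions, not Solr APIs). The same document yields different snippets for different queries, so there is nothing fixed to store in advance:

```python
import re

def highlight(text, query, window=30):
    """Toy highlighter: return one snippet per query-term match in `text`.

    This is NOT how Solr implements highlighting; it only illustrates
    that the output depends on the query as well as on the document.
    """
    terms = [re.escape(t) for t in query.lower().split()]
    if not terms:
        return []
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    snippets = []
    for m in pattern.finditer(text):
        # Take some context on each side of the match.
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        snippets.append(
            text[start:m.start()] + "<em>" + m.group(0) + "</em>" + text[m.end():end]
        )
    return snippets

doc = "Nutch crawls web pages and Solr indexes the crawled pages for search."

# Same document, different queries, different snippets.
print(highlight(doc, "nutch"))
print(highlight(doc, "solr indexes"))
print(highlight(doc, "hbase"))
```

What can be stored ahead of time is the text the snippets are cut from, which is why the real question in this thread is where that text should live.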





On Thu, Apr 11, 2013 at 11:22 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Hi Otis;

 It seems that I should read more about highlights. Is there anywhere that
 explains in detail how highlights are generated in Solr?




Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

2013-04-09 Thread Otis Gospodnetic
You may also be interested in looking at things like solrbase (on Github).

Otis
--
Solr & ElasticSearch Support
http://sematext.com/







Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase

2013-04-06 Thread Jack Krupansky
Solr would not be storing the original source form of the documents in any 
case. Whether you use Tika or SolrCell, only the text stream of the content 
and the metadata would ever get indexed or stored in Solr.


Solr completely decouples indexing and storing of data values. If you 
don't want to store the text stream in Solr, then don't.


If you want to store the original blob of the source documents in some
other data store, that's your choice. In Solr you can store the original URL,
or a document ID or URL that points into some alternate document store.
Solr in no way forces you one way or the other. And whether that URL
or document ID refers to HBase or a web site doesn't matter to Solr either.


Whether or not you could more efficiently store the original document bytes 
in Lucene/Solr DocValues vs. HBase is a separate matter - I don't know one 
way or the other whether DocValues help or not. Or whether a Solr 
BinaryField might be suitable for storing the original bytes of a document
(but without indexing the bytes).


In other words, maybe you could just use two separate Solr servers, one for 
text index and metadata store, and the other for raw store of the original 
document bytes.


-- Jack Krupansky
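
The decoupling described above can be sketched roughly as follows. This is a toy model, not Solr or HBase code: the dictionary `external_store` stands in for an external store such as HBase, `inverted_index` stands in for the search index, and the function names are hypothetical. The index keeps only terms and document IDs, and the original bytes are fetched by ID in a second step after the search:

```python
from collections import defaultdict

# Stand-in for an external document store such as HBase: row key -> raw bytes.
external_store = {}

# Stand-in for the search index: term -> set of document IDs.
# The text is indexed but not stored; only the ID survives in the index.
inverted_index = defaultdict(set)

def add_document(doc_id, raw_bytes):
    """Save the original bytes externally and index only the terms."""
    external_store[doc_id] = raw_bytes
    text = raw_bytes.decode("utf-8")
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(term):
    """Return matching document IDs; no document content is returned."""
    return sorted(inverted_index.get(term.lower(), set()))

def fetch(doc_ids):
    """Second step: resolve IDs against the external store."""
    return {doc_id: external_store[doc_id] for doc_id in doc_ids}

add_document("page-1", b"Solr indexes crawled pages")
add_document("page-2", b"HBase stores crawled pages")

hits = search("crawled")
originals = fetch(hits)
print(hits)
print(originals)
```

In Solr terms, the first half corresponds to schema fields declared with indexed="true" and stored="false" plus a stored ID field. The extra `fetch` round trip is the cost of this design, and it is the step a custom component running after the query would have to perform.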

-Original Message- 
From: Furkan KAMACI

Sent: Saturday, April 06, 2013 6:01 PM
To: solr-user@lucene.apache.org
Subject: Pointing to Hbase for Docuements or Directly Saving Documents at 
Hbase


Hi;

First of all, I should mention that I am new to Solr and am researching
it. What I am trying to do is crawl some websites with Nutch
and then index them with Solr. (Nutch 2.1, Solr-SolrCloud 4.2)

I wonder about something. I have a cloud of machines that crawls websites
and stores those documents. Then I send those documents to SolrCloud. Solr
indexes those documents, generates indexes, and saves them. I know
from Information Retrieval theory that it *may* not be efficient to store
indexes in a NoSQL database (they are something like linked lists, and if
you store them in that kind of database you *may* end up with a sparse
representation. By the way, there may be some solutions for this; if you
can explain them, you are welcome to.)

However, Solr stores some documents too (i.e. highlights), so some of my
documents will be duplicated somehow. Since I will have many
documents, those duplicated documents may cause a problem for me. So is there
any way to avoid storing those documents in Solr and point to them in
HBase (where I save my crawled documents) instead, or, rather than pointing,
to store them directly in HBase (is that efficient or not)?