Re: To warm the whole cache of Solr other than the only autowarmcount

2014-07-24 Thread YouPeng Yang
To Matt

  Thank you, your opinion is very valuable. I have checked the source
code for how the cache warms up: it seems to just put items from the old
caches into the new caches.
  I will pull Mark Miller into this discussion. He is one of the Solr
developers I have been in contact with.

 To Mark Miller

   Would you please take a look at what we are discussing in the last two
posts? I need your help.


Regards.





2014-07-25 2:50 GMT+08:00 Matt Kuiper (Springblox) <
matt.kui...@springblox.com>:

> I don't believe this would work.  My understanding (please correct if I
> have this wrong) is that the underlying Lucene document ids have a
> potential to change and so when a newSearcher is created the caches must be
> regenerated and not copied.
>
> Matt
>
> -Original Message-
> From: YouPeng Yang [mailto:yypvsxf19870...@gmail.com]
> Sent: Thursday, July 24, 2014 10:26 AM
> To: solr-user@lucene.apache.org
> Subject: To warm the whole cache of Solr other than the only autowarmcount
>
> Hi
>
>    I think it is wonderful to have caches autowarmed when a commit or
> soft commit happens. However, if I want to warm the whole cache rather
> than only autowarmcount entries, the default autowarming operation will
> take a very long time. So maybe it would be a good idea to just swap
> references: point the caches of the new searcher at the caches of the old
> searcher. That would improve the autowarming time and also the NRT query
> time.
>   It is just not a mature idea. I am posting it in the hope of getting
> more hints or help to make the idea clearer.
>
>
>
> regards
>


Re: Multipart documents with different update cycles

2014-07-24 Thread Alexandre Rafalovitch
Do you search the frequently changing user-metadata? If not, maybe the
external file field is helpful.
https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes
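A minimal sketch of what that looks like in schema.xml (the field and type
names here are made up for illustration):

<fieldType name="extFloat" class="solr.ExternalFileField" keyField="id"
           defVal="0" valType="pfloat"/>
<field name="user_rank" type="extFloat" indexed="false" stored="false"/>

The values then live in a plain text file named external_user_rank (lines
of the form id=value) in the index data directory, so they can be swapped
without reindexing the main documents. Note such a field is only usable in
function queries (sorting/boosting), not for searching.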

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Fri, Jul 25, 2014 at 12:04 AM, Aurélien MAZOYER
 wrote:
> Hello,
>
> I have to index a dataset containing multipart documents. The "main" part
> and the "user metadata" part have different update cycles: we want to
> update the "user metadata" part frequently without having to refetch the
> main part from the datasource, and without storing every field just to be
> able to use atomic updates. As there is no true field-level update in Solr
> yet, I am afraid I will have to build an index for each part and perform a
> query-time join, with all the well-known performance limitations. I have
> also heard of the sidecar index. Is it a solution that can meet my
> requirements? Is it stable enough to be used in production? Does the
> community plan to make it part of the trunk code?
>
> Thanks,
>
> Aurelien
>


Re: Where can I get information about Solr Cloud H/W spec

2014-07-24 Thread Alexandre Rafalovitch
http://search-lucene.com/ is helpful for looking around.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Fri, Jul 25, 2014 at 8:16 AM, Lee Chunki  wrote:
> Hi Alex,
>
> Thank you for your reply.
> let me check mailing list archive again.
>
> Regards,
> Chunki.
>


Re: Is there any performance impact of using a non-standard-length UUID as the unique key of Solr?

2014-07-24 Thread Mark Miller
Some good info on unique id’s for Lucene / Solr can be found here: 
http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
-- 
Mark Miller
about.me/markrmiller

On July 24, 2014 at 9:51:28 PM, He haobo (haob...@gmail.com) wrote:

Hi,

In our Solr collection (Solr 4.8), we have the following unique key
definition:

<uniqueKey>id</uniqueKey>

In our external Java program, we first generate a UUID with
UUID.randomUUID().toString(). Then we use a cryptographic hash to generate
a 32-byte text string and use that as the id.

We might need to post more than 20k Solr docs per second, and
UUID.randomUUID() or the cryptographic hashing might take time. We have a
simple workaround: share one hash among many Solr docs by appending a
sequence number to it, such as 9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY00,
9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY01,
9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY02, etc.


What we want to know is: if we use a 38-byte id, is there any performance
impact on Solr inserts or queries? Or would Solr's default automatically
generated id implementation be more efficient?



Thanks,  
Eternal  


Is there any performance impact of using a non-standard-length UUID as the unique key of Solr?

2014-07-24 Thread He haobo
Hi,

In our Solr collection (Solr 4.8), we have the following unique key
definition:

<uniqueKey>id</uniqueKey>

In our external Java program, we first generate a UUID with
UUID.randomUUID().toString(). Then we use a cryptographic hash to generate
a 32-byte text string and use that as the id.

We might need to post more than 20k Solr docs per second, and
UUID.randomUUID() or the cryptographic hashing might take time. We have a
simple workaround: share one hash among many Solr docs by appending a
sequence number to it, such as 9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY00,
9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY01,
9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY02, etc.
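For illustration, a rough sketch of that workaround in Java; the class and
method names are made up, and MD5 is just one example of a hash that yields
a 32-character hex string:

import java.security.MessageDigest;
import java.util.UUID;

// One hash per batch, with a per-document sequence suffix appended.
public class BatchIdFactory {

    private final String prefix;  // 32 hex chars shared by the whole batch
    private int seq = 0;

    public BatchIdFactory() throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(UUID.randomUUID().toString().getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02X", b));  // hex-encode each byte
        }
        this.prefix = sb.toString();
    }

    // e.g. ...BDFY00, ...BDFY01, ... for successive docs in the batch
    public String nextId() {
        return prefix + String.format("%02d", seq++);
    }
}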


What we want to know is: if we use a 38-byte id, is there any performance
impact on Solr inserts or queries? Or would Solr's default automatically
generated id implementation be more efficient?



Thanks,
Eternal


Re: Where can I get information about Solr Cloud H/W spec

2014-07-24 Thread Lee Chunki
Hi Alex,

Thank you for your reply.
let me check mailing list archive again.

Regards,
Chunki.

On Jul 24, 2014, at 6:11 PM, Alexandre Rafalovitch  wrote:

> Have you tried searching the mailing list archives? Some of these
> things have been discussed a number of times. SSDs are definitely good
> for Solr. But also you may get more specific help if you say what kind
> of volume/throughput of data you are looking at.
> 
> Regards,
>   Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
> 
> 
> On Thu, Jul 24, 2014 at 3:27 PM, Lee Chunki  wrote:
>> Hi,
>> 
>> I am trying to build a SolrCloud cluster.
>> Do you know where I can get information like:
>> 
>> * whether SolrCloud supports heterogeneous servers
>> * disk type
>>   * SSD vs. SAS vs. ….
>>   * no RAID vs. RAID-5 vs. RAID-0 vs. ….
>> * network
>>   * 100 Mbit vs. 1 Gbit vs. ….
>> * ….
>> 
>> Of course, it will depend on data size, traffic, and so on,
>> but please let me know a general or minimum H/W spec.
>> 
>> Thanks,
>> Chunki.



Re: Understanding the Debug explanations for Query Result Scoring/Ranking

2014-07-24 Thread Koji Sekiguchi

Hi,

In addition, this might be useful:

Fundamentals of Information Retrieval, Illustration with Apache Lucene
https://www.youtube.com/watch?v=SCsS5ePGmCs

This video is about 40 minutes long, but you can fast-forward to 24:00
to learn about scoring based on the vector space model and how Lucene
customizes it.

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/25 8:00), Uwe Reh wrote:

Hi,

To get an idea of the meaning of all these numbers, have a look at
http://explain.solr.pl. I like this tool; it's great.
this tool, it's great.

Uwe

Am 25.07.2014 00:45, schrieb O. Olson:

Hi,

If you add /*&debug=true*/ to the Solr request /(and &wt=xml if your
current output is not XML)/, you get a node in the resulting XML named
"debug". It has a child node called "explain", which holds a list showing
why the results are ranked in a particular order. I'm curious whether there
is some documentation on understanding these numbers/results.

I am new to Solr, so I apologize if I am using the wrong terms to describe
my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it.

My problem is trying to understand something like this:

1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in
44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0
= termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of:
7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226
= fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 =
termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109)
[DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 =
termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 =
idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 =
fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109)

*Note:* I have searched for "televisions". My search field is a single
catch-all field. The Edismax parser seems to break up my search term into
"televis" and "tv"

Is there some documentation on how to understand these numbers? They do
not seem to be properly delimited. At the minimum, I can understand
something like:
1.5797625 = 0.4717142 + 1.1080483
and
0.71447384 = 7.0424104 * 0.10145303

But I cannot understand whether something like "0.10145303 = queryNorm
0.660226 = fieldWeight in 44109" is used in the calculation anywhere. Also,
since there were only two terms /("televis" and "tv")/, I could use
subtraction to find out that 1.1080483 was the start of a new result.

I'd also appreciate it if someone could tell me which class dumps out the
above data. If I know it, I can edit that class to make the output a bit
more understandable for me.

Thank you,
O. O.






--
View this message in context:
http://lucene.472066.n3.nabble.com/Understanding-the-Debug-explanations-for-Query-Result-Scoring-Ranking-tp4149137.html

Sent from the Solr - User mailing list archive at Nabble.com.










Re: Understanding the Debug explanations for Query Result Scoring/Ranking

2014-07-24 Thread Uwe Reh

Hi,

to get an idea of the meaning of all this numbers, have a look on 
http://explain.solr.pl. I like this tool, it's great.


Uwe

Am 25.07.2014 00:45, schrieb O. Olson:

Hi,

If you add /*&debug=true*/ to the Solr request /(and &wt=xml if your
current output is not XML)/, you get a node in the resulting XML named
"debug". It has a child node called "explain", which holds a list showing
why the results are ranked in a particular order. I'm curious whether there
is some documentation on understanding these numbers/results.

I am new to Solr, so I apologize if I am using the wrong terms to describe
my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it.

My problem is trying to understand something like this:

1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in
44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0
= termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of:
7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226
= fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 =
termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109)
[DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 =
termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 =
idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 =
fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109)

*Note:* I have searched for "televisions". My search field is a single
catch-all field. The Edismax parser seems to break up my search term into
"televis" and "tv"

Is there some documentation on how to understand these numbers? They do
not seem to be properly delimited. At the minimum, I can understand
something like:
1.5797625 = 0.4717142 + 1.1080483
and
0.71447384 = 7.0424104 * 0.10145303

But I cannot understand whether something like "0.10145303 = queryNorm
0.660226 = fieldWeight in 44109" is used in the calculation anywhere. Also,
since there were only two terms /("televis" and "tv")/, I could use
subtraction to find out that 1.1080483 was the start of a new result.

I'd also appreciate it if someone could tell me which class dumps out the
above data. If I know it, I can edit that class to make the output a bit
more understandable for me.

Thank you,
O. O.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Understanding-the-Debug-explanations-for-Query-Result-Scoring-Ranking-tp4149137.html
Sent from the Solr - User mailing list archive at Nabble.com.





Understanding the Debug explanations for Query Result Scoring/Ranking

2014-07-24 Thread O. Olson
Hi,

If you add /*&debug=true*/ to the Solr request /(and &wt=xml if your
current output is not XML)/, you get a node in the resulting XML named
"debug". It has a child node called "explain", which holds a list showing
why the results are ranked in a particular order. I'm curious whether there
is some documentation on understanding these numbers/results.

I am new to Solr, so I apologize if I am using the wrong terms to describe
my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it.

My problem is trying to understand something like this: 

1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in
44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0
= termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of:
7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226
= fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 =
termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109)
[DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 =
termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 =
idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 =
fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109)

*Note:* I have searched for "televisions". My search field is a single
catch-all field. The Edismax parser seems to break up my search term into
"televis" and "tv"

Is there some documentation on how to understand these numbers? They do
not seem to be properly delimited. At the minimum, I can understand
something like:
1.5797625 = 0.4717142 + 1.1080483
and
0.71447384 = 7.0424104 * 0.10145303

But I cannot understand whether something like "0.10145303 = queryNorm
0.660226 = fieldWeight in 44109" is used in the calculation anywhere. Also,
since there were only two terms /("televis" and "tv")/, I could use
subtraction to find out that 1.1080483 was the start of a new result.
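Re-indented by hand for reference (the nesting below is inferred from the
structure of Lucene's Explanation output, not present in the original; each
parent value is the sum or product of its indented children):

1.5797625 = (MATCH) sum of:
  0.4717142 = (MATCH) weight(text:televis in 44109) [DefaultSimilarity], result of:
    0.4717142 = score(doc=44109,freq=1.0 = termFreq=1.0), product of:
      0.71447384 = queryWeight, product of:
        7.0424104 = idf(docFreq=896, maxDocs=377553)
        0.10145303 = queryNorm
      0.660226 = fieldWeight in 44109, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        7.0424104 = idf(docFreq=896, maxDocs=377553)
        0.09375 = fieldNorm(doc=44109)
  1.1080483 = (MATCH) weight(text:tv in 44109) [DefaultSimilarity], result of:
    1.1080483 = score(doc=44109,freq=6.0 = termFreq=6.0), product of:
      0.6996622 = queryWeight, product of:
        6.896415 = idf(docFreq=1037, maxDocs=377553)
        0.10145303 = queryNorm
      1.5836904 = fieldWeight in 44109, product of:
        2.4494898 = tf(freq=6.0), with freq of:
          6.0 = termFreq=6.0
        6.896415 = idf(docFreq=1037, maxDocs=377553)
        0.09375 = fieldNorm(doc=44109)

So "0.10145303 = queryNorm" and "0.660226 = fieldWeight in 44109" are
siblings inside the two products above, not free-floating values.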

I'd also appreciate it if someone could tell me which class dumps out the
above data. If I know it, I can edit that class to make the output a bit
more understandable for me.

Thank you,
O. O.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Understanding-the-Debug-explanations-for-Query-Result-Scoring-Ranking-tp4149137.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Shuffling results

2014-07-24 Thread Joel Bernstein
Here's a blog describing the RankQuery API:

http://heliosearch.org/solrs-new-rankquery-feature/

Joel Bernstein
Search Engineer at Heliosearch


On Thu, Jul 24, 2014 at 6:22 PM, Joel Bernstein  wrote:

> This is the kind of use case the RankQuery API was created for. It allows
> you to write your own Lucene ranking collector and plug it in. It's an
> expert level java API so you'll need to program in Java and understand a
> lot about how Lucene collectors work, but it's cool stuff to learn.
>
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Thu, Jul 24, 2014 at 2:59 PM, babenis  wrote:
>
>> Could you possibly elaborate on what that function could look like and
>> how to use it?
>>
>> I have an ecommerce site with lots of products, and some categories have
>> 50 times more products than others. I would like to "shuffle" the result
>> set so that, if the search is conducted by parent category id, all child
>> categories found in the result set are shuffled: some categories get a
>> little more of the results and others a little less, and products from
>> the same brand are scattered across the result set, while still being
>> able to use pagination (i.e. 12 results per page; pages 1, 2 and 3
>> should have different products, etc.).
>>
>> Ideally I'd like to use product tags as well, to separate similar
>> products from each other even further.
>>
>> Thank you so much for any help
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Shuffling-results-tp497372p4149092.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Re: Shuffling results

2014-07-24 Thread Joel Bernstein
This is the kind of use case the RankQuery API was created for. It allows
you to write your own Lucene ranking collector and plug it in. It's an
expert level java API so you'll need to program in Java and understand a
lot about how Lucene collectors work, but it's cool stuff to learn.



Joel Bernstein
Search Engineer at Heliosearch


On Thu, Jul 24, 2014 at 2:59 PM, babenis  wrote:

> Could you possibly elaborate on what that function could look like and
> how to use it?
>
> I have an ecommerce site with lots of products, and some categories have
> 50 times more products than others. I would like to "shuffle" the result
> set so that, if the search is conducted by parent category id, all child
> categories found in the result set are shuffled: some categories get a
> little more of the results and others a little less, and products from
> the same brand are scattered across the result set, while still being
> able to use pagination (i.e. 12 results per page; pages 1, 2 and 3 should
> have different products, etc.).
>
> Ideally I'd like to use product tags as well, to separate similar
> products from each other even further.
>
> Thank you so much for any help
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Shuffling-results-tp497372p4149092.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Shuffle results a little

2014-07-24 Thread Ahmet Arslan
Hi Babenis,

https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking 
is a good place to implement such diversity functionality.

There is no stock solution currently other than field collapsing and random 
fields.
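For example, a stock combination along those lines (field names here are
assumed): declare a random sort field in schema.xml,

<fieldType name="random" class="solr.RandomSortField" indexed="true"/>
<dynamicField name="random_*" type="random"/>

then collapse on brand while sorting pseudo-randomly:

q=*:*&fq=parent_cat_id:123&group=true&group.field=brand&group.limit=2&sort=random_1234 asc

Changing the seed in the field name (random_1234 -> random_5678) changes
the shuffle, while a fixed seed keeps the ordering stable across pages so
pagination still works.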

Ahmet



On Thursday, July 24, 2014 10:04 PM, babenis  wrote:
Were you ever able to figure out a way to do this "shuffling" of the result
set?

I'm looking to do the same thing, but shuffle results not only by brand but
also by child categories, mainly because we have very dominant categories
and would still like products from other categories to be visible
throughout the site without explicit searches; otherwise it seems like we
specialize in only a couple of categories.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Shuffle-results-a-little-tp1891206p4149091.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: SolrCloud extended warmup support

2014-07-24 Thread Toke Eskildsen
Jeff Wartes [jwar...@whitepages.com] wrote:
> Well, I’m not sure what to say. I’ve been observing a noticeable latency
> decrease over the first few thousand queries.

How exactly do you get the index files fully cached? The cp command will
(at least on some systems) happily skip the copy if the destination is
/dev/null. One way to ensure caching is to cat all the files to /dev/null.
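For example (index path made up):

find /var/solr/collection1/data/index -type f -exec cat {} + > /dev/null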

- Toke Eskildsen


Re: SolrCloud extended warmup support

2014-07-24 Thread Jeff Wartes

Well, I’m not sure what to say. I’ve been observing a noticeable latency
decrease over the first few thousand queries. I’m not doing anything too
tricky either. Same exact query pattern, only one fq, always on the same
field, no faceting. The only potential suspects that occur to me could be
that it’s a large index (although properly sharded to fit in system
memory), and that it’s doing geo filtering & ordering.

Since I don’t have a good mechanism for many queries, I’ll probably just
do a few queries in firstSearcher for now and cross my fingers, but I’m
not optimistic.

For what it’s worth, I did verify that a replica doesn’t make itself
available to other nodes until after the firstSearcher queries are
completed.



On 7/21/14, 5:57 PM, "Erick Erickson"  wrote:

>I've never seen it necessary to run "thousands of queries"
>to warm Solr. Usually less than a dozen will work fine. My
>challenge would be for you to measure performance differences
>on queries after running, say, 12 well-chosen queries as
>opposed to hundreds/thousands. I bet that if
>1> you search across all the relevant fields, you'll fill up the
> low-level caches for those fields.
>2> you facet on all the fields you intend to facet on.
>3> you sort on all the fields you intend to sort on.
>4> you specify some filter queries. This is fuzzy since
> really depends on you being able to predict what
> those will be for firstSearcher. Things like "in the
> last day/week/month" can be pre-configured, but
> others you won't get. BTW, here's a blog about
> why "in the last day" fq clauses can be tricky.
>   http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
>
>that you'll pretty much nail warmup and be fine. Note that
>you can do all the faceting on a single query. Specifying
>the primary, secondary & etc. sorts will fill those caches.
>
>Best,
>Erick
>
>
>On Mon, Jul 21, 2014 at 5:07 PM, Jeff Wartes 
>wrote:
>
>>
>> On 7/21/14, 4:50 PM, "Shawn Heisey"  wrote:
>>
>> >On 7/21/2014 5:37 PM, Jeff Wartes wrote:
>> >> I'd like to ensure an extended warmup is done on each SolrCloud node
>> >> prior to that node serving traffic.
>> >> I can do certain things prior to starting Solr, such as pump the index
>> >> dir through /dev/null to pre-warm the filesystem cache, and post-start
>> >> I can use the ping handler with a health check file to prevent the
>> >> node from entering the client's load balancer until I'm ready.
>> >> What I seem to be missing is control over when a node starts
>> >> participating in queries sent to the other nodes.
>> >>
>> >> I can, of course, add solrconfig.xml firstSearcher queries, which I
>> >> assume (and fervently hope!) happens before a node registers itself in
>> >> ZK clusterstate.json as ready for work, but that doesn't scale so well
>> >> if I want that initial warmup to run thousands of queries, or run them
>> >> with some parallelism. I'm storing solrconfig.xml in ZK, so I'm
>> >> sensitive to the size.
>> >>
>> >> Any ideas, or corrections to my assumptions?
>> >
>> >I think that firstSearcher/newSearcher (and making sure useColdSearcher
>> >is set to false) is going to be the only way you can do this in a way
>> >that's compatible with SolrCloud.  If you were doing manual distributed
>> >search without SolrCloud, you'd have more options available.
>> >
>> >If useColdSearcher is set to false, that should keep *everything* from
>> >using the searcher until the warmup has finished.  I cannot be certain
>> >that this is the case, but I have some reasonable confidence that this
>> >is how it works.  If you find that it doesn't behave this way, I'd call
>> >it a bug.
>> >
>> >Thanks,
>> >Shawn
>>
>>
>> Thanks for the quick reply. Since distributed search latency is the max
>> of the shard sub-requests, I'm trying my best to minimize any spikes in
>> cluster latency due to node restarts.
>> I double-checked useColdSearcher was false, but the doc says this means
>> requests "block until the first searcher is done warming", which
>> translates pretty clearly to "latency spike". The more I think about it,
>> the more worried I am that a node might indeed register itself in
>> live_nodes and get distributed requests before it's got a searcher to
>> work with. *Especially* if I have lots of serial firstSearcher queries.
>>
>> I'll look through the code myself tomorrow, but if anyone can help
>> confirm/deny the order of operations here, I'd appreciate it.
>>



Re: integrating Accumulo with solr

2014-07-24 Thread Jack Krupansky
Like I said, you're going to have to be a real, hard-core gunslinger to do 
that well. Sqrrl uses Lucene directly, BTW:


"Full-Text Search: Utilizing open-source Lucene and custom indexing methods, 
Sqrrl Enterprise users can conduct real-time, full-text search across data 
in Sqrrl Enterprise."


See:
http://sqrrl.com/product/search/

Out of curiosity, why are you not using that integrated Lucene support of 
Sqrrl Enterprise?


-- Jack Krupansky

-Original Message- 
From: Ali Nazemian

Sent: Thursday, July 24, 2014 3:07 PM
To: solr-user@lucene.apache.org
Subject: Re: integrating Accumulo with solr

Dear Jack,
Thank you. I am aware of DataStax, but I am looking at integrating Accumulo
with Solr. This is something like what the Sqrrl guys offer.
Regards.


On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky 
wrote:


If you are not a "true hard-core gunslinger" who is willing to dive in and
integrate the code yourself, instead you should give serious consideration
to a product such as DataStax Enterprise that fully integrates and 
packages

a NoSQL database (Cassandra) and Solr for search. The security aspects are
still a work in progress, but certainly headed in the right direction. And
it has Hadoop and Spark integration as well.

See:
http://www.datastax.com/what-we-offer/products-services/
datastax-enterprise

-- Jack Krupansky

-Original Message- From: Ali Nazemian
Sent: Thursday, July 24, 2014 10:30 AM
To: solr-user@lucene.apache.org
Subject: Re: integrating Accumulo with solr


Thank you very much. Nice idea, but how can Solr and Accumulo be
synchronized in this way?
I know that Solr can be integrated with HDFS, and Accumulo also works on
top of HDFS. So can I use HDFS as the integration point? I mean, set Solr
to use HDFS as a source of documents as well as the destination of
documents.
Regards.


On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock  wrote:

 Ali,


Sounds like a good choice.  It's pretty standard to store the primary
storage id as a field in Solr so that you can search the full text in Solr
and then retrieve the full document elsewhere.

I would recommend creating a document structure in Solr with whatever
fields you want indexed (most likely as text_en, etc.), and then store a
"string" field named "content_id", which would be the Accumulo row id that
you look up with a scan.

One caveat -- Accumulo will be protected at the cell level, but if you need
your Solr search results to be protected by complex authorization strings
similar to Accumulo, you will need to write your own QParserPlugin and use
post filtering:
http://java.dzone.com/articles/custom-security-filtering-solr

The code you see in that article is written for an earlier version of Solr,
but it's not too difficult to adjust it for the latest (we've done so in
our project).  Once you've implemented this, you would store an
"authorizations" string field in each Solr document, and pass in the
authorizations that the user has access to in the fq parameter of every
query.  It's also not too bad to write something that parses the Accumulo
authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in
the QParserPlugin.

This will give you true row level security in Solr and Accumulo, and it
performs quite well in Solr.

Let me know if you have any other questions.

Joe


On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian 
wrote:

> Dear Joe,
> Hi,
> I am going to store the crawled web pages in Accumulo as the main
> storage part of my project, and I need to give this data to Solr for
> indexing and user searches. I need to do some social and web analysis on
> my data, as well as have some security features. Therefore Accumulo is my
> choice for the database part, and for indexing and search I am going to
> use Solr. Would you please guide me through that?
>
>
>
> On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock 
wrote:
>
> > We store data in both Solr and Accumulo -- do you have more details
> > about what kind of data and indexing you want?  Is there a reason
> > you're thinking of using both databases in particular?
> >
> >
> > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
> > wrote:
> >
> > > Dear All,
> > > Hi,
> > > I was wondering, is there anybody out there who has tried to
> > > integrate Solr with Accumulo? I was thinking about using Accumulo on
> > > top of HDFS and using Solr to index the data inside Accumulo. Do you
> > > have any idea how I can do such an integration?
> > >
> > > Best regards.
> > >
> > > --
> > > A.Nazemian
> > >
> >
> >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty.
> > I have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want.  I can
> > do all this through him who gives me strength. *-Philippians 4:12-13*
> >
>
>
>
> --
> A.Nazemian
>



--
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength. *-Philippians 4:12-13*

Re: integrating Accumulo with solr

2014-07-24 Thread Ali Nazemian
Dear Jack,
Thank you. I am aware of DataStax, but I am looking at integrating Accumulo
with Solr. This is something like what the Sqrrl guys offer.
Regards.


On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky 
wrote:

> If you are not a "true hard-core gunslinger" who is willing to dive in and
> integrate the code yourself, instead you should give serious consideration
> to a product such as DataStax Enterprise that fully integrates and packages
> a NoSQL database (Cassandra) and Solr for search. The security aspects are
> still a work in progress, but certainly headed in the right direction. And
> it has Hadoop and Spark integration as well.
>
> See:
> http://www.datastax.com/what-we-offer/products-services/
> datastax-enterprise
>
> -- Jack Krupansky
>
> -Original Message- From: Ali Nazemian
> Sent: Thursday, July 24, 2014 10:30 AM
> To: solr-user@lucene.apache.org
> Subject: Re: integrating Accumulo with solr
>
>
> Thank you very much. Nice idea, but how can Solr and Accumulo be
> synchronized in this way?
> I know that Solr can be integrated with HDFS, and Accumulo also works on
> top of HDFS. So can I use HDFS as the integration point? I mean, set Solr
> to use HDFS as a source of documents as well as the destination of
> documents.
> Regards.
>
>
> On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock  wrote:
>
>  Ali,
>>
>> Sounds like a good choice.  It's pretty standard to store the primary
>> storage id as a field in Solr so that you can search the full text in Solr
>> and then retrieve the full document elsewhere.
>>
>> I would recommend creating a document structure in Solr with whatever
>> fields you want indexed (most likely as text_en, etc.), and then store a
>> "string" field named "content_id", which would be the Accumulo row id that
>> you look up with a scan.
>>
>> One caveat -- Accumulo will be protected at the cell level, but if you
>> need
>> your Solr search results to be protected by complex authorization strings
>> similar to Accumulo, you will need to write your own QParserPlugin and use
>> post filtering:
>> http://java.dzone.com/articles/custom-security-filtering-solr
>>
>> The code you see in that article is written for an earlier version of
>> Solr,
>> but it's not too difficult to adjust it for the latest (we've done so in
>> our project).  Once you've implemented this, you would store an
>> "authorizations" string field in each Solr document, and pass in the
>> authorizations that the user has access to in the fq parameter of every
>> query.  It's also not too bad to write something that parses the Accumulo
>> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in
>> the QParserPlugin.
>>
>> This will give you true row level security in Solr and Accumulo, and it
>> performs quite well in Solr.
>>
>> Let me know if you have any other questions.
>>
>> Joe
>>
>>
>> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian 
>> wrote:
>>
>> > Dear Joe,
>> > Hi,
>> > I am going to store the crawled web pages in Accumulo as the main
>> > storage part of my project, and I need to give this data to Solr for
>> > indexing and user searches. I need to do some social and web analysis
>> > on my data, as well as have some security features. Therefore Accumulo
>> > is my choice for the database part, and for indexing and search I am
>> > going to use Solr. Would you please guide me through that?
>> >
>> >
>> >
>> > On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock 
>> wrote:
>> >
>> > > We store data in both Solr and Accumulo -- do you have more details
>> > > about what kind of data and indexing you want?  Is there a reason
>> > > you're thinking of using both databases in particular?
>> > >
>> > >
>> > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
>> > > wrote:
>> > >
>> > > > Dear All,
>> > > > Hi,
>> > > > I was wondering, is there anybody out there who has tried to
>> > > > integrate Solr with Accumulo? I was thinking about using Accumulo
>> > > > on top of HDFS and using Solr to index the data inside Accumulo.
>> > > > Do you have any idea how I can do such an integration?
>> > > >
>> > > > Best regards.
>> > > >
>> > > > --
>> > > > A.Nazemian
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > I know what it is to be in need, and I know what it is to have
>> > > plenty.  I have learned the secret of being content in any and every
>> > > situation, whether well fed or hungry, whether living in plenty or
>> > > in want.  I can do all this through him who gives me strength.
>> > > *-Philippians 4:12-13*
>> > >
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>> >
>>
>>
>>
>> --
>> I know what it is to be in need, and I know what it is to have plenty.  I
>> have learned the secret of being content in any and every situation,
>> whether well fed or hungry, whether living in plenty or in want.  I can do
>> all this through him who gives me strength.*-Philippians 4:12-13*
>>
>>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian


Re: Shuffle results a little

2014-07-24 Thread babenis
Were you ever able to figure out a way to do this "shuffling" of the result
set?

I'm looking to do the same thing, but shuffle results not only by brand but
also by child categories, mainly because we have very dominant categories
and would still like products from other categories to be visible
throughout the site without explicit searches; otherwise it seems like we
specialize in only a couple of categories.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Shuffle-results-a-little-tp1891206p4149091.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Shuffling results

2014-07-24 Thread babenis
Could you possibly elaborate on what that function could look like and how
to use it?

I have an ecommerce site with lots of products, and some categories have 50
times more products than others. I would like to "shuffle" the result set
so that, if the search is conducted by parent category id, all child
categories found in the result set are shuffled: some categories get a
little more of the results and others a little less, and products from the
same brand are scattered across the result set, while still being able to
use pagination (i.e. 12 results per page; pages 1, 2 and 3 should have
different products, etc.).

Ideally I'd like to use product tags as well, to separate similar products
from each other even further.

Thank you so much for any help



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Shuffling-results-tp497372p4149092.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: To warm the whole cache of Solr other than the only autowarmcount

2014-07-24 Thread Matt Kuiper (Springblox)
I don't believe this would work.  My understanding (please correct if I have 
this wrong) is that the underlying Lucene document ids have a potential to 
change and so when a newSearcher is created the caches must be regenerated and 
not copied.
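For reference, autowarming is configured per cache in solrconfig.xml along
these lines (values purely illustrative); autowarmCount is the number of
entries from the old cache whose keys are re-executed against the new
searcher to seed its replacement:

<filterCache class="solr.FastLRUCache"
             size="4096"
             initialSize="1024"
             autowarmCount="256"/>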

Matt

-Original Message-
From: YouPeng Yang [mailto:yypvsxf19870...@gmail.com] 
Sent: Thursday, July 24, 2014 10:26 AM
To: solr-user@lucene.apache.org
Subject: To warm the whole cache of Solr other than the only autowarmcount

Hi

   I think it is wonderful to have caches autowarmed when a commit or soft
commit happens. However, if I want to warm the whole cache rather than only
autowarmcount entries, the default autowarming operation will take a very
long time. So maybe it would be a good idea to just swap references: point
the caches of the new searcher at the caches of the old searcher. That
would improve the autowarming time and also the NRT query time.
  It is just not a mature idea. I am posting it in the hope of getting more
hints or help to make the idea clearer.



regards


Re: spatial search: find result in bbox OR first result outside bbox

2014-07-24 Thread david.w.smi...@gmail.com
Hi Elisabeth,

Sorry for not responding sooner; I forgot.

You’re in need of some spatial nearest-neighbor code I wrote but it isn’t
open-sourced yet.  It works on the RPT grid.

Anyway, you should consider doing this in two searches: the first query
tries the bbox provided, and if that returns nothing then issue a second
for the closest result within a 1000km distance.  The first query is
straightforward, as documented.  The second would be close to what you gave
in your example, but sorted by distance and returning rows=1.  It will *not*
compute the distance to every document, just those within the 1000km radius
plus some internal grid squares *if* you use spatial RPT (“location_rpt”
in the example schema).  But use LatLonType for optimal sorting
performance, not RPT.
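Concretely, the two requests might look something like this (field name
taken from your example; bbox coordinates made up):

1) fq=store:[44.9,-94.1 TO 45.4,-93.6]
2) fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=1000&sort=geodist() asc&rows=1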

With respect to doing this in one search vs two, that would involve writing
a custom request handler.  I have a patch to make this easier:
https://issues.apache.org/jira/browse/SOLR-5005.  If in your case there are
absolutely no other filters and it’s not a distributed search (no
sharding), then you could approach this with a custom query parser that
generates and executes one query to know if it should return that query or
return the fallback.

Please let me know how this goes.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Jul 22, 2014 at 3:12 AM, elisabeth benoit  wrote:

> Hello,
>
> I am using solr 4.2.1. I have the following use case.
>
> I should find results inside bbox OR if there is none, first result outside
> bbox within a 1000 km distance. I was wondering what is the best way to
> proceed.
>
> I was considering doing a geofilt search from the center of my bounding box
> and post filtering results.
>
> fq={!geofilt sfield=store}&pt=45.15,-93.85&d=1000
>
> From a performance point of view I don't think it's a good solution though,
> since solr will have to calculate every document distance, then sort.
>
> I was wondering if there was another way to do this and avoid sending more
> than one request to solr.
>
> Thanks,
> Elisabeth
>


Re: How to migrate content of a collection to a new collection

2014-07-24 Thread Chris Hostetter

: I tried this "poor mans" cursor approach out ad-hoc, but I get OOM. Pretty
: sure this is because you need all uniqueKey-values in FieldCache in order to
: be able to sort on it. We do not have memory for that - and never will. Our
: uniqueKey field is not DocValue.
: Just out of curiosity
: * Will I have the same OOM problem using the CURSOR-feature in later Solrs?
: * Will the "poor mans" cursor approach still be efficient if my uniqueKey was
: DocValued, knowing that all values for uniqueKey (the DocValue file) cannot
: fit in memory (OS file cache)?

If you are getting an OOM from the sort because of the FieldCache size,
then it doesn't matter if you use cursorMark or the poor man's approach
mentioned: neither approach ever has a chance if you can't sort the docs
in the first place.

If DocValues on your uniqueKey field let you sort on it w/o OOM (by
leveraging the OS's FS cache) then the cursorMark and the poor man's
cursor using fq should both be equally efficient.
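(For reference, the "poor man's" cursor discussed above is just sorting on
the uniqueKey and filtering past the last id already seen, roughly:

q=*:*&rows=1000&sort=id asc&fq=id:{LAST_ID_SEEN TO *]

where LAST_ID_SEEN is a placeholder for the last id of the previous page.)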

Bottom line: it's not the cursor or the fq that's ever going to be a 
performance problem for you in terms of using RAM -- it's just a question 
of sorting.


-Hoss
http://www.lucidworks.com/


Multipart documents with different update cycles

2014-07-24 Thread Aurélien MAZOYER

Hello,

I have to index a dataset containing multipart documents. The "main" part
and the "user metadata" part have different update cycles: we want to
update the "user metadata" part frequently without having to refetch the
main part from the datasource, and without storing every field just to be
able to use atomic updates. As there is no true field-level update in Solr
yet, I am afraid I will have to build an index for each part and perform a
query-time join, with all the well-known performance limitations. I have
also heard of the sidecar index. Is it a solution that can meet my
requirements? Is it stable enough to be used in production? Does the
community plan to make it part of the trunk code?


Thanks,

Aurelien



To warm the whole cache of Solr other than the only autowarmcount

2014-07-24 Thread YouPeng Yang
Hi

   I think it is wonderful to have caches autowarmed when a commit or soft
commit happens. However, if I want to warm the whole cache rather than only
autowarmcount entries, the default autowarming operation will take a very
long time. So maybe it would be a good idea to just swap references: point
the caches of the new searcher at the caches of the old searcher. That
would improve the autowarming time and also the NRT query time.
  It is just not a mature idea. I am posting it in the hope of getting more
hints or help to make the idea clearer.



regards


Re: Need a tip, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Jack Krupansky
And I should have added that the advantage of the word break approach is 
that it automatically handles both splitting and combining words, all based 
on the index, with no need to mess with creating synonyms.
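For reference, the word break checker is wired up in solrconfig.xml roughly
like this (the field name is whatever you check against):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">name</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">3</int>
  </lst>
</searchComponent>

breakWords=true is what turns a query term like "telaviv" into the
suggestion "tel aviv" when those words exist in the indexed field.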


Also, there is a dictionary-based filter called
DictionaryCompoundWordTokenFilterFactory which can split combined terms,
but you do have to explicitly put at least some of the word parts in a
dictionary file. Again, there are examples in my e-book.


It would be nice to have a dynamic, index-based filter at query time to 
automatically (but optionally) do the expansion/compression.


-- Jack Krupansky

-Original Message- 
From: Sven Schönfeldt

Sent: Thursday, July 24, 2014 8:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Need a tip, how to find documents where content is "tel aviv"
but user query is "telaviv"?


Thanks!

That's my core problem: to let Solr search a bit like GSA :-)


Greetz

Am 24.07.2014 um 14:27 schrieb Jack Krupansky :

Google handles this type of word concatenation quite well... but Solr does
not out of the box, at least not automatically. Solr does have a word break
spell checker:


https://cwiki.apache.org/confluence/display/solr/Spell+Checking

And described in more detail, with examples in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You could at least use this feature to implement a "did you mean..." UI 
for your search app - show the user actual results but also a proposed 
query with the words broken apart.


-- Jack Krupansky

-Original Message- From: Sven Schönfeldt
Sent: Thursday, July 24, 2014 4:07 AM
To: solr-user@lucene.apache.org
Subject: Need a tip, how to find documents where content is "tel aviv"
but user query is "telaviv"?


Hi Solr-Users,

What is the best way to find documents where the user writes a word
incorrectly in the query?

For example, the user searches for „telaviv“. The search result should also
include documents where the content is „tel aviv“.

Any tip, or keywords for how to do that kind of query?

regards, Sven




SolrJ POJO Annotations

2014-07-24 Thread David Philip
Hi,

   This question is about indexing a SolrJ document as a bean. I have an
entity that has another entity within it. Could you please tell me how to
annotate the inner entity? The issue I am facing is that the inner entity's
fields are missing when indexing. In the example below, it is just adding
the Content fields and missing the author name and id.


Example:  "Content" is one class that has "Author" as its has-a
relationship entity.

class Content{

@Field("uniqueId")
String id;

@Field("timeStamp")
Long timeStamp;

//What should be the annotation type for this entity?
Author author;
}

class Author{
@Field("authorName")
String authorName;

@Field("authorId")
String id;

}


My schema xml is:

<field name="uniqueId" type="string" indexed="true" stored="true" required="true"/>
<field name="timeStamp" type="long" indexed="true" stored="true"/>
<field name="authorName" type="string" indexed="true" stored="true"/>
<field name="authorId" type="string" indexed="true" stored="true"/>

Thank you. - David
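
As far as I can tell, SolrJ's DocumentObjectBinder only maps members of the
bean that are themselves annotated with @Field; it does not descend into
nested beans such as the Author above. A common workaround (a sketch only,
not the only option) is to flatten the child entity's fields onto the
parent:

class Content {

    @Field("uniqueId")
    String id;

    @Field("timeStamp")
    Long timeStamp;

    // Flattened from Author: copy these from the Author instance before
    // calling server.addBean(content), since the binder will not reach
    // into a nested Author object.
    @Field("authorName")
    String authorName;

    @Field("authorId")
    String authorId;
}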


Get Data under Highlight Json value pair

2014-07-24 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)

I am trying to get the content under the highlighting JSON string, but I am
not able to map the values, as the highlighting has "" for the keys. E.g.
below.

How can I get the value? Is there any option in the query syntax? Currently
I use &h1.on and &h1.fl=

"highlighting":{
"":{
  "CatalogSearch_en-US":["VCM"],
  "Name_en-US":["VCM TO LAPTOP CABLE"],
  "Description_en-US":[".\nVCM (Vehicle Communication 
Module) / VMM (Vehicle Measurement Module) to Laptop 
Cable.\n\nPrevious part"]},
"":{
  "CatalogSearch_en-US":["VCM II"],
  "Name_en-US":["VCM II DLC CABLE"],
  "Description_en-US":[".\nVCM II DLC cable"]},
"":{
  "CatalogSearch_en-US":["VCM"],
  "Name_en-US":["8' DLC TO VCM I CABLE"],
  "Description_en-US":["8' DLC to VCM I 
cable."]},

Thanks

Ravi


Re: integrating Accumulo with solr

2014-07-24 Thread Erik Hatcher
Just FYI, the blog Joe mentioned below (authored by me) has been adjusted to 
Solr 4.x in the original blog location here:

   

Erik

On Jul 24, 2014, at 8:03 AM, Joe Gresock  wrote:

> Ali,
> 
> Sounds like a good choice.  It's pretty standard to store the primary
> storage id as a field in Solr so that you can search the full text in Solr
> and then retrieve the full document elsewhere.
> 
> I would recommend creating a document structure in Solr with whatever
> fields you want indexed (most likely as text_en, etc.), and then store a
> "string" field named "content_id", which would be the Accumulo row id that
> you look up with a scan.
> 
> One caveat -- Accumulo will be protected at the cell level, but if you need
> your Solr search results to be protected by complex authorization strings
> similar to Accumulo, you will need to write your own QParserPlugin and use
> post filtering:
> http://java.dzone.com/articles/custom-security-filtering-solr
> 
> The code you see in that article is written for an earlier version of Solr,
> but it's not too difficult to adjust it for the latest (we've done so in
> our project).  Once you've implemented this, you would store an
> "authorizations" string field in each Solr document, and pass in the
> authorizations that the user has access to in the fq parameter of every
> query.  It's also not too bad to write something that parses the Accumulo
> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in
> the QParserPlugin.
> 
> This will give you true row level security in Solr and Accumulo, and it
> performs quite well in Solr.
> 
> Let me know if you have any other questions.
> 
> Joe
> 
> 
> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian  wrote:
> 
>> Dear Joe,
>> Hi,
>> I am going to store the crawled web pages in Accumulo as the main storage
>> part of my project, and I need to give this data to Solr for indexing and
>> user searches. I need to do some social and web analysis on my data as
>> well as have some security features. Therefore Accumulo is my choice for
>> the database part, and for indexing and search I am going to use Solr.
>> Would you please guide me through that?
>> 
>> 
>> 
>> On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock  wrote:
>> 
>>> We store data in both Solr and Accumulo -- do you have more details about
>>> what kind of data and indexing you want?  Is there a reason you're
>> thinking
>>> of using both databases in particular?
>>> 
>>> 
>>> On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
>>> wrote:
>>> 
>>>> Dear All,
>>>> Hi,
>>>> I was wondering, is there anybody out there who has tried to integrate
>>>> Solr with Accumulo? I was thinking about using Accumulo on top of HDFS
>>>> and using Solr to index the data inside Accumulo. Do you have any idea
>>>> how I can do such an integration?
>>>>
>>>> Best regards.
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>> 
>>> 
>>> 
>>> --
>>> I know what it is to be in need, and I know what it is to have plenty.  I
>>> have learned the secret of being content in any and every situation,
>>> whether well fed or hungry, whether living in plenty or in want.  I can
>> do
>>> all this through him who gives me strength.*-Philippians 4:12-13*
>>> 
>> 
>> 
>> 
>> --
>> A.Nazemian
>> 
> 
> 
> 
> -- 
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.*-Philippians 4:12-13*
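
As a sketch of the authorizations-string parsing Joe describes above (plain
Java; a simplified grammar where '&' and '|' only nest via parentheses, and
all names are made up):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Evaluates an Accumulo-style visibility string such as "A&B&(C|D|E|F)"
// against the set of authorizations a user holds.
public class VisibilityEval {

    private final String s;
    private final Set<String> auths;
    private int pos = 0;

    public VisibilityEval(String expr, Set<String> userAuths) {
        this.s = expr;
        this.auths = userAuths;
    }

    public boolean eval() { return andExpr(); }

    private boolean andExpr() {  // term ('&' term)*
        boolean r = orExpr();
        // '&=' (not '&&') so the right side is always parsed and consumed
        while (peek('&')) { pos++; r &= orExpr(); }
        return r;
    }

    private boolean orExpr() {   // atom ('|' atom)*
        boolean r = atom();
        while (peek('|')) { pos++; r |= atom(); }
        return r;
    }

    private boolean atom() {     // token | '(' expression ')'
        if (peek('(')) { pos++; boolean r = andExpr(); pos++; return r; }
        int start = pos;
        while (pos < s.length() && "&|()".indexOf(s.charAt(pos)) < 0) pos++;
        return auths.contains(s.substring(start, pos));
    }

    private boolean peek(char c) { return pos < s.length() && s.charAt(pos) == c; }

    public static void main(String[] args) {
        Set<String> userAuths = new HashSet<>(Arrays.asList("A", "B", "D"));
        // Prints true: the user has A and B, and D satisfies (C|D|E|F)
        System.out.println(new VisibilityEval("A&B&(C|D|E|F)", userAuths).eval());
    }
}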



Re: integrating Accumulo with solr

2014-07-24 Thread Jack Krupansky
If you are not a "true hard-core gunslinger" who is willing to dive in and 
integrate the code yourself, instead you should give serious consideration 
to a product such as DataStax Enterprise that fully integrates and packages 
a NoSQL database (Cassandra) and Solr for search. The security aspects are 
still a work in progress, but certainly headed in the right direction. And 
it has Hadoop and Spark integration as well.


See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

-- Jack Krupansky

-Original Message- 
From: Ali Nazemian

Sent: Thursday, July 24, 2014 10:30 AM
To: solr-user@lucene.apache.org
Subject: Re: integrating Accumulo with solr

Thank you very much. Nice idea, but how can Solr and Accumulo be
synchronized in this way?
I know that Solr can be integrated with HDFS, and Accumulo also works on
top of HDFS. So can I use HDFS as the integration point? I mean, set Solr
to use HDFS as a source of documents as well as the destination of
documents.
Regards.


On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock  wrote:


Ali,

Sounds like a good choice.  It's pretty standard to store the primary
storage id as a field in Solr so that you can search the full text in Solr
and then retrieve the full document elsewhere.

I would recommend creating a document structure in Solr with whatever
fields you want indexed (most likely as text_en, etc.), and then store a
"string" field named "content_id", which would be the Accumulo row id that
you look up with a scan.

One caveat -- Accumulo will be protected at the cell level, but if you need
your Solr search results to be protected by complex authorization strings
similar to Accumulo, you will need to write your own QParserPlugin and use
post filtering:
http://java.dzone.com/articles/custom-security-filtering-solr

The code you see in that article is written for an earlier version of Solr,
but it's not too difficult to adjust it for the latest (we've done so in
our project).  Once you've implemented this, you would store an
"authorizations" string field in each Solr document, and pass in the
authorizations that the user has access to in the fq parameter of every
query.  It's also not too bad to write something that parses the Accumulo
authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in
the QParserPlugin.

This will give you true row level security in Solr and Accumulo, and it
performs quite well in Solr.

Let me know if you have any other questions.

Joe


On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian 
wrote:

> Dear Joe,
> Hi,
> I am going to store the crawled web pages in Accumulo as the main
> storage part of my project, and I need to give this data to Solr for
> indexing and user searches. I need to do some social and web analysis on
> my data, as well as have some security features. Therefore Accumulo is my
> choice for the database part, and for indexing and search I am going to
> use Solr. Would you please guide me through that?
>
>
>
> On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock  wrote:
>
> > We store data in both Solr and Accumulo -- do you have more details
> > about what kind of data and indexing you want?  Is there a reason
> > you're thinking of using both databases in particular?
> >
> >
> > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
> > wrote:
> >
> > > Dear All,
> > > Hi,
> > > I was wondering, is there anybody out there who has tried to
> > > integrate Solr with Accumulo? I was thinking about using Accumulo on
> > > top of HDFS and using Solr to index the data inside Accumulo. Do you
> > > have any idea how I can do such an integration?
> > >
> > > Best regards.
> > >
> > > --
> > > A.Nazemian
> > >
> >
> >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty.
> > I have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want.  I can
> > do all this through him who gives me strength. *-Philippians 4:12-13*
> >
>
>
>
> --
> A.Nazemian
>



--
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.*-Philippians 4:12-13*





--
A.Nazemian 



Re: Java heap space error

2014-07-24 Thread Boon Low
How about simply increasing the heap size if RAM is available? You should
also check the update handler config, e.g. autoCommit: if docs aren’t being
written to disk, they will be hanging around in memory. Check the
“openSearcher” setting too, as opening new searchers consumes memory,
especially if expensive warm-up requests are configured.
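A typical updateHandler stanza in solrconfig.xml, with illustrative values:

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

openSearcher=false flushes pending documents to disk without paying the
cost of opening (and warming) a new searcher on every commit.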

Set up some graphs monitoring the JVM heap, indexing rate, pending docs, etc.

Boon

---
Boon Low
Lead Big Data / Search Engineer
DCT Family History


On 24 Jul 2014, at 15:12, François Schiettecatte
<fschietteca...@gmail.com> wrote:

A default garbage collector will be chosen for you by the VM; it might help
to get the stack trace to look at.

François

On Jul 24, 2014, at 10:06 AM, Ameya Aware <ameya.aw...@gmail.com> wrote:

Ooh, OK.

So you are saying that since I am using a large heap but didn't set my
garbage collection, that's why I am getting the Java heap space error?





On Thu, Jul 24, 2014 at 9:58 AM, Marcello Lorenzi <mlore...@sorint.it>
wrote:

I think that with a large heap it is suggested to monitor the garbage
collection behavior and try to adopt a strategy adapted to your workload.  On my
production environment with a heap of 6 GB I set this parameter (server
with 8 cores):

-server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m
-Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
-XX:ConcGCThreads=6 -XX:ParallelGCThreads=6

Marcello


On 07/24/2014 03:53 PM, Ameya Aware wrote:

I did not make any other change than this.. rest of the settings are
default.

Do i need to set garbage collection strategy?


On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi <mlore...@sorint.it> wrote:

Hi,
Did you set a Garbage collection strategy on your JVM ?

Marcello


On 07/24/2014 03:32 PM, Ameya Aware wrote:

Hi

I am in process of indexing around 2,00,000 documents.

I have increase java jeap space to 4 GB using below command :

java -Xmx4096M -Xms4096M -jar start.jar

Still after indexing around 15000 documents it gives java heap space
error
again.


Any fix for this?

Thanks,
Ameya









Re: integrating Accumulo with solr

2014-07-24 Thread Ali Nazemian
Thank you very much. Nice idea, but how can Solr and Accumulo be
synchronized in this way?
I know that Solr can be integrated with HDFS, and Accumulo also works on
top of HDFS. So can I use HDFS as the integration point? I mean, set Solr to
use HDFS as the source of documents as well as the destination of documents.
Regards.


On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock  wrote:

> Ali,
>
> Sounds like a good choice.  It's pretty standard to store the primary
> storage id as a field in Solr so that you can search the full text in Solr
> and then retrieve the full document elsewhere.
>
> I would recommend creating a document structure in Solr with whatever
> fields you want indexed (most likely as text_en, etc.), and then store a
> "string" field named "content_id", which would be the Accumulo row id that
> you look up with a scan.
>
> One caveat -- Accumulo will be protected at the cell level, but if you need
> your Solr search results to be protected by complex authorization strings
> similar to Accumulo, you will need to write your own QParserPlugin and use
> post filtering:
> http://java.dzone.com/articles/custom-security-filtering-solr
>
> The code you see in that article is written for an earlier version of Solr,
> but it's not too difficult to adjust it for the latest (we've done so in
> our project).  Once you've implemented this, you would store an
> "authorizations" string field in each Solr document, and pass in the
> authorizations that the user has access to in the fq parameter of every
> query.  It's also not too bad to write something that parses the Accumulo
> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in
> the QParserPlugin.
>
> This will give you true row level security in Solr and Accumulo, and it
> performs quite well in Solr.
>
> Let me know if you have any other questions.
>
> Joe
>
>
> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian 
> wrote:
>
> > Dear Joe,
> > Hi,
> > I am going to store the crawl web pages in accumulo as the main storage
> > part of my project and I need to give these data to solr for indexing and
> > user searches. I need to do some social and web analysis on my data as
> well
> > as having some security features. Therefore accumulo is my choice for the
> > database part and for index and search I am going to use Solr. Would you
> > please guide me through that?
> >
> >
> >
> > On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock  wrote:
> >
> > > We store data in both Solr and Accumulo -- do you have more details
> about
> > > what kind of data and indexing you want?  Is there a reason you're
> > thinking
> > > of using both databases in particular?
> > >
> > >
> > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
> > > wrote:
> > >
> > > > Dear All,
> > > > Hi,
> > > > I was wondering is there anybody out there that tried to integrate
> Solr
> > > > with Accumulo? I was thinking about using Accumulo on top of HDFS and
> > > using
> > > > Solr to index data inside Accumulo? Do you have any idea how can I do
> > > such
> > > > integration?
> > > >
> > > > Best regards.
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> > >
> > >
> > > --
> > > I know what it is to be in need, and I know what it is to have plenty.
>  I
> > > have learned the secret of being content in any and every situation,
> > > whether well fed or hungry, whether living in plenty or in want.  I can
> > do
> > > all this through him who gives me strength.*-Philippians 4:12-13*
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.*-Philippians 4:12-13*
>



-- 
A.Nazemian


Re: Java heap space error

2014-07-24 Thread François Schiettecatte
A default garbage collector will be chosen for you by the VM. It might help to 
get the stack trace to look at.

François

On Jul 24, 2014, at 10:06 AM, Ameya Aware  wrote:

> ooh ok.
> 
> So you want to say that since i am using large heap but didnt set my
> garbage collection, thats why i why getting java heap space error?
> 
> 
> 
> 
> 
> On Thu, Jul 24, 2014 at 9:58 AM, Marcello Lorenzi 
> wrote:
> 
>> I think that on large heap is suggested to monitor the garbage collection
>> behavior and try to add a strategy adapted to your performance.  On my
>> production environment with a heap of 6 GB I set this parameter (server
>> with 8 cores):
>> 
>> -server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m
>> -Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled
>> -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
>> -XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
>> 
>> Marcello
>> 
>> 
>> On 07/24/2014 03:53 PM, Ameya Aware wrote:
>> 
>> I did not make any other change than this.. rest of the settings are
>> default.
>> 
>> Do i need to set garbage collection strategy?
>> 
>> 
>> On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi 
>> wrote:
>> 
>>> Hi,
>>> Did you set a Garbage collection strategy on your JVM ?
>>> 
>>> Marcello
>>> 
>>> 
>>> On 07/24/2014 03:32 PM, Ameya Aware wrote:
>>> 
 Hi
 
 I am in process of indexing around 2,00,000 documents.
 
 I have increase java jeap space to 4 GB using below command :
 
 java -Xmx4096M -Xms4096M -jar start.jar
 
 Still after indexing around 15000 documents it gives java heap space
 error
 again.
 
 
 Any fix for this?
 
 Thanks,
 Ameya
 
 
>>> 
>> 
>> 



Re: Java heap space error

2014-07-24 Thread Ameya Aware
Ooh, ok.

So you are saying that since I am using a large heap but didn't configure
garbage collection, that's why I am getting the Java heap space error?





On Thu, Jul 24, 2014 at 9:58 AM, Marcello Lorenzi 
wrote:

>  I think that on large heap is suggested to monitor the garbage collection
> behavior and try to add a strategy adapted to your performance.  On my
> production environment with a heap of 6 GB I set this parameter (server
> with 8 cores):
>
> -server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m
> -Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled
> -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
> -XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
>
> Marcello
>
>
> On 07/24/2014 03:53 PM, Ameya Aware wrote:
>
> I did not make any other change than this.. rest of the settings are
> default.
>
>  Do i need to set garbage collection strategy?
>
>
>  On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi 
> wrote:
>
>> Hi,
>> Did you set a Garbage collection strategy on your JVM ?
>>
>> Marcello
>>
>>
>> On 07/24/2014 03:32 PM, Ameya Aware wrote:
>>
>>> Hi
>>>
>>> I am in process of indexing around 2,00,000 documents.
>>>
>>> I have increase java jeap space to 4 GB using below command :
>>>
>>> java -Xmx4096M -Xms4096M -jar start.jar
>>>
>>> Still after indexing around 15000 documents it gives java heap space
>>> error
>>> again.
>>>
>>>
>>> Any fix for this?
>>>
>>> Thanks,
>>> Ameya
>>>
>>>
>>
>
>


Re: Java heap space error

2014-07-24 Thread Marcello Lorenzi
I think that with a large heap it is suggested to monitor the garbage 
collection behavior and add a strategy adapted to your 
performance needs.  On my production environment with a heap of 6 GB I set 
these parameters (server with 8 cores):


-server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m 
-Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 
-XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
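
Applied to the start.jar launch from earlier in this thread, a minimal version 
might look like this (just a sketch; occupancy fraction and thread counts need 
tuning for your own hardware):

java -server -Xms4096M -Xmx4096M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -jar start.jar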


Marcello

On 07/24/2014 03:53 PM, Ameya Aware wrote:
I did not make any other change than this.. rest of the settings are 
default.


Do i need to set garbage collection strategy?


On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi wrote:


Hi,
Did you set a Garbage collection strategy on your JVM ?

Marcello


On 07/24/2014 03:32 PM, Ameya Aware wrote:

Hi

I am in process of indexing around 2,00,000 documents.

I have increase java jeap space to 4 GB using below command :

java -Xmx4096M -Xms4096M -jar start.jar

Still after indexing around 15000 documents it gives java heap
space error
again.


Any fix for this?

Thanks,
Ameya







Re: Java heap space error

2014-07-24 Thread Ameya Aware
I did not make any change other than this; the rest of the settings are
default.

Do I need to set a garbage collection strategy?


On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi 
wrote:

> Hi,
> Did you set a Garbage collection strategy on your JVM ?
>
> Marcello
>
>
> On 07/24/2014 03:32 PM, Ameya Aware wrote:
>
>> Hi
>>
>> I am in process of indexing around 2,00,000 documents.
>>
>> I have increase java jeap space to 4 GB using below command :
>>
>> java -Xmx4096M -Xms4096M -jar start.jar
>>
>> Still after indexing around 15000 documents it gives java heap space error
>> again.
>>
>>
>> Any fix for this?
>>
>> Thanks,
>> Ameya
>>
>>
>


Re: Java heap space error

2014-07-24 Thread Marcello Lorenzi

Hi,
Did you set a garbage collection strategy on your JVM?

Marcello

On 07/24/2014 03:32 PM, Ameya Aware wrote:

Hi

I am in process of indexing around 2,00,000 documents.

I have increase java jeap space to 4 GB using below command :

java -Xmx4096M -Xms4096M -jar start.jar

Still after indexing around 15000 documents it gives java heap space error
again.


Any fix for this?

Thanks,
Ameya





Re: Passivate core in Solr Cloud

2014-07-24 Thread Aurélien MAZOYER
Thank you Erick and Alex for your answers. The LotsOfCores stuff seems to 
meet my requirement, but it is a problem if it does not work with Solr 
Cloud. Is there an issue open for this problem?
If I understand well, the only solution for me is to use multiple 
standalone instances of Solr using transient cores and to distribute 
the cores for my tenants manually (I assume the LRU mechanism will be less 
effective as it will be done per Solr instance).
When you say "does NOT play nice with distributed mode", does it also 
include the standard replication mechanism?


Thanks,

Regards,

Aurelien



On 23/07/2014 17:21, Erick Erickson wrote:

Do note that the LotsOfCores stuff does NOT play nice in
distributed mode (yet).

Best,
Erick


On Wed, Jul 23, 2014 at 6:00 AM, Alexandre Rafalovitch wrote:


Solr has some support for large number of cores, including transient
cores: http://wiki.apache.org/solr/LotsOfCores

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Wed, Jul 23, 2014 at 7:55 PM, Aurélien MAZOYER wrote:

Hello,

We want to setup a Solr Cloud cluster in order to handle a high volume of
documents with a multi-tenant architecture. The problem is that an
application-level isolation for a tenant (using a mutual index with a field
"customer") is not enough to fit our requirements. As a result, we need 1
collection/customer. There are more than a thousand customers, and it seems
unreasonable to create thousands of collections in Solr Cloud... But as we
know that there is less than 1 query/customer/day, we are currently looking
for a way to passivate collections when they are not in use. Can it be a
good idea? If yes, are there best practices to implement this? What side
effects can we expect? Do we need to put some application-level logic on top
of the Solr Cloud cluster to choose which collection we have to unload (and
maybe there is something smarter (and quicker?) than simply loading/unloading
the core when it is not in use?)?

Thank you for your answer(s),

Aurelien





Re: how to fully test a response writer

2014-07-24 Thread Mikhail Khludnev
Hello,

I think you can check TestDistributedSearch or other descendants of
BaseDistributedSearchTestCase (I don't think you need to look at *Zk*
tests).


On Wed, Jul 23, 2014 at 5:03 PM, Matteo Grolla wrote:

> Hi,
> I developed a new SolResponseWriter but I'm not happy with how I
> wrote tests.
> My problem is that I need to test it either with local request and with
> distributed request since the solr response object (input to the response
> writer) are different.
> a) I tested the local request case
> using SolrTestCaseJ4
> b) tested the distributed request case
> using a junit test case and making rest calls to
> the alias coll12 associated to
> a couple of solrcloud  collection configured with my
> custom response writer
>
> the problem with b) is that it requires a manual setup on every machine
> where I want to run the tests.
>
> thanks
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Java heap space error

2014-07-24 Thread Ameya Aware
Hi

I am in the process of indexing around 200,000 documents.

I have increased the Java heap space to 4 GB using the command below:

java -Xmx4096M -Xms4096M -jar start.jar

Still, after indexing around 15,000 documents it gives a Java heap space error
again.


Any fix for this?

Thanks,
Ameya


Re: Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Sven Schönfeldt
Thanks!

That's my core problem: getting Solr to search a bit like GSA :-)


Greetz

On 24.07.2014 at 14:27, Jack Krupansky wrote:

> Google handles this type of word concatenation quite well... but Solr does 
> not out of the box, at least in terms of automatically. Solr does have a word 
> break spell checker:
> 
> https://cwiki.apache.org/confluence/display/solr/Spell+Checking
> 
> And described in more detail, with examples in my e-book:
> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
> 
> You could at least use this feature to implement a "did you mean..." UI for 
> your search app - show the user actual results but also a proposed query with 
> the words broken apart.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Sven Schönfeldt
> Sent: Thursday, July 24, 2014 4:07 AM
> To: solr-user@lucene.apache.org
> Subject: Need a tipp, how to find documents where content is "tel aviv" but 
> user query is "telaviv"?
> 
> Hi Solr-Users,
> 
> what is the best way to find documents, where the user write a wrong word in 
> query.
> 
> For example the user search for „telaviv“. the search result should also 
> include documents where content is „tel aviv“.
> 
> any tipp, or keywords how to do that kind of queries?
> 
> regards, Sven



Re: Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Jack Krupansky
Google handles this type of word concatenation quite well... but Solr does 
not out of the box, at least not automatically. Solr does have a 
word break spell checker:


https://cwiki.apache.org/confluence/display/solr/Spell+Checking

And described in more detail, with examples in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You could at least use this feature to implement a "did you mean..." UI for 
your search app - show the user actual results but also a proposed query 
with the words broken apart.
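
For what it's worth, a minimal wordbreak spell checker definition in 
solrconfig.xml looks something like this (the field name "content" is an 
assumption; use whatever field you check against):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">content</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">2</int>
  </lst>
</searchComponent>

Querying with spellcheck=true&spellcheck.dictionary=wordbreak should then 
suggest "tel aviv" for "telaviv".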


-- Jack Krupansky

-Original Message- 
From: Sven Schönfeldt

Sent: Thursday, July 24, 2014 4:07 AM
To: solr-user@lucene.apache.org
Subject: Need a tipp, how to find documents where content is "tel aviv" but 
user query is "telaviv"?


Hi Solr-Users,

what is the best way to find documents, where the user write a wrong word in 
query.


For example the user search for „telaviv“. the search result should also 
include documents where content is „tel aviv“.


any tipp, or keywords how to do that kind of queries?

regards, Sven



Re: Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Sven Schönfeldt
Thank You Alex!

On 24.07.2014 at 11:08, Alexandre Rafalovitch wrote:

> You can put the SynonymFilterFactory at query time as well. But it's
> less reliable. Especially if the text is "tel aviv" and the query is
> telaviv, you need to make sure to enable auto phrase search as well.
> 
> Regards,
>   Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
> 
> 
> On Thu, Jul 24, 2014 at 3:31 PM, Sven Schönfeldt
>  wrote:
>> So i will need SynonymFilterFactory at indexing, or? Any chance to get it 
>> work by query time?
>> 
>> 
>> Am 24.07.2014 um 10:24 schrieb Alexandre Rafalovitch :
>> 
>>> How often does this happen? Could use synonyms if not too many.
>>> On 24/07/2014 3:08 pm, "Sven Schönfeldt"  wrote:
>>> 
 Hi Solr-Users,
 
 what is the best way to find documents, where the user write a wrong word
 in query.
 
 For example the user search for „telaviv“. the search result should also
 include documents where content is „tel aviv“.
 
 any tipp, or keywords how to do that kind of queries?
 
 regards, Sven
>> 



Auto Suggest

2014-07-24 Thread benjelloun
Hello,

Does solr.SuggestComponent work on a multiValued field to autosuggest not only
one word but the whole sentence?



Regards,
Anass BENJELLOUN



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-Suggest-tp4149004.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: integrating Accumulo with solr

2014-07-24 Thread Joe Gresock
Ali,

Sounds like a good choice.  It's pretty standard to store the primary
storage id as a field in Solr so that you can search the full text in Solr
and then retrieve the full document elsewhere.

I would recommend creating a document structure in Solr with whatever
fields you want indexed (most likely as text_en, etc.), and then store a
"string" field named "content_id", which would be the Accumulo row id that
you look up with a scan.
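
In schema.xml terms, that is roughly the following sketch (every field name
except content_id is illustrative):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_en" indexed="true" stored="true"/>
<field name="body" type="text_en" indexed="true" stored="false"/>
<field name="content_id" type="string" indexed="true" stored="true"/> <!-- Accumulo row id -->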

One caveat -- Accumulo will be protected at the cell level, but if you need
your Solr search results to be protected by complex authorization strings
similar to Accumulo, you will need to write your own QParserPlugin and use
post filtering:
http://java.dzone.com/articles/custom-security-filtering-solr

The code you see in that article is written for an earlier version of Solr,
but it's not too difficult to adjust it for the latest (we've done so in
our project).  Once you've implemented this, you would store an
"authorizations" string field in each Solr document, and pass in the
authorizations that the user has access to in the fq parameter of every
query.  It's also not too bad to write something that parses the Accumulo
authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in
the QParserPlugin.
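
A request might then carry the user's authorizations like this ({!auth} is a
hypothetical name you would register your QParserPlugin under; cache=false
plus a cost of 100 or more is what makes Solr apply it as a post filter):

q=some full-text query
fq={!auth cache=false cost=100}A&B&(C|D|E|F)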

This will give you true row level security in Solr and Accumulo, and it
performs quite well in Solr.

Let me know if you have any other questions.

Joe


On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian  wrote:

> Dear Joe,
> Hi,
> I am going to store the crawl web pages in accumulo as the main storage
> part of my project and I need to give these data to solr for indexing and
> user searches. I need to do some social and web analysis on my data as well
> as having some security features. Therefore accumulo is my choice for the
> database part and for index and search I am going to use Solr. Would you
> please guide me through that?
>
>
>
> On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock  wrote:
>
> > We store data in both Solr and Accumulo -- do you have more details about
> > what kind of data and indexing you want?  Is there a reason you're
> thinking
> > of using both databases in particular?
> >
> >
> > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
> > wrote:
> >
> > > Dear All,
> > > Hi,
> > > I was wondering is there anybody out there that tried to integrate Solr
> > > with Accumulo? I was thinking about using Accumulo on top of HDFS and
> > using
> > > Solr to index data inside Accumulo? Do you have any idea how can I do
> > such
> > > integration?
> > >
> > > Best regards.
> > >
> > > --
> > > A.Nazemian
> > >
> >
> >
> >
> > --
> > I know what it is to be in need, and I know what it is to have plenty.  I
> > have learned the secret of being content in any and every situation,
> > whether well fed or hungry, whether living in plenty or in want.  I can
> do
> > all this through him who gives me strength.*-Philippians 4:12-13*
> >
>
>
>
> --
> A.Nazemian
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.*-Philippians 4:12-13*


Re: how to achieve static boost in solr

2014-07-24 Thread rahulmodi
Thanks a lot Erick,

I have looked at the Query Elevation Component. It works, but the problem is
that if I need to add a new <query> tag or update an existing <query> tag in
the elevate.xml file, then I need to restart the server in order for it to
take effect.

I have also used "forceElevation=true"; even then it requires restarting the
server.

Is there any way by which we can achieve this without restarting the server?

Also, there is another issue: it works only when we use the exact query. An
example is below. The elevate.xml file has an entry like:

<elevate>
  <query text="energy">
    <doc id="http://welcome.energy.com/" />
  </query>
</elevate>

If I use "energy" as the query then I get the correct URL,
"http://welcome.energy.com/".
But if I use "power energy" as the query then I get another URL, whereas here
too I want the URL "http://welcome.energy.com/" to be displayed.

Please suggest how to achieve this.
Thanks in advance.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-achieve-static-boost-in-solr-tp4148788p4148999.html
Sent from the Solr - User mailing list archive at Nabble.com.


Tracking request for Solr index backup

2014-07-24 Thread zzT
One way to get an index backup in Solr is through an HTTP call like this:
http://localhost:8983/solr/replication?command=backup

I have 2 questions regarding this:

1) Is there a way to get information on the progress of the backup operation,
much like the async param that was introduced in 4.8 (I think) for the
Collections API?
2) Does this command guarantee that you'll get a valid backup even if the Solr
core is up and maybe in the middle of indexing/commit operations?

I'm using Solr 4.8.1

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tracking-request-for-Solr-index-backup-tp4148997.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mixing ordinary and nested documents

2014-07-24 Thread Bjørn Axelsen
Thank you very much :-)

2014-07-22 16:34 GMT+02:00 Umesh Prasad :

> public static DocSet mapChildDocsToParentOnly(DocSet childDocSet) {
>
> DocSet mappedParentDocSet = new BitDocSet();
> DocIterator childIterator = childDocSet.iterator();
> while (childIterator.hasNext()) {
> int childDoc = childIterator.nextDoc();
> int parentDoc = childToParentDocMapping[childDoc];
> mappedParentDocSet.addUnique(parentDoc);
> }
> int[] matches = new int[mappedParentDocSet.size()];
> DocIterator parentIter = mappedParentDocSet.iterator();
> for (int i = 0; parentIter.hasNext(); i++) {
> matches[i] = parentIter.nextDoc();
> }
> return new SortedIntDocSet(matches); // you will need
> SortedIntDocSet impl else docset interaction in some facet queries fails
> later.
> }
>
>
>
> On 22 July 2014 19:59, Umesh Prasad  wrote:
>
> > Query parentFilterQuery = new TermQuery(new Term("document_type",
> > "parent"));
> >
> > int[] childToParentDocMapping = new int[searcher.maxDoc()];
> > DocSet allParentDocSet =
> searcher.getDocSet(parentFilterQuery);
> > DocIterator iter = allParentDocSet.iterator();
> > int child = 0;
> > while (iter.hasNext()) {
> > int parent = iter.nextDoc();
> > while (child <= parent) {
> > childToParentDocMapping[child] = parent;
> > child++;
> > }
> > }
> >
> >
> > On 22 July 2014 16:28, Bjørn Axelsen 
> > wrote:
> >
> >> Thanks, Umesh
> >>
> >> You can get the parent bitset by running a the parent doc type query on
> >> > the solr indexsearcher.
> >> > Then child bitset by runnning the child doc type query. Then  use
> these
> >> > together to create a int[] where int[i] = parent of i.
> >> >
> >>
> >> Can you kindly add an example? I am not quite sure how to put this into
> a
> >> query?
> >>
> >> I can easily make the join from child to parent, but what I want to
> >> achieve
> >> is to get the parent document added to the result if it exists but
> >> maintain
> >> the scoring fromt the child as well as the full child document. Is this
> >> possible?
> >>
> >> Cheers,
> >> Bjørn
> >>
> >> 2014-07-18 19:00 GMT+02:00 Umesh Prasad :
> >>
> >> > Comments inline
> >> >
> >> >
> >> > On 16 July 2014 20:31, Bjørn Axelsen <
> bjorn.axel...@fagkommunikation.dk
> >> >
> >> > wrote:
> >> >
> >> > > Hi Solr users
> >> > >
> >> > > I would appreciate your inputs on how to handle a *mix *of *simple
> >> *and
> >> > > *nested
> >> > > *documents in the most easy and flexible way.
> >> > >
> >> > > I need to handle:
> >> > >
> >> > >- simple documens: webpages, short articles etc. (approx. 90% of
> >> the
> >> > >content)
> >> > >- nested documents: books containing chapters etc. (approx 10% of
> >> the
> >> > >content)
> >> > >
> >> > >
> >> >
> >> >
> >> > > For simple documents I just want to present straightforward search
> >> > results
> >> > > without any grouping etc.
> >> > >
> >> > > For the nested documents I want to group by book and show book
> title,
> >> > book
> >> > > price etc. AND the individual results within the book. Lets say
> there
> >> is
> >> > a
> >> > > hit on "Chapters 1" and "Chapter 7" within "Book 1" and a hit on
> >> "Article
> >> > > 1", I would like to present this:
> >> > >
> >> > > *Book 1 title*
> >> > > Book 1 published date
> >> > > Book 1 description
> >> > > - *Chapter 1 title*
> >> > >   Chapter 1 snippet
> >> > > - *Chapter 7 title*
> >> > >   CHapter 7 snippet
> >> > >
> >> > > *Article 1 title*
> >> > > Article 1 published date
> >> > > Article 1 description
> >> > > Article 1 snippet
> >> > >
> >> > > It looks like it is pretty straightforward to use the
> >> CollapsingQParser
> >> > to
> >> > > collapse the book results into one result and not to collapse the
> >> other
> >> > > results. But how about showing the information about the book (the
> >> parent
> >> > > document of the chapters)?
> >> > >
> >> >
> >> > You can map the child document to parent  doc id space and extract the
> >> > information from parent doc id.
> >> >
> >> > First you need to generate child doc to parent doc id mapping one
> time.
> >> >   You can get the parent bitset by running a the parent doc type query
> >> on
> >> > the solr indexsearcher.
> >> > Then child bitset by runnning the child doc type query. Then  use
> these
> >> > together to create a int[] where int[i] = parent of i. This result is
> >> > cachable till next commit. I am doing that for computing facets from
> >> fields
> >> > in parent docs and sorting on values from parent docs (while getting
> >> child
> >> > docs as output).
> >> >
> >> >
> >> >
> >> >
> >> > > 1) Is there a way to do an* optional block join* to a *parent
> >> *document
> >> > and
> >> > > return it together *with *the *child *document - but not to 

RE: Any Solr consultants available??

2014-07-24 Thread Markus Jelsma
Hahaha thanks wunder, made me laugh!

 
-Original message-
> From:Walter Underwood 
> Sent: Thursday 24th July 2014 2:07
> To: solr-user@lucene.apache.org
> Subject: Re: Any Solr consultants available??
> 
> When I see job postings like this, I have to assume they were written by 
> people who really don’t understand the problem and have never met people with 
> the various skills they are asking for. They are not going to find one person 
> who does all this.
> 
> This is an opening for a zebra unicorn that walks on water. At best, they'll 
> get a one-horned goat with painted stripes on a life raft. They need to talk 
> to some people, make multiple realistic openings, and expect to grow some of 
> their own expertise.
> 
> I got an email like this from Goldman Sachs this morning.
> 
> “... a Senior Application Architect/Developer and DevOps Engineer for a major 
> company initiative. In addition to an effort to build a new cloud 
> infrastructure from the ground up, they are beginning a number of company 
> projects in the areas of cloud-based open source search, Machine Learning/AI, 
> Big Data, Predictive Analytics & Low-Latency Trading Algorithm Development.”
> 
> Good luck, fellas.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/
> 
> 
> On Jul 23, 2014, at 1:01 PM, Jack Krupansky  wrote:
> 
> > Yeah, I saw that, which is why I suggested not being too picky about 
> > specific requirements. If you have at least two or three years of solid 
> > Solr experience, that would make you at least worth looking at.
> > 
> > -- Jack Krupansky
> > 
> > From: Tri Cao 
> > Sent: Wednesday, July 23, 2014 3:57 PM
> > To: solr-user@lucene.apache.org 
> > Cc: solr-user@lucene.apache.org 
> > Subject: Re: Any Solr consultants available??
> > 
> > Well, it's kind of hard to find a person if the requirement is "10 years' 
> > experience with Solr" given that Solr was created in 2004.
> > 
> > On Jul 23, 2014, at 12:45 PM, Jack Krupansky  
> > wrote:
> > 
> > 
> >  I occasionally get pinged by recruiters looking for Solr application 
> > developers... here’s the latest. If you are interested, either contact 
> > Jessica directly or reply to me and I’ll forward your reply.
> > 
> >  Even if you don’t strictly meet all the requirements... they are having 
> > trouble finding... anyone. All the great Solr guys I know are quite busy.
> > 
> >  Thanks.
> > 
> >  -- Jack Krupansky
> > 
> >  From: Jessica Feigin 
> >  Sent: Wednesday, July 23, 2014 3:36 PM
> >  To: 'Jack Krupansky' 
> >  Subject: Thank you!
> > 
> >  Hi Jack,
> > 
> > 
> > 
> >  Thanks for your assistance, below is the Solr Consultant job description:
> > 
> > 
> > 
> >  Our client, a hospitality Fortune 500 company are looking to update their 
> > platform to make accessing information easier for the franchisees. This is 
> > the first phase of the project which will take a few years. They want a 
> > hands on Solr consultant who has ideally worked in the search space. As you 
> > can imagine the company culture is great, everyone is really friendly and 
> > there is also an option to become permanent. They are looking for:
> > 
> > 
> > 
> >  - 10+ years’ experience with Solr (Apache Lucene), HTML, XML, Java, 
> > Tomcat, JBoss, MySQL
> > 
> >  - 5+ years’ experience implementing Solr builds of indexes, shards, and 
> > refined searches across semi-structured data sets to include architectural 
> > scaling
> > 
> >  - Experience in developing a re-usable framework to support web site 
> > search; implement rich web site search, including the incorporation of 
> > metadata.
> > 
> >  - Experienced in development using Java, Oracle, RedHat, Perl, shell, and 
> > clustering
> > 
> >  - A strong understanding of Data analytics, algorithms, and large data 
> > structures
> > 
> >  - Experienced in architectural design and resource planning for scaling 
> > Solr/Lucene capabilities.
> > 
> >  - Bachelor's degree in Computer Science or related discipline.
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >  Jessica Feigin 
> >  Technical Recruiter
> > 
> >  Technology Resource Management 
> >  30 Vreeland Rd., Florham Park, NJ 07932 
> >  Phone 973-377-0040 x 415, Fax 973-377-7064 
> >  Email: jess...@trmconsulting.com
> > 
> >  Web site: www.trmconsulting.com
> > 
> >  LinkedIn Profile: www.linkedin.com/in/jessicafeigin
> > 
> > 
> 
> 


Re: How to migrate content of a collection to a new collection

2014-07-24 Thread Per Steffensen

Thanks for replying

I tried this "poor man's" cursor approach out ad hoc, but I get OOM. I am 
pretty sure this is because you need all uniqueKey values in the FieldCache 
in order to be able to sort on the field. We do not have memory for that - 
and never will. Our uniqueKey field is not DocValues.

Just out of curiosity:
* Will I have the same OOM problem using the new CURSOR feature in later Solrs?
* Will the "poor man's" cursor approach still be efficient if my uniqueKey 
were DocValues, knowing that all values for uniqueKey (the DocValues file) 
cannot fit in memory (OS file cache)?


Regards, Per Steffensen

On 23/07/14 23:57, Chris Hostetter wrote:

: billions of documents (not enough memory). Please note that we are on 4.4,
: which does not contain the new CURSOR-feature. Please also note that speed is
: an important factor for us.

for situations where you know you will be processing every doc and order
doesn't matter you can use a "poor mans" cursor by filtering on sccessive
ranges of your uniqueKey field as described in the "Is There A
Workaround?" section of this blog post...

http://searchhub.org/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

* sort on uniqueKey
* leave start=0 on every requets
* add an fq to each request based on the last uniqueKey value from
   the previous request.
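
Concretely, assuming the uniqueKey field is "id", the successive requests look
something like this (LAST_ID_OF_PREVIOUS_PAGE is a placeholder; the exclusive
lower bound keeps the last doc of the previous page from repeating):

  q=*:*&sort=id+asc&rows=1000
  q=*:*&sort=id+asc&rows=1000&fq=id:{LAST_ID_OF_PREVIOUS_PAGE TO *]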


-Hoss
http://www.lucidworks.com/





Re: How to migrate content of a collection to a new collection

2014-07-24 Thread Per Steffensen

On 23/07/14 17:13, Erick Erickson wrote:
> Per:
>
> Given that you said that the field redefinition also includes routing
> info
Exactly. It would probably be much faster to make sure that the new 
collection has the same number of shards on each Solr machine and that 
the routing ranges are identical, and then do a local 1-1 copy at 
shard level. But it just will not end up correct wrt routing, because 
we also need to change our ids while copying from the old to the new 
collections.

> I don't see
> any other way than re-indexing each collection. That said, could you use
> the collection aliasing and do one collection at a time?
We will definitely do one collection at a time. Whether we will use 
aliasing or do something else to achieve create-new-twin-collection -> 
copy-from-old-collection-to-new-twin-collection -> 
delete-old-collection-and-let-new-twin-collection-take-its-place I do 
not know yet. But those are details; we will definitely be able to manage.
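
For reference, the final swap can then be a single Collections API call, e.g.
(collection and alias names here are only illustrative):

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycoll&collections=mycoll_new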


> Best,
> Erick




Re: Where can I get information about sold Cloud H/W spec

2014-07-24 Thread Alexandre Rafalovitch
Have you tried searching the mailing list archives? Some of these
things have been discussed a number of times. SSDs are definitely good
for Solr. But also you may get more specific help if you say what kind
of volume/throughput of data you are looking at.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Thu, Jul 24, 2014 at 3:27 PM, Lee Chunki  wrote:
> Hi,
>
> I am trying to build sold cloud.
> Do you know where can I get informations like :
>
> * solr cloud support heterogeneous servers
> * HDD
>* SDD vs. SAS vs. ….
>* no RAID vs. RAID-5 vs. RAID-0 vs. ….
> * Network
>* 100MB vs. 1GB vs. ….
> * ….
>
> of course, it will depend on data size, traffic and so on.
> but please let me know general or  minimum H/W spec.
>
> Thanks,
> Chunki.


Re: Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Alexandre Rafalovitch
You can put the SynonymFilterFactory at query time as well, but it's
less reliable. Especially if the text is "tel aviv" and the query is
"telaviv", you need to make sure to enable auto phrase search as well.
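
A sketch of that setup: one line in synonyms.txt,

telaviv => tel aviv

and the SynonymFilterFactory in the query-side analyzer chain of your field type:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>

With a one-directional "=>" mapping like this you only need the filter at query
time; the indexed text keeps "tel aviv" as-is.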

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Thu, Jul 24, 2014 at 3:31 PM, Sven Schönfeldt
 wrote:
> So i will need SynonymFilterFactory at indexing, or? Any chance to get it 
> work by query time?
>
>
> Am 24.07.2014 um 10:24 schrieb Alexandre Rafalovitch :
>
>> How often does this happen? Could use synonyms if not too many.
>> On 24/07/2014 3:08 pm, "Sven Schönfeldt"  wrote:
>>
>>> Hi Solr-Users,
>>>
>>> what is the best way to find documents, where the user write a wrong word
>>> in query.
>>>
>>> For example the user search for „telaviv“. the search result should also
>>> include documents where content is „tel aviv“.
>>>
>>> any tipp, or keywords how to do that kind of queries?
>>>
>>> regards, Sven
>


Re: Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Sven Schönfeldt
So I will need the SynonymFilterFactory at indexing time, right? Any chance to 
get it to work at query time?


On 24.07.2014 at 10:24, Alexandre Rafalovitch wrote:

> How often does this happen? Could use synonyms if not too many.
> On 24/07/2014 3:08 pm, "Sven Schönfeldt"  wrote:
> 
>> Hi Solr-Users,
>> 
>> what is the best way to find documents, where the user write a wrong word
>> in query.
>> 
>> For example the user search for „telaviv“. the search result should also
>> include documents where content is „tel aviv“.
>> 
>> any tipp, or keywords how to do that kind of queries?
>> 
>> regards, Sven



Where can I get information about sold Cloud H/W spec

2014-07-24 Thread Lee Chunki
Hi,

I am trying to build a Solr Cloud cluster.
Do you know where I can get information like:

* whether Solr Cloud supports heterogeneous servers
* HDD
   * SSD vs. SAS vs. ….
   * no RAID vs. RAID-5 vs. RAID-0 vs. ….
* Network
   * 100MB vs. 1GB vs. ….
* ….

Of course, it will depend on data size, traffic and so on,
but please let me know a general or minimum H/W spec.

Thanks,
Chunki.

Re: Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Alexandre Rafalovitch
How often does this happen? Could use synonyms if not too many.
On 24/07/2014 3:08 pm, "Sven Schönfeldt"  wrote:

> Hi Solr-Users,
>
> what is the best way to find documents, where the user write a wrong word
> in query.
>
> For example the user search for „telaviv“. the search result should also
> include documents where content is „tel aviv“.
>
> any tipp, or keywords how to do that kind of queries?
>
> regards, Sven


Need a tipp, how to find documents where content is "tel aviv" but user query is "telaviv"?

2014-07-24 Thread Sven Schönfeldt
Hi Solr-Users,

What is the best way to find documents where the user wrote a wrong word in 
the query?

For example, the user searches for „telaviv“. The search result should also 
include documents where the content is „tel aviv“.

Any tip or keywords on how to do that kind of query?

regards, Sven

Re: integrating Accumulo with solr

2014-07-24 Thread Ali Nazemian
Dear Joe,
Hi,
I am going to store the crawled web pages in Accumulo as the main storage
part of my project, and I need to give these data to Solr for indexing and
user searches. I need to do some social and web analysis on my data as well
as having some security features. Therefore Accumulo is my choice for the
database part, and for indexing and search I am going to use Solr. Would you
please guide me through that?



On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock  wrote:

> We store data in both Solr and Accumulo -- do you have more details about
> what kind of data and indexing you want?  Is there a reason you're thinking
> of using both databases in particular?
>
>
> On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian 
> wrote:
>
> > Dear All,
> > Hi,
> > I was wondering is there anybody out there that tried to integrate Solr
> > with Accumulo? I was thinking about using Accumulo on top of HDFS and
> using
> > Solr to index data inside Accumulo? Do you have any idea how can I do
> such
> > integration?
> >
> > Best regards.
> >
> > --
> > A.Nazemian
> >
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.*-Philippians 4:12-13*
>



-- 
A.Nazemian


Re: solr always loading and not any response

2014-07-24 Thread Alexandre Rafalovitch
Is it on the same machine or on a different one? Either way, try to
open the developer console in the browser and see what's happening on
the network when you reload. Also, check the Solr side to see whether you
get any message in the console. Maybe you are hitting an exception of some
sort.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Thu, Jul 24, 2014 at 2:28 PM, zhijun liu  wrote:
> hi, all, solr admin page is always "loading", and when I send query request
> also can not get any response. the tcp link is  always "ESTABLISHED"。only
> restart solr service can fix it. how to find out the problem?
>
> solr:4.6
> jetty:8
>
> thanks so much.


solr always loading and not any response

2014-07-24 Thread zhijun liu
Hi all, the Solr admin page is always "loading", and when I send a query
request I also cannot get any response. The TCP link is always
"ESTABLISHED". Only restarting the Solr service can fix it. How can I find
out the problem?

solr:4.6
jetty:8

thanks so much.