Nested documents (parent,child,grandchild), multi-select facets

2021-01-26 Thread Lance Snell
oupingFilter2}"
,
"topQuery_product":""

},
"fields":
"*, [child fl=$returnFields limit=-1
childFilter='/productItems/{!filters v=$child_FQ2}']"
,
"filter": [

// "${prdoc_Q2} {!parent which='*:* -_nest_path_:/*' v=$prdoc_Q2}
{!parent which='*:* -_nest_path_:/productItems/*' v=$prdoc_Q2}
categoryPrefixes:(ASB00434166111481 ASB0043416611148All)"

"_query_:(${prdoc_Q2} ${pcdoc_Q2} docType:pcdoc docType:prdoc
docType:pidoc docType:pdoc)"
,
"{!filters tag=MID param=$test v=$baseQ}"
],
"sort":
"{!parent which='*:* -_nest_path_:*' score=max v='+docType:prdoc
+{!func}fprice'} asc"
,
"offset": 0,
"limit": "${LIMIT}",
"facet": {
"testing": {
"type": "terms",
    "field": "docType",
"limit": -1,
"facet": {
"parentCount": "unique(_root_)",
},
"domain":{
"excludeTags":[
"TOP",
//"MID",
// "LOW"
]
}
},
"testing2": {
"type": "terms",
"field": "docType",
"limit": -1,
"facet": {
"parentCount": "unique(_root_)",
},
"domain":{
"excludeTags":[
"TOP",
"MID",
// "LOW"
]
}
}
}
}
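
For anyone following along: the request above uses the standard JSON Facet API multi-select pattern, i.e. tag each filter and exclude that tag from the domain of the facet that should ignore it. A minimal sketch with illustrative field and tag names (not the schema discussed here):

{
  "query": "*:*",
  "filter": [
    "{!tag=COLOR}color:blue",
    "{!tag=SIZE}size:M"
  ],
  "facet": {
    "colors": { "type": "terms", "field": "color", "domain": { "excludeTags": "COLOR" } },
    "sizes":  { "type": "terms", "field": "size",  "domain": { "excludeTags": "SIZE" } }
  }
}

With nested documents the same excludeTags idea applies, but each facet's domain usually also needs a blockChildren or blockParent step so counts are computed at the right level; the unique(_root_) sub-facets above are one way to report counts in terms of parent documents.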


-- 
Thanks,

Lance


Re: Multi-select faceting for nested documents

2021-01-26 Thread Lance Snell
-1
childFilter='/productItems/{!filters v=$child_FQ2}']"
,
"filter": [

// "${prdoc_Q2} {!parent which='*:* -_nest_path_:/*' v=$prdoc_Q2}
{!parent which='*:* -_nest_path_:/productItems/*' v=$prdoc_Q2}
categoryPrefixes:(ASB00434166111481 ASB0043416611148All)"

"_query_:(${prdoc_Q2} ${pcdoc_Q2} docType:pcdoc docType:prdoc
docType:pidoc docType:pdoc)"
,
"{!filters tag=MID param=$test v=$baseQ}"
],
"sort":
"{!parent which='*:* -_nest_path_:*' score=max v='+docType:prdoc
+{!func}fprice'} asc"
,
"offset": 0,
"limit": "${LIMIT}",
"facet": {
"testing": {
"type": "terms",
"field": "docType",
"limit": -1,
"facet": {
"parentCount": "unique(_root_)",
},
"domain":{
"excludeTags":[
"TOP",
//"MID",
// "LOW"
]
}
},
"testing2": {
"type": "terms",
"field": "docType",
"limit": -1,
"facet": {
"parentCount": "unique(_root_)",
},
"domain":{
"excludeTags":[
    "TOP",
"MID",
// "LOW"
]
}
}
}
}


On Mon, Jan 25, 2021 at 9:41 AM Alexandre Rafalovitch 
wrote:

> I don't have an answer, but I feel that maybe explaining the situation
> in more details would help a bit more. Specifically, you explain your
> data structure well, but not your actual presentation requirement in
> enough details.
>
> How would you like the multi-select to work, how it is working for you
> now and what is the gap?
>
> Regards,
>Alex.
> P.s. Sometimes, you have to really modify the way the information is
> stored in Solr for the efficient and effective search results. Solr is
> not the database, so it needs to model search requirements, rather
> than original data shape.
>
> On Mon, 25 Jan 2021 at 10:34, Lance Snell 
> wrote:
> >
> > Any examples would be greatly appreciated.
> >
> > On Mon, Jan 25, 2021, 2:25 AM Lance Snell 
> wrote:
> >
> > > Hey all,
> > >
> > > I am having trouble finding current examples of multi-select faceting
> for
> > > nested documents.  Specifically ones with *multiple *levels of nested
> > > documents.
> > >
> > > My current schema has a parent document, two child documents(siblings),
> > > and a grandchild document.  I am using the JSON API.
> > >
> > > Product -> Sku -> Price
> > >|
> > >\/
> > > StoreCategory
> > >
> > > Any help/direction would be appreciated.
> > >
> > >
> > > Solr. 8.6
> > >
> > > --
> > > Thanks,
> > >
> > > Lance
> > >
>


-- 
Thanks,

Lance Snell
(507) 829-7389


Re: Multi-select faceting for nested documents

2021-01-25 Thread Lance Snell
Any examples would be greatly appreciated.

On Mon, Jan 25, 2021, 2:25 AM Lance Snell  wrote:

> Hey all,
>
> I am having trouble finding current examples of multi-select faceting for
> nested documents.  Specifically ones with *multiple *levels of nested
> documents.
>
> My current schema has a parent document, two child documents(siblings),
> and a grandchild document.  I am using the JSON API.
>
> Product -> Sku -> Price
>|
>\/
> StoreCategory
>
> Any help/direction would be appreciated.
>
>
> Solr. 8.6
>
> --
> Thanks,
>
> Lance
>


Multi-select faceting for nested documents

2021-01-25 Thread Lance Snell
Hey all,

I am having trouble finding current examples of multi-select faceting for
nested documents.  Specifically ones with *multiple *levels of nested
documents.

My current schema has a parent document, two child documents(siblings), and
a grandchild document.  I am using the JSON API.

Product -> Sku -> Price
   |
   \/
StoreCategory

Any help/direction would be appreciated.


Solr. 8.6

-- 
Thanks,

Lance


Re: need help on OpenNLP with Solr

2014-01-09 Thread Lance Norskog
There is no way to do these things with LUCENE-2899.


On Mon, Jan 6, 2014 at 8:07 AM, rashi gandhi gandhirash...@gmail.comwrote:

 Hi,



 I have applied OpenNLP (LUCENE 2899.patch) patch to SOLR-4.5.1 for nlp
 searching and it is working fine.

 Also I have designed an analyzer for this:

 <fieldType name="nlp_type" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.OpenNLPTokenizerFactory"
                sentenceModel="opennlp/en-test-sent.bin"
                tokenizerModel="opennlp/en-test-tokenizer.bin"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.OpenNLPFilterFactory"
             posTaggerModel="opennlp/en-pos-maxent.bin"/>
     <filter class="solr.OpenNLPFilterFactory"
             nerTaggerModels="opennlp/en-ner-person.bin"/>
     <filter class="solr.OpenNLPFilterFactory"
             nerTaggerModels="opennlp/en-ner-location.bin"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.OpenNLPTokenizerFactory"
                sentenceModel="opennlp/en-test-sent.bin"
                tokenizerModel="opennlp/en-test-tokenizer.bin"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.OpenNLPFilterFactory"
             posTaggerModel="opennlp/en-pos-maxent.bin"/>
     <filter class="solr.OpenNLPFilterFactory"
             nerTaggerModels="opennlp/en-ner-person.bin"/>
     <filter class="solr.OpenNLPFilterFactory"
             nerTaggerModels="opennlp/en-ner-location.bin"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"/>
   </analyzer>
 </fieldType>


 I am able to find that posTaggerModel is performing tagging on the phrases
 and adding the payloads (but I am not able to analyze it).

 My question is:
 Can I search a phrase giving a higher boost to NOUN than to VERB?
 For example: if I am searching "sitting on blanket", I want to give a high
 boost to the NOUN terms first, then the VERB terms, as tagged by OpenNLP.
 How can I use payloads for boosting?
 What are the changes required in schema.xml?

 Please provide me some pointers to move ahead

 Thanks in advance




-- 
Lance Norskog
goks...@gmail.com


Re: SolrCloud unstable

2013-11-24 Thread Lance Norskog
Yes, you should use a recent Java 7. Java 6 is end-of-life and no longer 
supported by Oracle. Also, read up on the various garbage collectors. It 
is a complex topic and there are many guides online.


In particular there is a problem in some Java 6 releases that causes a 
massive memory leak in Solr. The symptom is that memory use oscillates 
(normally) from, say 1GB to 2GB. After the bug triggers, the ceiling of 
2GB becomes the floor, and memory use oscillates from 2GB to 3GB. I'm 
not saying this is the problem you have. I'm just saying that it is 
important to read up on garbage collection.
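
If it helps with the diagnosis, the behaviour described above can be captured with the standard HotSpot GC logging flags of that era (log path illustrative):

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/solr/gc.log -jar start.jar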


Lance

On 11/22/2013 05:27 AM, Martin de Vries wrote:
  


We did some more monitoring and have some new information:

Before
the issue happens the garbage collector's collection count increases a
lot. The increase seems to start about an hour before the real problem
occurs:

http://www.analyticsforapplications.com/GC.png [1]

We tried
both the g1 garbage collector and the regular one, the problem happens
with both of them.

We use Java 1.6 on some servers. Will Java 1.7 be
better?

Martin

Martin de Vries schreef op 12.11.2013 10:45:

Hi,

We have:

Solr 4.5.1 - 5 servers
36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
about 4.5 GB total data on disk per server
4GB JVM-Memory per server, 3GB average in use

Zookeeper 3.3.5 - 3 servers (one shared with Solr)
haproxy load balancing

Our Solrcloud is very unstable. About one time a week some cores go in
recovery state or down state. Many timeouts occur and we have to restart
servers to get them back to work. The failover doesn't work in many
cases, because one server has the core in down state, the other in
recovering state. Other cores work fine. When the cloud is stable I
sometimes see log messages like:

- shard update error StdNode:
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at:
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1

- forwarding update to
http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
retrying ...

- null:ClientAbortException: java.io.IOException: Broken pipe

Before the cloud problems start there are many large Qtime's in the
log (sometimes over 50 seconds), but there are no other errors until the
recovery problems start.

Any clue about what can be wrong?

Kind regards,

Martin
  


Links:
--
[1]
http://www.analyticsforapplications.com/GC.png





Re: SOLR: Searching on OpenNLP fields is unstable

2013-10-20 Thread Lance Norskog
Hi-

Unit tests to the rescue! The current unit test system in the 4.x branch
catches code sequence problems.

  [junit4] Throwable #1: java.lang.IllegalStateException:
TokenStream contract violation: reset()/close() call missing, reset()
called multiple times, or subclass does not call super.reset().
 Please see Javadocs of TokenStream class for more information about the
correct consuming workflow.

I'll try to get this right. But both OpenNLP and LUCENE-2899 have
deployment problems:
1) OpenNLP does not have a good source of statistical training data for the
models. For example, the NER models are trained from late 1980's newspaper
articles, so the organization finder is kind of... obsolete. That kind of
problem. I think the currency recognizer is trained on text from before the
Euro was introduced (not sure about this).
2) Solr has a basic packaging problem when the Lucene code uses external
libraries.

As to adding it to the main Solr project, I think the Marketplace Of Ideas
has spoken with deafening silence :)


On Wed, Sep 25, 2013 at 9:26 AM, rashi gandhi gandhirash...@gmail.comwrote:

 HI,



 I am working on OpenNLP integration with SOLR. I have successfully applied
 the patch (LUCENE-2899-x.patch) to latest SOLR source code (branch_4x).

 I have designed OpenNLP analyzer and index data to it. Analyzer
 declaration in schema.xml is as



   <fieldType name="nlp_type" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <!-- Sequence of tokenizers and filters applied at the index time -->
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory"
               ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.SynonymFilterFactory"
               synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.SnowballPorterFilterFactory"/>
       <filter class="solr.ASCIIFoldingFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <!-- Sequence of tokenizers and filters applied at the index time -->
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.OpenNLPFilterFactory"
               posTaggerModel="opennlp/en-pos-maxent.bin"/>
       <filter class="solr.OpenNLPFilterFactory"
               nerTaggerModels="opennlp/en-ner-person.bin"/>
       <filter class="solr.OpenNLPFilterFactory"
               nerTaggerModels="opennlp/en-ner-location.bin"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory"
               ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     </analyzer>
   </fieldType>



 And field declared for this analyzer:

 <field name="Detail_Person" type="nlp_type" indexed="true" stored="true"
        omitNorms="true" omitPositions="true"/>



 The problem is here: when I search over this field Detail_Person, the results
 are not consistent.

 When I search Detail_Person:brett, it returns one document.

 But when I fire the same query again, it returns zero documents.

 Searching is not stable on the OpenNLP field; sometimes it returns documents
 and sometimes not, but the documents are there.

 And if I search on non-OpenNLP fields, it works properly; the results are
 stable and correct.

 Please help me to make the Solr results consistent.

 Thanks in Advance.





-- 
Lance Norskog
goks...@gmail.com


Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Lance Norskog

On 10/13/2013 10:02 AM, Shawn Heisey wrote:

On 10/13/2013 10:16 AM, Josh Lincoln wrote:

I have a large solr response in xml format and would like to import it into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream=true to handle the full file, but I still get an out
of memory error, so I believe stream does not work with solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
the same capability).

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.

How big is the XML file?  You might be running into a size limit for
HTTP POST.

In newer 4.x versions, Solr itself sets the size of the POST buffer
regardless of what the container config has.  That size defaults to 2MB
but is configurable using the formdataUploadLimitInKB setting that you
can find in the example solrconfig.xml file, on the requestParsers tag.

In Solr 3.x, if you used the included jetty, it had a configured HTTP
POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
included Jetty that prevented the configuration element from working, so
the actual limit was Jetty's default of 200KB.  With other containers
and these older versions, you would need to change your container
configuration.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn

The SEP calls out to another Solr and reads. Are you importing data from 
another Solr and cross-connecting it with your uploaded XML?


If the memory errors are a problem with streaming, you could try 
piping your uploaded documents through a processor that supports 
streaming. This would then push one document at a time into your 
processor that calls out to Solr and combines records.




Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Lance Norskog

Can you do this data in CSV format? There is a CSV reader in the DIH.
The SEP was not intended to read from files, since there are already 
better tools that do that.


Lance

On 10/14/2013 04:44 PM, Josh Lincoln wrote:

Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
the POST buffer being the issue. Thanks for suggesting I test this. The
full file is over a gig.

Lance, I'm actually pointing SEP at a static file (I simply named the file
select and put it on a Web server). SEP thinks it's a large solr
response, which it was, though now it's just static xml. Works well until I
hit the memory limit of the new solr instance.

I can't query the old solr from the new one b/c they're on two different
networks. I can't copy the index files b/c I only want a subset of the data
(identified with a query and dumped to xml...all fields of interest were
stored). To further complicate things, the old solr is 1.4. I was hoping to
use the result xml format to backup the old, and DIH SEP to import to the
new dev solr4.x. It's promising as a simple and repeatable migration
process, except that SEP fails on largish files.

It seems my options are 1) use the xpathprocessor and identify each field
(there are many fields); 2) write a small script to act as a proxy to the
xml file and accept the row and start parameters from the SEP iterative
calls and return just a subset of the docs; 3) a script to process the xml
and push to solr, not using DIH; 4) consider XSLT to transform the result
xml to an update message and use XPathEntityProcessor
with useSolrAddSchema=true and streaming. The latter seems like the most
elegant and reusable approach, though I'm not certain it'll work.

It'd be great if solrEntityProcessor could stream static files, or if I
could specify the solr result format while using the xpathentityprocessor
(i.e. a useSolrResultSchema option)

Any other ideas?
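
For reference, option 4 above maps onto a DIH config roughly like this (a sketch, untested; file names and paths are illustrative):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Stream the saved Solr response XML through an XSLT that rewrites it into
         <add><doc> update format; useSolrAddSchema then maps the fields without
         declaring them one by one. -->
    <entity name="export"
            processor="XPathEntityProcessor"
            url="/path/to/saved-response.xml"
            xsl="xslt/response-to-add.xsl"
            useSolrAddSchema="true"
            stream="true"/>
  </document>
</dataConfig>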






On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog goks...@gmail.com wrote:


On 10/13/2013 10:02 AM, Shawn Heisey wrote:


On 10/13/2013 10:16 AM, Josh Lincoln wrote:


I have a large solr response in xml format and would like to import it
into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream=true to handle the full file, but I still get an
out
of memory error, so I believe stream does not work with
solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might
have
the same capability).

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.


How big is the XML file?  You might be running into a size limit for
HTTP POST.

In newer 4.x versions, Solr itself sets the size of the POST buffer
regardless of what the container config has.  That size defaults to 2MB
but is configurable using the formdataUploadLimitInKB setting that you
can find in the example solrconfig.xml file, on the requestParsers tag.

In Solr 3.x, if you used the included jetty, it had a configured HTTP
POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
included Jetty that prevented the configuration element from working, so
the actual limit was Jetty's default of 200KB.  With other containers
and these older versions, you would need to change your container
configuration.

https://bugs.eclipse.org/bugs/**show_bug.cgi?id=397130https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn

  The SEP calls out to another Solr and reads. Are you importing data from

another Solr and cross-connecting it with your uploaded XML?

If the memory errors are a problem with streaming, you could try piping
your uploaded documents through a processor that supports streaming. This
would then push one document at a time into your processor that calls out
to Solr and combines records.






Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-10 Thread Lance Norskog
Yes, Solr/Lucene works fine with other indexes this large. There are 
many indexes with hundreds of gigabytes and hundreds of millions of 
documents. My experience years ago was that at this scale, searching 
worked great, sorting & facets less so, and the real problem was IT: a 
200G blob of data is a pain in the neck to administer.


As always, every index is different, but you should not have problems 
doing the merge that you describe.


Lance

On 09/08/2013 09:01 PM, diyun2008 wrote:

Thank you Erick. It's very useful to me. I have already started to merge lots
of collections into 15 collections, but there's another question. If I merge
1000 collections into 1 collection, the new collection will have about
20G of data and about 30M records. On 1 Solr server I will create 15 such big
collections. So I don't know if Solr can support such big data in 1
collection (20G data with 30M records) or on 1 Solr server (15*20G data with
15*30M records)? Or do I need to buy new servers to install Solr and do sharding
to support that?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088802.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: SOLR Prevent solr of modifying fields when update doc

2013-08-23 Thread Lance Norskog
Solr does not by default generate unique IDs. It uses what you give as 
your unique field, usually called 'id'.


What software do you use to index data from your RSS feeds? Maybe that 
is creating a new 'id' field?


There is no partial update, Solr (Lucene) always rewrites the complete 
document.
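
Greg's suggestion below refers to Solr's atomic update syntax, which lets the client send only the changed fields; under the hood Solr still rewrites the whole document, and the other fields must be stored. A minimal sketch, with an illustrative id and field name:

curl 'http://localhost:8983/solr/update?commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"feed-item-42","title":{"set":"New title from the feed"}}]'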


On 08/23/2013 09:03 AM, Greg Preston wrote:

Perhaps an atomic update that only changes the fields you want to change?

-Greg


On Fri, Aug 23, 2013 at 4:16 AM, Luís Portela Afonso
meligalet...@gmail.com wrote:

Hi, thanks for the answer, but the uniqueId is generated by me. When Solr 
indexes and there is an update to a doc, it deletes the doc and creates a new 
one, so it generates a new UUID.
That is not suitable for me, because I want Solr to update just some fields, 
because the UUID is the key that I use to map it to a user in my database.

Right now I'm using information that comes from the source and never changes as 
my uniqueId, for example the guid that exists in some RSS feeds, or if it 
doesn't exist I use the link.

I think there isn't any simple solution for me, because from what I have read, when 
an update to a doc happens, Solr deletes the old one and creates a new one, right?

On Aug 23, 2013, at 12:07 PM, Erick Erickson erickerick...@gmail.com wrote:


Well, not much in the way of help because you can't do what you
want AFAIK. I don't think UUID is suitable for your use-case. Why not
use your uniqueId?

Or generate something yourself...

Best
Erick


On Thu, Aug 22, 2013 at 5:56 PM, Luís Portela Afonso meligalet...@gmail.com

wrote:
Hi,

How can I prevent Solr from updating some fields when updating a doc?
The problem is, I have a UUID in the field named uuid, but it is not a
unique key. When an RSS source updates a feed, Solr will update the doc with
the same link, but it generates a new UUID. This is not desired because
this id is used by me to relate feeds to a user.

Can someone help me?

Many Thanks




Re: How to SOLR file in svn repository

2013-08-22 Thread Lance Norskog

You need to:
1) crawl the SVN database
2) index the files
3) make a UI that fetches the original file when you click on a search 
result.


Solr only has #2. If you run a subversion web browser app, you can 
download the developer-only version of the LucidWorks product and crawl 
the SVN web viewer. This will give you #1 and #3.
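
For step 2, each file fetched from the repository can be pushed through the extracting handler (Solr Cell), assuming that contrib is enabled; the id and path here are illustrative:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
     -F "myfile=@/checkout/path/README.txt"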


Lance

On 08/21/2013 09:00 AM, jiunarayan wrote:

I have an SVN repository and SVN file paths. How can I use Solr to search the
content of the SVN files?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-SOLR-file-in-svn-repository-tp4085904.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Document Similarity Algorithm at Solr/Lucene

2013-08-07 Thread Lance Norskog

Block-quoting and plagiarism are two different questions.

Block-quoting is simple: break the text apart into sentences or even 
paragraphs and make them separate documents. Make facets of the 
post-analysis text. Now just pull counts of facets and block quotes will 
be clear.
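
One way to express that with plain facet parameters (field names are illustrative): index each sentence as its own document carrying its source document's id, copy the sentence into an exact-match string field, and ask for values that occur more than once:

q=*:*&rows=0&fq=doc_type:sentence
    &facet=true&facet.field=sentence_exact
    &facet.mincount=2&facet.limit=100

Each returned facet value is a sentence that appears in at least two places.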


Mahout has a scalable implementation of n-gram based document 
similarity. It calculates distances between all documents and identifies 
clusters of similar documents. This is a much more general technique and 
may help you find obfuscated plagiarism.


Lance

On 07/23/2013 02:33 AM, Furkan KAMACI wrote:

Hi;

Sometimes a huge part of a document may exist in another document, as
in student plagiarism or a blog post quoting another blog post.
Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
detect it?





Re: Percolate feature?

2013-08-05 Thread Lance Norskog

Cool!

On 08/05/2013 03:34 AM, Charlie Hull wrote:

On 03/08/2013 00:50, Mark wrote:

We have a set number of known terms we want to match against.

In Index:
term one
term two
term three

I know how to match all terms of a user query against the index but 
we would like to know how/if we can match a user's query against all 
the terms in the index?


Search Queries:
my search term = 0 matches
my term search one = 1 match  (term one)
some prefix term two = 1 match (term two)
one two three = 0 matches

I can only explain this is almost a reverse search???

I came across the following from ElasticSearch 
(http://www.elasticsearch.org/guide/reference/api/percolate/) and it 
sounds like this may accomplish the above but haven't tested. I was 
wondering if Solr had something similar or an alternative way of 
accomplishing this?


Thanks



Hi Mark,

We've built something that implements this kind of reverse search for 
our clients in the media monitoring sector - we're working on 
releasing the core of this as open source very soon, hopefully in a 
month or two. It's based on Lucene.


Just for reference it's able to apply tens of thousands of stored 
queries to a document per second (our clients often have very large 
and complex Boolean strings representing their clients' interests and 
may monitor hundreds of thousands of news stories every day). It also 
records the positions of every match. We suspect it's a lot faster and 
more flexible than Elasticsearch's Percolate feature.


Cheers

Charlie





Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Lance Norskog

Are you feeding Graphite from Solr? If so, how?

On 07/19/2013 01:02 AM, Neil Prosser wrote:

That was overnight so I was unable to track exactly what happened (I'm
going off our Graphite graphs here).




Re: adding date column to the index

2013-07-22 Thread Lance Norskog
Solr/Lucene does not automatically backfill a new field the way DBMS systems 
add a column. Instead, all data for a field is written at indexing time. To 
populate the new field, you have to reload all of your data.


This is also true for deleting fields. If you remove a field, that data 
does not go away until you re-index.


On 07/22/2013 07:31 AM, Mysurf Mail wrote:

I have added a date field to my index.
I don't want the query to search on this field, but I want it to be returned
with each row.
So I have defined it in the schema.xml as follows:
   <field name="LastModificationTime" type="date" indexed="false"
          stored="true" required="true"/>



I added it to the select in data-config.xml and I see it selected in the
profiler.
Now, when I query all fields (using the dashboard) I don't see it.
Even when I ask for it specifically I don't see it.
What am I doing wrong?

(In the db it is (datetimeoffset(7)))





Re: JVM Crashed - SOLR deployed in Tomcat

2013-07-16 Thread Lance Norskog
I don't know about jvm crashes, but it is known that the Java 6 jvm had 
various problems supporting Solr, including the 20-30 series. A lot of 
people use the final jvm release (I think 6_30).


On 07/16/2013 12:25 PM, neoman wrote:

Hello Everyone,
We are using solrcloud with Tomcat in our production environment.
Here is our configuration.
solr-4.0.0
JVM 1.6.0_25

The JVM keeps crashing everyday with the following error. I think it is
happening while we try index the data with solrj APIs.

INFO: [aq-core] webapp=/solr path=/update
params={distrib.from=http://solr03-prod:8080/solr/aq-core/update.distrib=TOLEADERwt=javabinversion=2}
status=0 QTime=1
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xfd7ffadac771, pid=2411, tid=33662
#
# JRE version: 6.0_25-b06
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode
solaris-amd64 compressed oops)
# Problematic frame:
# J
org.apache.lucene.codecs.PostingsConsumer.merge(Lorg/apache/lucene/index/MergeState;Lorg/apache/lucene/index/DocsEnum;Lorg/apache/lucene/util/FixedBitSet;)Lorg/apache/lucene/codecs/TermStats;
#
# An error report file with more information is saved as:
# /opt/tomcat/hs_err_pid2411.log
Jul 16, 2013 6:27:07 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit{flags=0,_version_=0,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp



Instructions: (pc=0xfd7ffadac771)
0xfd7ffadac751:   89 4c 24 30 4c 89 44 24 28 4c 89 54 24 18 44 89
0xfd7ffadac761:   5c 24 20 4c 8b 57 10 4d 63 d9 49 8b ca 49 03 cb
0xfd7ffadac771:   44 0f be 01 45 8b d9 41 ff c3 44 89 5f 18 45 85
0xfd7ffadac781:   c0 0f 8c b0 05 00 00 45 8b d0 45 8b da 41 d1 eb

Register to memory mapping:

RAX=0x14008cf2 is an unknown value
RBX=
[error occurred during error reporting (printing register info), id 0xb]

Stack: [0xfd7de4eff000,0xfd7de4fff000],  sp=0xfd7de4ffe140,
free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
code)
J
org.apache.lucene.codecs.PostingsConsumer.merge(Lorg/apache/lucene/index/MergeState;Lorg/apache/lucene/index/DocsEnum;Lorg/apache/lucene/util/FixedBitSet;)Lorg/apache/lucene/codecs/TermStats;

Please let me know if anyone has seen this before. Any input is appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/JVM-Crashed-SOLR-deployed-in-Tomcat-tp4078439.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Norms

2013-07-12 Thread Lance Norskog
Norms stay in the index even if you delete all of the data. If you just 
changed the schema, emptied the index, and tested again, you've still 
got norms in there.


You can examine the index with Luke to verify this.

On 07/09/2013 08:57 PM, William Bell wrote:

I have a field that has omitNorms=true, but when I look at debugQuery I see
that
the field is being normalized for the score.

What can I do to turn off normalization in the score?

I want a simple way to do 2 things:

boost geodist() highest at 1 mile and lowest at 100 miles.
plus add a boost for a query=edgefield^5.

I only want tf() and no queryNorm. I am not even sure I want idf() but I
can probably live with rare names being boosted.



The results are being normalized. See below. I tried dismax and edismax -
bf, bq and boost.

<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">
      display_name,city_state,prov_url,pwid,city_state_alternative
    </str>
    <!--
    <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str>
    -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2-1 4-2 6-3</str>
  </lst>
</requestHandler>

0.058555886 = queryNorm

product of: 10.854807 = (MATCH) sum of: 1.8391232 = (MATCH) max plus 0.01
times others of: 1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378),
product of: 0.30982485 = queryWeight(name_edge:paul^0.9), product of: 0.9 =
boost 5.8789964 = idf(docFreq=26567, maxDocs=3493655)* 0.058555886 =
queryNorm* 5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378),
product of: 1.0 = tf(termFreq(name_edge:paul)=1) 5.8789964 =
idf(docFreq=26567, maxDocs=3493655) 1.0 = fieldNorm(field=name_edge,
doc=231378) 1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378),
product of: 0.30510724 = queryWeight(name_edgy:paul^0.9), product of: 0.9 =
boost 5.789479 = idf(docFreq=29055, maxDocs=3493655)* 0.058555886 =
queryNorm* 5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378),
product of: 1.0 = tf(termFreq(name_edgy:paul)=1) 5.789479 =
idf(docFreq=29055, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy,
doc=231378) 9.015684 = (MATCH) max plus 0.01 times others of: 8.9352665 =
(MATCH) weight(name_word:nutting in 231378), product of: 0.72333425 =
queryWeight(name_word:nutting), product of: 12.352887 = idf(docFreq=40,
maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH)
fieldWeight(name_word:nutting in 231378), product of: 1.0 =
tf(termFreq(name_word:nutting)=1) 12.352887 = idf(docFreq=40,
maxDocs=3493655) 1.0 = fieldNorm(field=name_word, doc=231378) 8.04174 =
(MATCH) weight(name_edgy:nutting^0.9 in 231378), product of: 0.65100086 =
queryWeight(name_edgy:nutting^0.9), product of: 0.9 = boost 12.352887 =
idf(docFreq=40, maxDocs=3493655)* 0.058555886 = queryNorm* 12.352887 =
(MATCH) fieldWeight(name_edgy:nutting in 231378), product of: 1.0 =
tf(termFreq(name_edgy:nutting)=1) 12.352887 = idf(docFreq=40,
maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 1.0855998 =
sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))







Re: Solr limitations

2013-07-10 Thread Lance Norskog

Also, total index file size. At 200-300gb managing an index becomes a pain.

Lance

On 07/08/2013 07:28 AM, Jack Krupansky wrote:
Other than the per-node/per-collection limit of 2 billion documents 
per Lucene index, most of the limits of Solr are performance-based 
limits - Solr can handle it, but the performance may not be 
acceptable. Dynamic fields are a great example. Nothing prevents you 
from creating a document with, say, 50,000 dynamic fields, but you are 
likely to find the performance less than acceptable. Or facets. Sure, 
Solr will let you have 5,000 faceted fields, but the performance is 
likely to be... you get the picture.


What is acceptable performance? That's for you to decide.

What will the performance of 5,000 dynamic fields or 500 faceted 
fields or 500 million documents on a node be? It all depends on your 
data, especially the cardinality (unique values) of each individual 
field.


How can you determine the performance? Only one way: Proof of concept. 
You need to do your own proof of concept implementation, with your own 
representative data, with your own representative data model, with 
your own representative hardware, with your own representative client 
software, with your own representative user query load. That testing 
will give you all the answers you need.


There are are no magic answers. Don't believe any magic spreadsheet or 
magic wizard. Flip a coin whether they will work for your situation.


Some simple, common sense limits:

1. No more than 50 to 100 million documents per node.
2. No more than 250 fields per document.
3. No more than 250K characters per document.
4. No more than 25 faceted fields.
5. No more than 32 nodes in your SolrCloud cluster.
6. Don't return more than 250 results on a query.

None of those is a hard limit, but don't go beyond them unless your 
Proof of Concept testing proves that performance is acceptable for 
your situation.


Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary 
tests and then scale as needed.


Dynamic and multivalued fields? Try to stay away from them - except 
for the simplest cases, they are usually an indicator of a weak data 
model. Sure, it's fine to store a relatively small number of values in 
a multivalued field (say, dozens of values), but be aware that you 
can't directly access individual values, you can't tell which was 
matched on a query, and you can't coordinate values between multiple 
multivalued fields. Except for very simple cases, multivalued fields 
should be flattened into multiple documents with a parent ID.


Since you brought up the topic of dynamic fields, I am curious how you 
got the impression that they were a good technique to use as a 
starting point. They're fine for prototyping and hacking, and fine 
when used in moderation, but not when used to excess. The whole point 
of Solr is searching and searching is optimized within fields, not 
across fields, so having lots of dynamic fields is counter to the 
primary strengths of Lucene and Solr. And... schemas with lots  of 
dynamic fields tend to be difficult to maintain. For example, if you 
wanted to ask a support question here, one of the first things we want 
to know is what your schema looks like, but with lots of dynamic 
fields it is not possible to have a simple discussion of what your 
schema looks like.


Sure, there is something called schemaless design (and Solr supports 
that in 4.4), but that's very different from heavy reliance on dynamic 
fields in the traditional sense. Schemaless design is A-OK, but using 
dynamic fields for arrays of data in a single document is a poor 
match for the search features of Solr (e.g., Edismax searching across 
multiple fields.)


One other tidbit: Although Solr does not enforce naming conventions 
for field names, and you can put special characters in them, there are 
plenty of features in Solr, such as the common fl parameter, where 
field names are expected to adhere to Java naming rules. When people 
start going wild with dynamic fields, it is common that they start 
going wild with their names as well, using spaces, colons, slashes, 
etc. that cannot be parsed in the fl and qf parameters, for 
example. Please don't go there!


In short, put up a small cluster and start doing a Proof of Concept 
cluster. Stay within my suggested guidelines and you should do okay.


-- Jack Krupansky

-Original Message- From: Marcelo Elias Del Valle
Sent: Monday, July 08, 2013 9:46 AM
To: solr-user@lucene.apache.org
Subject: Solr limitations

Hello everyone,

   I am trying to search information about possible solr limitations I
should consider in my architecture. Things like max number of dynamic
fields, max number o documents in SolrCloud, etc.
   Does anyone know where I can find this info?

Best regards,




Re: Distributed search results in SocketException: Connection reset

2013-06-30 Thread Lance Norskog

This usually means the end server timed out.

On 06/30/2013 06:31 AM, Shahar Davidson wrote:

Hi all,

We're getting the below exception sporadically when using distributed search. 
(using Solr 4.2.1)
Note that 'core_3' is one of the cores mentioned in the 'shards' parameter.

Any ideas anyone?

Thanks,

Shahar.


Jun 03, 2013 5:27:38 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: http://127.0.0.1:8210/solr/core_3
 at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:300)
 at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
 at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
 at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
 at 
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
 at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
 at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occured when talking to server at: http://127.0.0.1:8210/solr/core_3
 at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
 at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at 
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
 at 
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown 
Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown 
Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown 
Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
 ... 1 more
Caused by: java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(Unknown Source)
 at java.net.SocketInputStream.read(Unknown Source)
 at 

Re: getting different search results for words with same meaning in Japanese language

2013-06-30 Thread Lance Norskog
The MappingCharFilter allows you to map both characters to one
character. If you do this during indexing and querying, searching with
one should find the other. This is sort of like synonyms, but on a
character-by-character basis.
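
A sketch of what that looks like in a schema, using the two katakana variants from the question (file name and field type name are illustrative):

# mapping-ja.txt: fold the small ェ onto the full-size エ so both spellings analyze identically
"ェ" => "エ"

<fieldType name="text_ja_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the charFilter runs before the tokenizer, at both index and query time -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ja.txt"/>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
  </analyzer>
</fieldType>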

Lance

On 06/18/2013 11:08 PM, Yash Sharma wrote:
 Hi,

 We have two Japanese words with the same meaning, ソフトウェア and ソフトウエア (notice
 the difference in the capital-I-looking character; the word means 'software'
 in English). When ソフトウェア is searched, it gives around 8 search
 results, but when ソフトウエア is searched, it gives only 2 search results.

 The Japanese translator told us that this is something called yugari (which
 means that the above words can be seen like authorise and authorize, so they
 should yield the same search results as they have the same meaning but are
 spelled differently).

 we have one solution to this issue - to use synonyms.txt and place all
 these similar words in this text file. This solved our problem to some
 extent but, in real time scenario, we do not have all the japanese
 technical words like software, product, technology, and so on and we cannot
 keep updating synonyms.txt on a daily basis.

 Is there any better solution, so that all the similar japanese words give
 same search results ?
 Any help is greatly appreciated.




Re: Http status 503 Error in solr cloud setup

2013-06-29 Thread Lance Norskog
I do not know what causes the error. This setup will not work. You need 
one or three zookeepers. SolrCloud demands that a majority of the ZK 
servers agree. If you have two ZKs this will not work.
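
For example, with a three-node ensemble, every Solr node is started pointing at all three ZooKeepers (hostnames illustrative):

java -DzkHost=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181 -jar start.jar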


On 06/29/2013 05:47 AM, Sagar Chaturvedi wrote:


Hi,

I set up 2 Solr instances on 2 different machines and also configured 2 
ZooKeeper servers on these machines. When I start Solr on both 
machines and try to access the Solr web admin, I get the following 
error in the browser:


Http status 503 -- server is shutting down

When I set up a single standalone Solr without ZooKeeper, I do not get 
this error.


Any insights ?

/Thanks and Regards,/

/Sagar Chaturvedi/

/Member Of Technical Staff /

/NEC Technologies India, Noida/

/09711931646/

DISCLAIMER:
---
The contents of this e-mail and any attachment(s) are confidential and
intended
for the named recipient(s) only.
It shall not attach any liability on the originator or NEC or its
affiliates. Any views or opinions presented in
this email are solely those of the author and may not necessarily reflect the
opinions of NEC or its affiliates.
Any form of reproduction, dissemination, copying, disclosure, modification,
distribution and / or publication of
this message without the prior written consent of the author of this e-mail is
strictly prohibited. If you have
received this email in error please delete it and notify the sender
immediately. .
---




Re: Varnish

2013-06-29 Thread Lance Norskog
Solr HTTP caching also supports ETags. These are unique keys for the 
output of a query. If you send a query twice and the index has not 
changed, the return will be the same. The ETag is generated from the 
query string and the index generation number.


If Varnish supports e-tags, you can keep some queries cached longer than 
your timeout.
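
On the Solr side, HTTP caching and the ETag seed are configured in solrconfig.xml; a minimal sketch (values illustrative):

<httpCaching never304="false" etagSeed="Solr">
  <cacheControl>max-age=600, public</cacheControl>
</httpCaching>

A client (or Varnish) that sends the ETag back in If-None-Match then gets a 304 as long as the index generation has not changed.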


Lance

On 06/29/2013 05:51 PM, William Bell wrote:

On a large website, by putting 1 varnish in front of all 4 SOLR boxes we
were able to trim 25% off the load time (TTFB) of the page.

Our hit ratio was between 55 and 75%. We gave varnish 24GB of RAM, and was
not able to fill it under full load with a 10 minute cache timeout.

We get about 2.4M SOLR calls every 15 to 20 minutes.

One varnish was able to handle it with almost no lingering connections, and
load average of < 1.

Varnish is very optimized and worth trying.



On Sat, Jun 29, 2013 at 6:47 PM, William Bell billnb...@gmail.com wrote:


OK.

Here is the answer for us. Here is a sample default.vcl. We are validating
the LastModified ( if (!beresp.http.last-modified) )
is changing when the core is indexed and the version changes of the index.

This does 10 minutes caching and a 1hr grace period (if solr is down, it
will deliver results up to 1 hr).

This uses the URL for caching.

You can also do:

http://localhost?PURGEME

To clear varnish if your IP is in the ACL list.


backend server1 {
 .host = XXX.domain.com;
 .port = 8983;
 .probe = {
 .url = /solr/pingall/select/?q=*%3A*;
 .interval = 5s;
 .timeout = 1s;
 .window = 5;
 .threshold = 3;
 }
}
backend server2{
 .host = XXX1.domain.com;
 .port = 8983;
 .probe = {
 .url = /solr/pingall/select/?q=*%3A*;
 .interval = 5s;
 .timeout = 1s;
 .window = 5;
 .threshold = 3;
 }
}
backend server3{
 .host = XXX2.domain.com;
 .port = 8983;
 .probe = {
 .url = /solr/pingall/select/?q=*%3A*;
 .interval = 5s;
 .timeout = 1s;
 .window = 5;
 .threshold = 3;
 }
}
backend server4{
 .host = XXX3.domain.com;
 .port = 8983;
 .probe = {
 .url = /solr/pingall/select/?q=*%3A*;
 .interval = 5s;
 .timeout = 1s;
 .window = 5;
 .threshold = 3;
 }
}

director default round-robin {
   {
 .backend = server1;
   }
   {
 .backend = server2;
   }
   {
 .backend = server3;
   }
   {
 .backend = server4;
   }
}

acl purge {
 localhost;
 10.0.1.0/24;
 10.0.3.0/24;
}


sub vcl_recv {
if (req.url ~ \?PURGEME$) {
 if (!client.ip ~ purge) {
 error 405 Not allowed.  + client.ip;
 }
 ban(req.url ~ /);
 error 200 Cached Cleared;
}
remove req.http.Cookie;
if (req.backend.healthy) {
  set req.grace = 15s;
} else {
  set req.grace = 1h;
}
return (lookup);
}

sub vcl_fetch {
   set beresp.grace = 1h;
   if (!beresp.http.last-modified) {
 set beresp.ttl = 600s;
   }
   if (beresp.ttl  600s) {
 set beresp.ttl = 600s;
   }
   unset beresp.http.Set-Cookie;
}

sub vcl_deliver {
 if (obj.hits  0) {
 set resp.http.X-Cache = HIT;
 } else {
 set resp.http.X-Cache = MISS;
 }
}

sub vcl_hash {
 hash_data(req.url);
 return (hash);
}






On Tue, Jun 25, 2013 at 4:44 PM, Learner bbar...@gmail.com wrote:


Check this link..
http://lucene.472066.n3.nabble.com/SolrJ-HTTP-caching-td490063.html



--
View this message in context:
http://lucene.472066.n3.nabble.com/Varnish-tp4072057p4073205.html
Sent from the Solr - User mailing list archive at Nabble.com.




--
Bill Bell
billnb...@gmail.com
cell 720-256-8076








Does SolrCloud require matching configuration files?

2013-06-22 Thread Lance Norskog
Accumulo is a BigTable/Cassandra style distributed database. It is now 
an Apache Incubator project. In the README we find this gem:


Synchronize your accumulo conf directory across the cluster. As a 
precaution against mis-configured systems, servers using different 
configuration files will not communicate with the rest of the cluster.


https://github.com/apache/accumulo

Would this be a good policy for SolrCloud? Accumulo is designed for 
multi-thousand-node clusters; this policy might be overkill.




Re: Adding pdf/word file using JSON/XML

2013-06-16 Thread Lance Norskog
No, they just learned a few features and then stopped because it was 
good enough, and they had a thousand other things to code.


As to REST: yes, it is worth having a coherent API. Solr is behind the 
curve here. Look at the HATEOAS paradigm. It's ornate (and a really goofy 
name) but it provides a lot of goodness: the API tells you how to use 
it. For example, a search page response includes a link for the next 
page; your UI finds the link and hangs it off a 'Next' button. Your UI 
does not need code for 'create a Next link'.
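
To make the paradigm concrete, a HATEOAS-style response might look like this (purely illustrative; it is not what Solr returns today):

{
  "results": [ "..." ],
  "_links": {
    "self": { "href": "/search?q=solr&page=2" },
    "next": { "href": "/search?q=solr&page=3" },
    "prev": { "href": "/search?q=solr&page=1" }
  }
}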


Also, don't do that /v1 crap. At this point we all know how it should work.

On 06/15/2013 07:35 AM, Grant Ingersoll wrote:

On Jun 13, 2013, at 11:24 AM, Walter Underwood wun...@wunderwood.org wrote:


That was my thought exactly. Contribute a REST request handler. --wunder


+1.  The bits are already in place for a lot of it now that RESTlet is in.

That being said, it truly amazes me that people were ever able to implement 
Solr, given some of the FUD in this thread.  I guess those tens of thousands of 
deployments out there were all done by above average devs...

-Grant




Re: Best way to match umlauts

2013-06-16 Thread Lance Norskog
One small thing: German u-umlaut is often flattened as 'ue' instead of 
'u'. And the same with o-umlaut, it can be 'oe' or 'o'. I don't know if 
Lucene has a good solution for this problem.


On 06/16/2013 06:44 AM, adityab wrote:

Thanks for the explanation Steve. I now see it clearly. In my case it should
work.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070805.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: SOLR-4872 and LUCENE-2145 (or, how to clean up a Tokenizer)

2013-06-12 Thread Lance Norskog
In 4.x and trunk there is a close() method on Tokenizers and Filters. In 
the currently released versions up to 4.3, there is instead a reset(stream) 
method, which is how a Tokenizer/Filter is reset for a following document in 
the same upload.


In both cases I had to track the first time the tokens are consumed, and 
do all of the setup then. If you do this, then reset(stream) can clear 
the native resources, and let you re-load them on the next consume.


Look at LUCENE-2899 in OpenNLPTokenizer and OpenNLPFilter.java to see 
what I had to do.


But yes, to be absolutely sure, you need to add a finalizer.

On 06/12/2013 04:34 AM, Benson Margulies wrote:

Could I have some help on the combination of these two? Right now, it
appears that I'm stuck with a finalizer to chase after native
resources in a Tokenizer. Am I missing something?




Re: OPENNLP problems

2013-06-09 Thread Lance Norskog

text_opennlp has the right behavior.
text_opennlp_pos does what you describe.
I'll look some more.

On 06/09/2013 04:38 PM, Patrick Mi wrote:

Hi Lance,

I updated the src from 4.x and applied the latest patch LUCENE-2899-x.patch
uploaded on 6th June but still had the same problem.


Regards,
Patrick

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, 6 June 2013 5:16 p.m.
To: solr-user@lucene.apache.org
Subject: Re: OPENNLP problems

Patrick-
I found the problem with multiple documents. The problem was that the
API for the life cycle of a Tokenizer changed, and I only noticed part
of the change. You can now upload multiple documents in one post, and
the OpenNLPTokenizer will process each document.

You're right, the example on the wiki is wrong. The FilterPayloadsFilter
default is to remove the given payloads, and needs keepPayloads=true
to retain them.

The fixed patch is up as LUCENE-2899-x.patch. Again, thanks for trying it.

Lance

https://issues.apache.org/jira/browse/LUCENE-2899

On 05/28/2013 10:08 PM, Patrick Mi wrote:

Hi there,

Checked out branch_4x and applied the latest patch
LUCENE-2899-current.patch however I ran into 2 problems

Followed the wiki page instruction and set up a field with this type

aiming

to keep nouns and verbs and do a facet on the field
==
<fieldType name="text_opennlp_nvf" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
==

Struggled to get that going until I put the extra parameter
keepPayloads=true in as below.
   <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
           payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>

Question: am I doing the right thing? Is this a mistake on wiki

Second problem:

Posted the document xml one by one to the solr and the result was what I
expected.

<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field></doc>
</add>

However if I put multiple documents into the same xml file and post it in
one go, only the first document gets processed (only 'check' and 'hotel' were
showing in the facet result).
   
<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="text_opennlp_nvf">removes the payloads</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="text_opennlp_nvf">retains only nouns and verbs </field>
</doc>
</add>

Same problem when updated the data using csv upload.

Is that a bug or something I did wrong?

Thanks in advance!

Regards,
Patrick








Re: OPENNLP problems

2013-06-09 Thread Lance Norskog

Found the problem. Please see:
https://issues.apache.org/jira/browse/LUCENE-2899?focusedCommentId=13679293page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13679293
On 06/09/2013 04:38 PM, Patrick Mi wrote:

Hi Lance,

I updated the src from 4.x and applied the latest patch LUCENE-2899-x.patch
uploaded on 6th June but still had the same problem.


Regards,
Patrick

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, 6 June 2013 5:16 p.m.
To: solr-user@lucene.apache.org
Subject: Re: OPENNLP problems

Patrick-
I found the problem with multiple documents. The problem was that the
API for the life cycle of a Tokenizer changed, and I only noticed part
of the change. You can now upload multiple documents in one post, and
the OpenNLPTokenizer will process each document.

You're right, the example on the wiki is wrong. The FilterPayloadsFilter
default is to remove the given payloads, and needs keepPayloads=true
to retain them.

The fixed patch is up as LUCENE-2899-x.patch. Again, thanks for trying it.

Lance

https://issues.apache.org/jira/browse/LUCENE-2899

On 05/28/2013 10:08 PM, Patrick Mi wrote:

Hi there,

Checked out branch_4x and applied the latest patch
LUCENE-2899-current.patch however I ran into 2 problems

Followed the wiki page instruction and set up a field with this type

aiming

to keep nouns and verbs and do a facet on the field
==
<fieldType name="text_opennlp_nvf" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
==

Struggled to get that going until I put the extra parameter
keepPayloads=true in as below.
   <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
           payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>

Question: am I doing the right thing? Is this a mistake on the wiki?

Second problem:

Posted the document XML one by one to Solr and the result was what I
expected.

<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
</add>

However, if I put multiple documents into the same XML file and post it in
one go, only the first document gets processed (only 'check' and 'hotel' were
showing in the facet result.)
   
<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="text_opennlp_nvf">removes the payloads</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="text_opennlp_nvf">retains only nouns and verbs</field>
</doc>
</add>

Same problem when I updated the data using CSV upload.

Is that a bug or something I did wrong?

Thanks in advance!

Regards,
Patrick








Re: OPENNLP problems

2013-06-05 Thread Lance Norskog

Patrick-
I found the problem with multiple documents. The problem was that the 
API for the life cycle of a Tokenizer changed, and I only noticed part 
of the change. You can now upload multiple documents in one post, and 
the OpenNLPTokenizer will process each document.


You're right, the example on the wiki is wrong. The FilterPayloadsFilter 
default is to remove the given payloads, and needs keepPayloads=true 
to retain them.


The fixed patch is up as LUCENE-2899-x.patch. Again, thanks for trying it.

Lance

https://issues.apache.org/jira/browse/LUCENE-2899

On 05/28/2013 10:08 PM, Patrick Mi wrote:

Hi there,

Checked out branch_4x and applied the latest patch
LUCENE-2899-current.patch however I ran into 2 problems

Followed the wiki page instruction and set up a field with this type aiming
to keep nouns and verbs and do a facet on the field
==
<fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory" payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
==

Struggled to get that going until I put the extra parameter
keepPayloads=true in as below.
  <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
          payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>

Question: am I doing the right thing? Is this a mistake on the wiki?

Second problem:

Posted the document XML one by one to Solr and the result was what I
expected.

<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
</add>

However, if I put multiple documents into the same XML file and post it in
one go, only the first document gets processed (only 'check' and 'hotel' were
showing in the facet result.)
  
<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="text_opennlp_nvf">removes the payloads</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="text_opennlp_nvf">retains only nouns and verbs</field>
</doc>
</add>

Same problem when I updated the data using CSV upload.

Is that a bug or something I did wrong?

Thanks in advance!

Regards,
Patrick






Re: Dynamic Indexing using DB and DIH

2013-06-02 Thread Lance Norskog
Let's assume that the Solr record includes the database record's
timestamp field. You can make a more complex DIH stack that does a Solr
query with the SolrEntityProcessor. You can do a query that gets the
most recent timestamp in the index, and then use that in the DB update
command.
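For reference, the stock DIH way to say "only rows changed since the last run"
is the delta-import pattern with the built-in ${dih.last_index_time} variable;
Lance's suggestion replaces that variable with a timestamp read back from the
index itself via SolrEntityProcessor. A minimal delta sketch (table and column
names are invented for illustration):

<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dih.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id='${dih.delta.id}'"/>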


On 06/02/2013 06:25 PM, PeriS wrote:

Currently I have wired up the dataimporthandler to do a full and incremental 
indexing. I was wondering if there was way to automatically update the indexes 
as soon as the row in the table gets updated. I don't want to get into any sort 
of cron jobs, triggers etc; Current what I do is as soon as I update the row, i 
follow it up by calling the delta import. But in this case its about timing and 
if SOLR doesn't see the row as updated, then it doesn't do anything….any ideas?

-Peri.S





Re: Shard Keys and Distributed Search

2013-06-02 Thread Lance Norskog
Distributed search does the actual search twice: once to get the scores 
and again to fetch the documents with the top N scores. This algorithm 
does not play well with deep searches.
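(A rough illustration: a request like q=foo&start=100000&rows=10 against a
sharded collection makes each shard score and return its top 100,010 document
ids in the first pass; only then can the coordinator merge the lists and fetch
the 10 documents it actually needs, so the cost grows with the paging depth.)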


On 06/02/2013 07:32 PM, Niran Fajemisin wrote:

Thanks Daniel.

That's exactly what I thought as well. I did try passing the distrib=false 
parameter and specifying the shards local to the initial server being invoked 
and yes it did localize the search to the initial server that was invoked. I 
unfortunately didn't see any marked improvement in performance as we have a 
very fast network (8Gbit host bust adaptor over a fiber channel etc.) and are 
backed by SSD on a SAN. The only (and most painful part) of our scenario is 
that we fetch a lot of documents at once, upwards of 50,000... yes... not the
ideal use case for Solr (to put it mildly).

The performance as expected, even for such a large request is quite impressive 
when the inbound request is initially routed to the exact shard containing the 
documents (again we use shard key so the composite id router will be invoked). 
In this case I've noticed that no distributed search is performed...at least 
from my limited observation.

Thanks again for your response.

Cheers.





From: Daniel Collins danwcoll...@gmail.com
To: Solr User solr-user@lucene.apache.org
Sent: Saturday, June 1, 2013 4:09 AM
Subject: Re: Shard Keys and Distributed Search


Yes it is doing a distributed search, Solr cloud will do that by default unless 
you say distrib=false.

My understanding of Solr's Load balancer is that it picks a random instance 
from the list of available instances serving each shard.
So in your example:

1. Query comes in to Server 1, server 1 de-constructs it and works out which 
shards it needs to query. It then gets a list (from ZK) of all the instances in 
that collection which can service that shard, and the LB in Solr just picks one 
(at random).
2. It has picked Server 3 in your case, so the request goes there.
3. The request is still a 2-stage process (in terms of what you see in the logs), 1 query to get the docIds 
(using your query data) and then a second query to get the stored fields, once it has the correct 
list of docs. This is necessary because in a general multi-shard query, the responses will have to go back to 
server 1 and be consolidated (not 100% sure of this area but I believe this is true and it makes logical 
sense to me), so if you had a query for 10 records that needed to access 4 shards, it would ask for the 
top 10 from each shard, then combine/sort them to get the overall top 10, and then 
get the stored fields for those 10 (which might be 5 from shard 1, 2 from shard2 and 3 from shard3, nothing 
from shard4 for example).

You are right that it seems counter intuitive from the users's perspective, but I don't think 
Solr Cloud currently has any logic to favour a local instance over a remote one, I guess that 
would be a change to CloudSolrServer? Alternatively, you can do it in your client, send a 
non-distributed query, so append
distrib=false&shards=localhost:8983/solr,localhost:7574/solr.

-Original Message- From: Niran Fajemisin
Sent: Friday, May 31, 2013 5:00 PM
To: Solr User
Subject: Shard Keys and Distributed Search

Hi all,

I'm trying to make sure that I understand under what circumstance a distributed 
search is performed against Solr and if my general understanding of what 
constitutes a distributed search is correct.

I have a Solr collection that was created using the Collections API with the 
following parameters: numShards=5 & maxShardsPerNode=5 & replicationFactor=4.
Given that we have 4 servers this will result in 5 shards being created on each 
server. All documents indexed into Solr have a shard key specified as a part of 
their document id, such that we can use the same shard key prefix as a part of 
our query by specifying: shard.keys=myshardkey!

My assumption was that when the search request is submitted, given that my 
deployment topology has all possible shards available on each server, there 
will be no need to call out to other servers in the cluster to fulfill the 
search. What I am noticing is the following:

1. Submit a search to Server 1 with the shard.keys parameter specified. (Note 
again that replicas for shard 1-5 are all available on the Server 1.)
2. The request is forwarded to a server other than Server 1, for example Server 
3.
3. The  /select request handler of Server 3 is invoked. This proceeds to 
execute the /select request, asking for the id and score fields for each 
document that matches the submittted query. I also noticed that it passes the 
shard.url parameter but states that distrib=false.
4. Then *another* request is executed on Server 3 for the /select request 
handler *again*. This time the ids returned from the previous search are passed 
in as the ids parameters.
5. Finally the results are passed back to the caller through the original 
server, 

Re: OPENNLP problems

2013-05-30 Thread Lance Norskog

I will look at these problems. Thanks for trying it out!

Lance Norskog

On 05/28/2013 10:08 PM, Patrick Mi wrote:

Hi there,

Checked out branch_4x and applied the latest patch
LUCENE-2899-current.patch however I ran into 2 problems

Followed the wiki page instruction and set up a field with this type aiming
to keep nouns and verbs and do a facet on the field
==
<fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory" payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
==

Struggled to get that going until I put the extra parameter
keepPayloads=true in as below.
  <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
          payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>

Question: am I doing the right thing? Is this a mistake on the wiki?

Second problem:

Posted the document XML one by one to Solr and the result was what I
expected.

<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
</add>

However, if I put multiple documents into the same XML file and post it in
one go, only the first document gets processed (only 'check' and 'hotel' were
showing in the facet result.)
  
<add>
<doc>
  <field name="id">1</field>
  <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="text_opennlp_nvf">removes the payloads</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="text_opennlp_nvf">retains only nouns and verbs</field>
</doc>
</add>

Same problem when I updated the data using CSV upload.

Is that a bug or something I did wrong?

Thanks in advance!

Regards,
Patrick






Re: Regular expression in solr

2013-05-22 Thread Lance Norskog
If the indexed data includes positions, it should be possible to 
implement ^ and $ as the first and last positions.


On 05/22/2013 04:08 AM, Oussama Jilal wrote:
There is no ^ or $ in the Solr regex since the regular expression will
match tokens (not the complete indexed text). So the results you get
will basically depend on your way of indexing. If you use the regex on a
tokenized field and that is not what you want, try to use a copy field
which is not tokenized and then use the regex on that one.
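For illustration, the copy-field approach could look like this in schema.xml
(field names invented):

<field name="title"       type="text_general" indexed="true" stored="true"/>
<field name="title_exact" type="string"       indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

The regex is then run against the untokenized copy, e.g.
title_exact:/[rR]egular expression.*/ - since a string field indexes the whole
value as a single term, the pattern effectively has to match from the start to
the end of the value, which is the closest thing to ^ and $ anchoring.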


On 05/22/2013 11:53 AM, Stéphane Habett Roux wrote:

I just can't get the $ endpoint to work.

I am not sure but I heard it works with the Java Regex engine (a 
little obvious if it is true ...), so any Java regex tutorial would 
help you.


On 05/22/2013 11:42 AM, Sagar Chaturvedi wrote:
Yes, it works for me too. But many times result is not as expected. 
Is there some guide on use of regex in solr?


-Original Message-
From: Oussama Jilal [mailto:jilal.ouss...@gmail.com]
Sent: Wednesday, May 22, 2013 4:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Regular expression in solr

I don't think so, it always worked for me without anything special, 
just try it and see :)


On 05/22/2013 11:26 AM, Sagar Chaturvedi wrote:
@Oussama Thank you for your reply. Is it as simple as that? I mean 
no additional settings required?


-Original Message-
From: Oussama Jilal [mailto:jilal.ouss...@gmail.com]
Sent: Wednesday, May 22, 2013 3:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Regular expression in solr

You can write a regular expression query like this (you need to 
specify the regex between slashes / ) :


fieldName:/[rR]egular.*/

On 05/22/2013 10:51 AM, Sagar Chaturvedi wrote:

Hi,

How do we search based upon regular expressions in solr?

Regards,
Sagar










Re: Upgrading from SOLR 3.5 to 4.2.1 Results.

2013-05-17 Thread Lance Norskog
This is great; data like this is rare. Can you tell us any hardware or 
throughput numbers?


On 05/17/2013 12:29 PM, Rishi Easwaran wrote:


Hi All,

It's Friday 3:00pm, warm & sunny outside, and it was a good week. Figured I'd
share some good news.
I work for the AOL mail team and we use SOLR for our mail search backend.
We have been using it since pre-SOLR 1.4 and are strong supporters of the SOLR
community.
We deal with millions indexes and billions of requests a day across our complex.
We finished full rollout of SOLR 4.2.1 into our production last week.

Some key highlights:
- ~75% Reduction in Search response times
- ~50% Reduction in SOLR Disk busy , which in turn helped with ~90% Reduction 
in errors
- Garbage collection total stop time reduced by over 50%, moving application
throughput into the 99.8% - 99.9% range
- ~15% reduction in CPU usage

We did not tune our application moving from 3.5 to 4.2.1 nor update java.
For the most part it was a binary upgrade, with patches for our special use 
case.

Now going forward we are looking at prototyping SOLR Cloud for our search 
system, upgrade java and tomcat, tune our application further. Lots of fun 
stuff :)

Have a great weekend everyone.
Thanks,

Rishi.









Re: SOLR guidance required

2013-05-13 Thread Lance Norskog

If this is for the US, remove the age range feature before you get sued.

On 05/09/2013 08:41 PM, Kamal Palei wrote:

Dear SOLR experts
I might be asking a very silly question. As I am new to SOLR kindly guide
me.


I have a job site. Using SOLR to search resumes. When a HR user enters some
keywords say JAVA, MySQL etc, I search resume documents using SOLR,
retrieve 100 records and show to user.

The problem I face is say, I retrieved 100 records, then we do filtering
for experience range, age range, salary range (using mysql query).
Sometimes it so happens that the 100 records I fetch , I do not get a
single record to show to user. When user clicks next link there might be
few records, it looks odd really.


I hope there must be some mechanism, by which I can associate salary,
experience, age etc with resume document during indexing. And when
I search for resumes I can give all filters accordingly and can retrieve
100 records and straight away I can show 100 records to the user without doing any
mysql query. Please let me know if this is feasible. If so, kindly give me
some pointer how do I do it.

Best Regards
Kamal
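For illustration, the setup being asked about could look roughly like this,
with invented field names and a trie int type such as the example schema's
"tint". In schema.xml, index the numeric attributes alongside the resume text:

<field name="resume_text"      type="text_general" indexed="true" stored="true"/>
<field name="salary"           type="tint"         indexed="true" stored="true"/>
<field name="experience_years" type="tint"         indexed="true" stored="true"/>

Then the keyword search and the range filters happen in one Solr request, so
every page of 100 results is already filtered:

http://localhost:8983/solr/select?q=resume_text:(java AND mysql)&fq=salary:[30000 TO 80000]&fq=experience_years:[3 TO *]&rows=100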





Re: Why is SolrCloud doing a full copy of the index?

2013-05-04 Thread Lance Norskog

Great! Thank you very much Shawn.

On 05/04/2013 10:55 AM, Shawn Heisey wrote:

On 5/4/2013 11:45 AM, Shawn Heisey wrote:

Advance warning: this is a long reply.

I have condensed some relevant performance problem information into the
following wiki page:

http://wiki.apache.org/solr/SolrPerformanceProblems

Anyone who has additional information for this page, feel free to add
it.  I hope I haven't made too many mistakes!

Thanks,
Shawn





Re: SolrCloud vs Solr master-slave replication

2013-04-18 Thread Lance Norskog
Run checksums on all files in both master and slave, and verify that 
they are the same.

TCP/IP has a checksum algorithm that was state-of-the-art in 1969.

On 04/18/2013 02:10 AM, Victor Ruiz wrote:

Also, I forgot to say... the same error started to happen again.. the index
is again corrupted :(



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-vs-Solr-master-slave-replication-tp4055541p4056926.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Spatial search question

2013-04-12 Thread Lance Norskog

Outer distance AND NOT inner distance?
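Sketched as a single filter with a made-up field and point (the frange/geodist
combination, distances in km):

fq={!frange l=20 u=40}geodist()&sfield=store&pt=45.15,-93.85

i.e. keep only documents whose distance from the point is between the inner
and outer radius, which gives the donut.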

On 04/12/2013 09:02 AM, kfdroid wrote:

We currently do a radius search from a given Lat/Long point and it works
great. I have a new requirement to do a search on a larger radius from the
same point, but not include the smaller radius.  Kind of a donut (torus)
shaped search.

How would I do this (Solr 4)?  Search where radius is between 20km and 40km
for example?
Thanks,
Ken



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spatial-search-question-tp4055597.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Flow Chart of Solr

2013-04-07 Thread Lance Norskog
Seconded. Single-stepping really is the best way to follow the logic 
chains and see how the data mutates.


On 04/05/2013 06:36 AM, Erick Erickson wrote:

Then there's my lazy method. Fire up the IDE and find a test case that
looks close to something you want to understand further. Step through
it all in the debugger. I admit there'll be some fumbling at the start
to _find_ the test case, but they're pretty well named. In IntelliJ,
all you have to do is right-click on the test case and the context
menu says debug blahbalbhabl You can chart the class
relationships you actually wind up in as you go. This seems tedious,
but it saves me getting lost in the class hierarchy.

Also, there are some convenient tools in the IDE that will show you
class hierarchies as you need.

Or attach your debugger to a running Solr, which is actually very
easy. In IntelliJ (and Eclipse has something very similar), create a
remote project. That'll specify some parameters you start up with,
e.g.:
java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5900
-jar start.jar

Now start up the remote debugging session you just created in the IDE
and you are attached to a live solr instance and able to step through
any code you want.

Either way, you can make the IDE work for you!

FWIW,
Erick

On Wed, Apr 3, 2013 at 12:03 PM, Jack Krupansky j...@basetechnology.com wrote:

We're using the 4.x branch code as the basis for our writing. So,
effectively it will be for at least 4.3 when the book comes out in the
summer.

Early access will be in about a month or so. O'Reilly will be showing a
galley proof for 200 pages of the book next week at Big Data TechCon next
week in Boston.


-- Jack Krupansky

-Original Message- From: Jack Park
Sent: Wednesday, April 03, 2013 12:56 PM

To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr

Jack,

Is that new book up to the 4.+ series?

Thanks
The other Jack

On Wed, Apr 3, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com
wrote:

And another one on the way:

http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957

Hopefully that help a lot as well. Plenty of diagrams. Lots of examples.

-- Jack Krupansky

-Original Message- From: Jack Park
Sent: Wednesday, April 03, 2013 11:25 AM

To: solr-user@lucene.apache.org
Subject: Re: Flow Chart of Solr

There are three books on Solr, two with that in the title, and one,
Taming Text, each of which have been very valuable in understanding
Solr.

Jack

On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com
wrote:


Sure, yes. But... it comes down to what level of detail you want and need
for a specific task. In other words, there are probably a dozen or more
levels of detail. The reality is that if you are going to work at the
Solr
code level, that is very, very different than being a user of Solr, and
at
that point your first step is to become familiar with the code itself.

When you talk about parsing and stemming, you are really talking
about
the user-level, not the Solr code level. Maybe what you really need is a
cheat sheet that maps a user-visible feature to the main Solr code
component
for that implements that user feature.

There are a number of different forms of parsing in Solr - parsing of
what? Queries? Requests? Solr documents? Function queries?

Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does
that.
Lucene does all of the token filtering. Are you asking for details on
how
Lucene works? Maybe you meant to ask how term analysis works, which is
split between Solr and Lucene. Or maybe you simply wanted to know when
and
where term analysis is done. Tell us your specific problem or specific
question and we can probably quickly give you an answer.

In truth, NOBODY uses flow charts anymore. Sure, there are some
user-level
diagrams, but not down to the code level.

If you could focus on specific questions, we could give you specific
answers.

Main steps? That depends on what level you are working at. Tell us what
problem you are trying to solve and we can point you to the relevant
areas.

In truth, if you become generally familiar with Solr at the user level
(study the wikis), you will already know what the main steps are.

So, it is not main steps of Solr, but main steps of some specific
request of Solr, and for a specified level of detail, and for a
specified
area of Solr if greater detail is needed. Be more specific, and then we
can
be more specific.

For now, the general advice for people who need or want to go far beyond
the
user level is to get familiar with the code - just LOOK at it - a lot
of
the package and class names are OBVIOUS, really, and follow the class
hierarchy and code flow using the standard features of any modern Java
IDE.
If you are wondering where to start for some specific user-level feature,
please ask specifically about that feature. But... make a diligent effort
to
discover and learn on your own before asking 

Re: Blog Post: Integration Testing SOLR Index with Maven

2013-03-14 Thread Lance Norskog
Wow! That's great. And it's a lot of work, especially getting it all 
keyboard-complete. Thank you.


On 03/14/2013 01:29 AM, Chantal Ackermann wrote:

Hi all,


this is not a question. I just wanted to announce that I've written a blog post 
on how to set up Maven for packaging and automatic testing of a SOLR index 
configuration.

http://blog.it-agenten.com/2013/03/integration-testing-your-solr-index-with-maven/

Feedback or comments appreciated!
And again, thanks for that great piece of software.

Chantal





Re: InvalidShapeException when using SpatialRecursivePrefixTreeFieldType with custom worldBounds

2013-03-09 Thread Lance Norskog
Thank you (and Hoss)! I have found this concept elusive, and you two 
have nailed it. I will be able to understand it for the 5 minutes I will 
need to code with it.


Lance

On 03/09/2013 10:57 AM, David Smiley (@MITRE.org) wrote:

Just finished:
http://wiki.apache.org/solr/SpatialForTimeDurations



-
  Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/InvalidShapeException-when-using-SpatialRecursivePrefixTreeFieldType-with-custom-worldBounds-tp4045351p4045997.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Returning to Solr 4.0 from 4.1

2013-03-01 Thread Lance Norskog

Yes, the SolrEntityProcessor can be used for this.
If you stored the original document bodies in the Solr index!
You can also download the documents in Json or CSV format and re-upload 
those to old Solr. I don't know if CSV will work for your docs.  If CSV 
works, you can directly upload what you download. If you download JSON, 
you have to unwrap the outermost structure and upload the data as an 
array.


There are problems with the SolrEntityProcessor: 1) It is
single-threaded. 2) If you 'copyField' to a field, and store that field,
you have to be sure not to reload the contents of the field, because you
will add a new copy from the 'source' field.
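A minimal DIH config for that kind of index-to-index copy might look like the
following (host, core name and batch size are placeholders; it assumes every
field you need is stored in the source index):

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://oldhost:8983/solr/collection1"
            query="*:*" rows="500" fl="*"/>
  </document>
</dataConfig>

Per the copyField caveat above, either list the fields you want in fl
explicitly or make sure copyField targets are excluded, so they are not
re-added twice.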


On 03/01/2013 04:48 AM, Alexandre Rafalovitch wrote:

What about SolrEntityProcessor in DIH?
https://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

Regards,
 Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Mar 1, 2013 at 5:16 AM, Dotan Cohen dotanco...@gmail.com wrote:


On Fri, Mar 1, 2013 at 11:59 AM, Rafał Kuć r@solr.pl wrote:

Hello!

I assumed that re-indexing can be painful in your case, if it wouldn't
you probably would re-index by now :) I guess (didn't test it myself),
that you can create another collection inside your cluster, use the
old codec for Lucene 4.0 (setting the version in solrconfig.xml should
be enough) and re-indexing, but still re-indexing will have to be
done. Or maybe someone knows a better way ?


Will I have to reindex via an external script bridging, such as a
Python script which requests N documents at a time, indexes them into
Solr 4.1, then requests another N documents to index? Or is there
internal Solr / Lucene facility for this? I've actually looked for
such a facility, but as I am unable to find such a thing I ask.


--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com





Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-25 Thread Lance Norskog

Do you use replication instead, or do you just have one instance?

On 02/25/2013 07:55 PM, Otis Gospodnetic wrote:

Hi,

Quick poll to see what % of Solr users use SolrCloud vs. Master-slave setup:

http://blog.sematext.com/2013/02/25/poll-solr-cloud-or-not/

I have to say I'm surprised with the results so far!

Thanks,
Otis
--
Solr  ElasticSearch Support
http://sematext.com/





Re: Benefits of Solr over Lucene?

2013-02-12 Thread Lance Norskog
Lucene and Solr have an aggressive upgrade schedule. From 3 to 4 there was
a major rewiring, and parts are orders of magnitude faster and smaller.
If you code directly against Lucene, you will never upgrade to newer versions.
(I supported Solr/Lucene customers for 3 years, and nobody ever did.)

Cheers,
Lance


I know that Solr web-enables a Lucene index, but I'm trying to figure out
what other things Solr offers over Lucene.  On the Solr features list it
says Solr uses the Lucene search library and extends it!, but what exactly
are the extensions from the list and what did Lucene give you?  Also if I
have an index built through Solr is there a non-HTTP way to search that
index? Because SolrJ essentially just makes HTTP requests, correct?

Some features Im particularly interested in are:
Geospatial Search
Highlighting
Dynamic Fields
Near Real-Time Indexing
Multiple Search Indices

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Benefits-of-Solr-over-Lucene-tp4039964.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Upgrading indexes from Solr 1.4.1 to 4.1.0

2013-02-04 Thread Lance Norskog
A side problem here is text analyzers: the analyzers have changed how
they split apart text for searching, and they work as matched pairs. That is,
the query-time analysis is built to match what the analyzer did at
indexing time. If you do this binary upgrade sequence, the indexed data will
not match what the new analyzers do. It is not a major problem, but queries
will not bring back what you expect.


Also, in 4.x, the unique field has to be called 'id' and every document 
needs a '_version_' field.


On 02/04/2013 09:32 AM, Upayavira wrote:

Just to add a little to the good stuff Shawn has shared here - Solr 4.1
does not support 1.4.1 indexes. If you cannot re-index (by far
recommended), then first upgrade to 3.6, then optimize your index, which
will convert it to 3.6 format. Then you will be able to use that index
in 4.1. The simple logic here is that Solr/Lucene can read the indexes
of the previous major version. Given you are two major versions behind,
you'd have to do it in two steps.

Upayavira
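For reference, the optimize step Upayavira describes can be triggered with a
plain update request once the index is running under 3.6 (single-core URL
assumed):

http://localhost:8983/solr/update?optimize=true

Optimizing rewrites every segment, so the whole index ends up in the 3.6
segment format that 4.1 can still read.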

On Mon, Feb 4, 2013, at 03:18 PM, Shawn Heisey wrote:

On 2/4/2013 7:20 AM, Artem OXSEED wrote:

I need to upgrade our Solr installation from 1.4.1 to the latest 4.1.0
version. The question is how to deal with indexes. AFAIU there are two
things to be aware of: file format and index format (excuse me for
possible term mismatch, I'm new to Solr) - and while file format can
(and will automatically?) be updated if old index files are used by new
Solr installation, one cannot say the same about index format. Is it true?

And if the above is true then the question is - should this index
format be updated at all - i.e. if we can happily live with it then
it's fine, but I guess that this decision will not bring
performance/feature improvements that were introduced since 1.4.1
version, will it?

Assuming we do need to update this index format, how to do it? I found
solution on SO
(http://stackoverflow.com/questions/4528063/moving-data-from-solr-1-5-to-solr-4-0)
that includes usage of some export to XML feature, maybe with Luke,
some custom-made XSLT transformation and import back. Seems like a lot
to do - although it's quite understandable. However, this answer was
given in 2010 with Solr 4.0 being in pre-alpha - so maybe there are now
tools for this now?

Artem,

When upgrading Solr, the absolute best option is always to delete (or
move) your index directory, let the new version recreate it, and rebuild
from scratch by reindexing from your original data source.  This should
always remain an option - the indexes may get corrupted by an unexpected
situation.  If you have the ability to rebuild your 1.4.1 index from
your original data source, then it should be straightforward to do the
same thing on the new version.

Solr 4.1 can read version 3.x indexes, but I would not be surprised to
find that it can't read the Lucene 2.9.x format that Solr 1.4.1 uses.  I
don't know how much difference there is between the 2.9.x format and the
3.x format.  I'm not aware of a distinction between file and index
formats.

If a Solr version supports an older format, then it will read the
segments created in that format, but new segments will be in the new
format.  Solr/Lucene index segments on disk are never changed once they
are finalized.  They can be merged into new segments and then deleted,
but nothing will ever change them.

Have you stored every single field individually in Solr?  If you have,
then you will be able to retrieve the data to reindex into the new
version.  If you have fields that are indexed but not stored, then even
with the XML method you will be unable to obtain all the data.  It is
fairly normal in a Solr schema to have fields that you can search on but
that are not stored, because stored fields make the index larger.

If you have stored every single field in your index, you can also use
the SolrEntityProcessor in the dataimport handler to import from an old
Solr server to a new one.

The critical piece of the puzzle for upgrading between incompatible
versions is that you must be storing every field in the old version
before you start.  If you aren't storing a particular field, then the
data from that field is not retrievable and you must go back to the
original data source.

http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

Thanks,
Shawn





Re: Upgrading indexes from Solr 1.4.1 to 4.1.0

2013-02-04 Thread Lance Norskog
I don't have the source handy. I believe that SolrCloud hard-codes 'id' 
as the field name for defining shards.


On 02/04/2013 10:19 AM, Shawn Heisey wrote:

On 2/4/2013 10:58 AM, Lance Norskog wrote:

A side problem here is text analyzers: the analyzers have changed how
they split apart text for searching, and are matched pairs. That is, the
analyzer queries are created matching what the analyzer did when
indexing. If you do this binary upgrade sequence, the indexed data will
not match what the analyzers do. It is not a major problem, but queries
will not bring back what you expect.

Also, in 4.x, the unique field has to be called 'id' and every document
needs a '_version_' field.


My unique field isn't called 'id' ... it's called 'tag_id' ... what 
features will I be unable to use properly?


Thanks,
Shawn





Re: Solr load balancer

2013-01-31 Thread Lance Norskog
It is possible to do this with IP Multicast. The query goes out on the 
multicast and all query servers read it. The servers wait for a random 
amount of time, then transmit the answer. Here's the trick: it's 
multicast. All of the query servers listen to each other's responses, 
and drop out when another server answers the query. The server has to 
decide whether to do the query before responding; this would take some 
tuning.


Having all participants snoop on their peers is a really powerful 
design. I worked on a telecom system that used IP Multicast to do 
shortest-path-first allocation of T1 lines.  Worked really well. It's a 
shame Enron never used it.


On 01/24/2013 04:17 PM, Chris Hostetter wrote:

: For example perhaps a load balancer that sends multiple queries
: concurrently to all/some replicas and only keeps the first response
: might be effective. Or maybe a load balancer which takes account of the

I know of other distributed query systems that use this approach, when
query speed is more important to people then load and people who use them
seem to think it works well.

given that it synthetically multiplies the load of each end user request,
it's probably not something we'd want to turn on by default, but a
configurable option certainly seems like it might be handy.


-Hoss




Re: Indexing nouns only - UIMA vs. OpenNLP

2013-01-31 Thread Lance Norskog

Thanks, Kai!

About removing non-nouns: the OpenNLP patch includes two simple
TokenFilters for manipulating terms with payloads. The
FilterPayloadsFilter lets you keep or remove terms with given payloads.
In the demo schema.xml, there is an example type that keeps only
nouns & verbs.


There is a universal mapping for parts-of-speech systems for different 
languages. There is no Solr/Lucene support for it.

http://code.google.com/p/universal-pos-tags/

On 01/31/2013 09:47 AM, Kai Gülzau wrote:

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for english texts and filter (un)wanted 
token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
               descriptorPath="/uima/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.TokenAnnotation"
               featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt" />
  </analyzer>
</fieldType>

Open issue - How to set the ModelFile for the Tagger to 
german/TuebaModel.dat ???



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is 
now working
with solr 4.1. :-)

<fieldType name="nlp_nouns_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="opennlp/de-token.bin" />
    <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/de-pos-maxent.bin" />
    <filter class="solr.FilterPayloadsFilterFactory" payloadList="NN,NNS,NNP,NNPS,FM" keepPayloads="true"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>



Any hints on which lib is more accurate on noun tagging?
Any performance or memory issues (some OOM here while testing with 1GB via 
Analyzer Admin GUI)?


Regards,

Kai Gülzau




-Original Message-
From: Kai Gülzau [mailto:kguel...@novomind.com]
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of german and english texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)


First try was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"/>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- But how do I set the ModelFile to use the german corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via solr contrib/langid field mapping?
- How to remove non nouns in the annotated field?


Second try is to use OpenNLP and to apply the patch 
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
Currently I try to get it to work with solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau





Re: Solr 4 slower than Solr 3.x?

2013-01-28 Thread Lance Norskog
For this second report, it's easy: switching from a single query server 
to a sharded query is going to be slower. Virtual machines add jitter to 
the performance and response time of the front-end vs the query shards. 
Distributed search does 2 round-trips for each sharded query. Add these 
all up and your response time curve flattens out.


Here's how to consider it, using probability arithmetic: suppose the 
best case is 1 and the worst case is zero, and the mean is .8. If you 
put two of these measurements in a row, the overall mean becomes 0.8 * 
0.8 = 0.64. This is a longer, flatter curve. If a simple search is one 
round-trip measurement, a distributed search has three measurements in a 
row. Or, 0.8 cubed = .512. The standard deviation is the flatness of the 
curve and the fatness of the tail. When you add in the jitter caused by 
using virtual servers, the standard deviation of the curve increases, 
making the curve flatter and the long tail fatter. Notice that his 
best-case query time was faster in 4.0 than with 3.6.1. The core 4.0 
data structures are much cleaner and faster. It's the distributed 
topology that's killing him.


There is no law that says you can't use the indexer/query topology in 
4.0. SolrCloud's virtues only kick in after your deployment need several 
shards.


On 01/17/2013 08:08 AM, Otis Gospodnetic wrote:

Hello,

Here is another one from the other day:
http://search-lucene.com/m/tqmNjXO51B/SolrCloud+Performance+for+High+Query+Volume

Am I the only one seeing people reporting this? :)

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Mon, Jan 14, 2013 at 10:55 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:


Hi,

I've seen this mentioned on the ML a few times now with the most recent
one being:


http://search-lucene.com/m/mbT4g1fQPr91/?subj=Solr+4+0+upgrade+reduced+performance

Are there any known, good Solr 3.x vs. Solr 4.x benchmarks?

Thanks,
Otis
--
Solr  ElasticSearch Support
http://sematext.com/








Re: RSS tutorial that comes with the apache-solr not indexing

2013-01-14 Thread Lance Norskog
This example may be out of date if the RSS feeds from Slashdot have
changed. If you know XML and XPaths, try this:
find an RSS feed from somewhere that works, then compare the XPaths in it
vs. the XPaths in the DIH script.
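As a rough sketch of what to compare against, a minimal XPathEntityProcessor
entity for a plain RSS 2.0 feed looks something like this (using the sample
feed mentioned below; adjust forEach and the xpath values to whatever
structure the feed really uses - the Slashdot feed, for example, is RDF-based
rather than /rss/channel/item):

<entity name="feed" processor="XPathEntityProcessor"
        url="http://www.feedforall.com/sample.xml"
        forEach="/rss/channel/item">
  <field column="title" xpath="/rss/channel/item/title"/>
  <field column="link"  xpath="/rss/channel/item/link"/>
</entity>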


On 01/13/2013 07:38 PM, bibhor wrote:

Hi
I am trying to use the RSS tutorial that comes with the apache-solr.
I am not sure if I missed anything but when I do full-import no indexing
happens.
These are the steps that I am taking:

1) Download apache-solr-3.6.2 (http://lucene.apache.org/solr/)
2) Start the solr by doing: java -Dsolr.solr.home=./example-DIH/solr/ -jar
start.jar
3) Goto url:
http://192.168.1.12:8983/solr/rss/dataimport?command=full-import
4) When I do this it says: Indexing completed. Added/Updated: 0 documents.
Deleted 0 documents.

Now I know that the default example is getting the RSS from:
http://rss.slashdot.org/Slashdot/slashdot
This default example is empty when I view it in chrome. It does have XML
data in the source but I am not sure if this has anything to do with the
import failure.
  
I also modified the rss-config so that I can test other RSS sources. I used

http://www.feedforall.com/sample.xml and updated the rss-config.xml but this
did the same and did not Add/Update any documents.
Any help is appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/RSS-tutorial-that-comes-with-the-apache-solr-not-indexing-tp4033067.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Schema Field Names i18n

2013-01-14 Thread Lance Norskog
Will a field have different names in different languages? There is no 
facility for 'aliases' for field name. Erick is right, this sounds like 
you need query and update components to implement this. Also, you might 
try using URL-encoding for the field names. This would save my sanity.


On 01/10/2013 04:56 AM, Erick Erickson wrote:

There's no really easy way that I know of. I've seen several approaches
used though

1 do it in the UI. This assumes that your users aren't typing in raw
queries, they're picking field names from a drop-down or similar. Then the
UI maps the chosen fields into what the schema defines.

2 Do it in the middleware when assembling the query to pass through. Be
careful with the translations though, there always seem to be edge cases.

3 What you're suggesting. Unless you're really fluent in parsers (they
give me indigestion)  I'd think about a query component.

Best
Erick


On Wed, Jan 9, 2013 at 7:36 PM, Daryl Robbins daryl.robb...@mxi.com wrote:


Anyone have experience with internationalizing the field names in the SOLR
schema, so users in different languages can specify fields in their own
language? My first thoughts would be to create a custom search component or
query parser than would convert localized field names back to the English
names in the schema, but I haven't dived in too deep yet. Any input would
be greatly appreciated.

Thanks,

Daryl



__
* This message is intended only for the use of the individual or entity to
which it is addressed, and may contain information that is privileged,
confidential and exempt from disclosure under applicable law. Unless you
are the addressee (or authorized to receive for the addressee), you may not
use, copy or disclose the message or any information contained in the
message. If you have received this message in error, please advise the
sender by reply e-mail, and delete the message, or call +1-613-747-4698. *






Re: Index data from multiple tables into Solr

2013-01-14 Thread Lance Norskog
Try all of the links under the collection name in the lower left-hand
column. There are several administration/monitoring tools you may find useful.


On 01/14/2013 11:45 AM, hassancrowdc wrote:

OK, stats are changing, so the data is indexed. But how can I do a query with
this data, or how can I search it? For example, would the command be
http://localhost:8983/solr/select?q=(any of my field columns from the table)?
Because whatever I am putting in my URL it shows me an XML file but
numFound is always 0.


On Sat, Jan 12, 2013 at 1:24 PM, Alexandre Rafalovitch [via Lucene] 
ml-node+s472066n4032778...@n3.nabble.com wrote:


Have you tried the Admin interface yet? The one on :8983 port if you are
running default setup. That has a bunch of different stats you can look at
apart from a nice way of doing a query. I am assuming you are on Solr 4,
of
course.

Regards,
Alex.

On Fri, Jan 11, 2013 at 5:13 PM, hassancrowdc [hidden 
email]http://user/SendEmail.jtp?type=nodenode=4032778i=0wrote:



So, I followed all the steps and solr is working successfully, Can you
please tell me how i can see if my data is indexed or not? do i have to
enter specific url into my browser or anything. I want to make sure that
the data is indexed.




Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


--
  If you reply to this email, your message will be added to the discussion
below:

http://lucene.472066.n3.nabble.com/Index-data-from-multiple-tables-into-Solr-tp4032266p4032778.html
  To unsubscribe from Index data from multiple tables into Solr, click 
herehttp://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=4032266code=aGFzc2FuY3Jvd2RjYXJlQGdtYWlsLmNvbXw0MDMyMjY2fC00ODMwNzMyOTM=
.
NAMLhttp://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-data-from-multiple-tables-into-Solr-tp4032266p4033268.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: DIH fails after processing roughly 10million records

2013-01-09 Thread Lance Norskog

At this scale, your indexing job is prone to break in various ways.
If you want this to be reliable, it should be able to restart in the 
middle of an upload, rather than starting over.


On 01/08/2013 10:19 PM, vijeshnair wrote:

Yes Shawn, the batchSize is -1 only and I also have the mergeScheduler
exactly same as you mentioned.  When I had this problem in SOLR 3.4, I did
an extensive googling and gathered much of the tweaks and tuning from
different blogs and forums and configured the 4.0 instance. My next full run
is scheduled for this weekend, I will try with a higher mysql wait_timeout
value and update you the outcome.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-fails-after-processing-roughly-10million-records-tp4031508p4031779.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Lance Norskog
Also, searching can be much faster if you put all of the shards, and the
search distributor, on one machine. That way, you search with multiple
simultaneous threads inside one machine. I've seen this make searches
several times faster.
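For illustration, that layout is just an ordinary distributed request in which
every shard URL happens to point at the same host (core names invented):

http://localhost:8983/solr/core0/select?q=ipod&shards=localhost:8983/solr/core0,localhost:8983/solr/core1,localhost:8983/solr/core2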


On 01/03/2013 06:36 AM, Jack Krupansky wrote:
Ah... the multiple shards (of the same collection) in a single node is 
about planning for future expansion of your cluster - create more 
shards than you need today, put more of them on a single node and then 
migrate them to their own nodes as the data outgrows the smaller 
number of nodes. In other words, add nodes incrementally without 
having to reindex all the data.


-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Yes. And its worth to note that when having multiple shards in a 
single node(@deprecated) that they are shards of different collections...


--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise "node" to note that in SolrCloud a node is simply an
instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr,
separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Oops... let me word that a little more carefully:

"...we are replicating the data of each shard."

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

No, a shard is a subset (or slice) of the collection. Sharding is a way of
slicing the original data, before we talk about how the shards get stored
and replicated on actual Solr cores. Replicas are instances of the data for
a shard.

Sometimes people may loosely speak of a replica as being a "shard", but
that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
"Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with
each core being a replica of the other cores within that shard of that
collection."

A collection is sharded, meaning it is distributed across cores. A shard
itself is not distributed across cores in the same sense. Rather, a shard
exists on a single core and is replicated on other cores. Is that right? The
way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real
machine or a virtualized machine. Technically, you could have multiple
virtual nodes on the same physical box. Each Solr replica would be on a
different node.

Technically, you could have multiple Solr instances running on a single
hardware node, each with a different port. They are simply instances of
Solr, although you could consider each Solr instance a node in a Solr cloud
as well, a virtual node. So, technically, you could have multiple replicas
on the same node, but that sort of defeats most of the purpose of having
replicas in the first place - to distribute the data for performance and
fault tolerance. But, you could have replicas of different shards on the
same node/box for a partial improvement of performance and fault tolerance.

A Solr 'cloud' is really a cluster.

-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned
by the solr team and some terms still in use may need to be labeled
deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote: Collection is the more modern
term and incorporates the fact that the
collection

Re: Upgrading from 3.6 to 4.0

2013-01-03 Thread Lance Norskog
Please start new mail threads for new questions. This makes it much 
easier to research old mail threads. Old mail is often the only 
documentation for some problems.

On 01/02/2013 10:04 AM, Benjamin, Roy wrote:

Will the existing 3.6 indexes work with 4.0 binary ?

Will 3.6 solrJ clients work with 4.0 servers ?


Thanks
Roy




What is group.query?

2013-01-03 Thread Lance Norskog

What does group.query do? How is it different from q= and fq= ?

Thanks.




Re: Upgrading from 3.6 to 4.0

2013-01-02 Thread Lance Norskog
Indexes will not work. I have not heard of an index upgrader. If you run 
your 3.6 and new 4.0 Solr at the same time, you can upload all the data 
with a DataImportHandler script using the SolrEntityProcessor.


How large are your indexes? 4.1 indexes will not match 4.0, so you will 
have to upload everything twice. You might want to wait, or use a build 
from the 4.x trunk.


SolrJ client apps should work with 4.0.

On 01/02/2013 10:04 AM, Benjamin, Roy wrote:

Will the existing 3.6 indexes work with 4.0 binary ?

Will 3.6 solrJ clients work with 4.0 servers ?


Thanks
Roy




Re: Viewing the Solr MoinMoin wiki offline

2013-01-01 Thread Lance Norskog

3 problems:
a- he wanted to read it locally.
b- crawling the open web is imperfect.
c- /browse needs to get at the files with the same URL as the uploader.

a and b- Try downloading the whole thing with 'wget'. It has a 'make 
links point to the downloaded files' option. Wget is great.
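The wget invocation would be something along these lines (flags from memory,
so check wget --help):

wget --mirror --convert-links --page-requisites --no-parent http://wiki.apache.org/solr/

--convert-links is the 'make links point to the downloaded files' option.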


I have done this by parking my files behind a web server. You can use 
Tomcat. (I recommend the XAMPP distro: 
http://www.apachefriends.org/en/xampp.html). Then, use Erik's command to 
crawl that server. Use /browse to read it.


Looking at this again, it should be possible to add a file system 
service to the Solr start.jar etc/jetty.xml file. I think I did this 
once. It would be a handy patch. In fact, this whole thing would make a 
great blog post.


On 12/30/2012 05:05 AM, Erik Hatcher wrote:

Here's a geeky way to do it yourself:

Fire up Solr 4.x, run this from example/exampledocs:

java -Ddata=web -Ddelay=2 -Drecursive=1 -jar post.jar 
http://wiki.apache.org/solr/

(although I do end up getting a bunch of 503's, so maybe this isn't very 
reliable yet?)

Tada: http://localhost:8983/solr/collection1/browse

:)

Erik


On Dec 29, 2012, at 16:54 , d_k wrote:


Hello,

I'm setting up Solr inside an intranet without an internet access and
I was wondering if there is a way to obtain the data dump of the Solr
Wiki (http://wiki.apache.org/solr/) for offline viewing and searching.

I understand MoinMoin has an export feature one can use
(http://moinmo.in/MoinDump and
http://moinmo.in/HelpOnMoinCommand/ExportDump) but i'm afraid it needs
to be executed from within the MoinMoin server.

Is there a way to obtain the result of that command?
Is there another way to view the solr wiki offline?




Re: [DIH] Script Transformer: Is there a way to import js file?

2012-12-26 Thread Lance Norskog
Maybe you could write a Javascript snippet that downloads and runs your 
external file?
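
An untested sketch of what that could look like inside the script section of
data-config.xml (the path and the Proj4js check are made up; it assumes the
embedded JavaScript engine can reach java.io, which Rhino normally allows):

function transformRow(row) {
  // read and eval the external library once, the first time a row comes through
  if (typeof Proj4js == 'undefined') {
    var src = new java.util.Scanner(
        new java.io.File('/opt/solr/lib/proj4js.js')).useDelimiter('\\Z').next();
    eval(src);
  }
  // ... call into the library to transform the row here ...
  return row;
}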


On 12/26/2012 09:12 AM, Dyer, James wrote:

I'm not very familiar with using scipting langauges with Java, but having seen the 
DIH code for this, my guess is that all script code needs to be in the script 
/ section of data-config.xml.  So I don't think what you want is possible.  This 
seems like the kind of thing that would be useful if it could support it though.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: zakaria benzidalmal [mailto:zakib...@gmail.com]
Sent: Wednesday, December 26, 2012 11:00 AM
To: solr-user@lucene.apache.org
Subject: [DIH] Script Transformer: Is there a way to import js file?

Hi all,

I am importing some data using DIH, I'd like to use script transformer in
order to perform some transformations before indexing.
As the transformations are a bit complex I am using an external js library.

My question is: Is there a way to import the js library file to my DIH
script?

like: script src=lib/proj4js.js/script

Cordialement.
__
Zakaria BENZIDALMAL
mobile: 06 31 40 04 33





Re: [ANNOUNCE] Apache Solr 3.6.2 released

2012-12-26 Thread Lance Norskog

Cool!

On 12/25/2012 08:03 AM, Robert Muir wrote:

25 December 2012, Apache Solr™ 3.6.2 available

The Lucene PMC and Santa Claus are pleased to announce the release of
Apache Solr 3.6.2.

Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project. Its major features include
powerful full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
distributed search and index replication, and it powers the search and
navigation features of many of the world's largest internet sites.

This release is a bug fix release for version 3.6.1. It contains
numerous bug fixes, optimizations, and improvements, some of which are
highlighted below.  The release is available for immediate download
at: http://lucene.apache.org/solr/mirrors-solr-3x-redir.html (see note
below).

See the CHANGES.txt file included with the release for a full list of details.

Solr 3.6.2 Release Highlights:

  * Fixed ConcurrentModificationException during highlighting, if all
fields were requested.

  * Fixed edismax queryparser to apply minShouldMatch to implicit
boolean queries.

  * Several bugfixes to the DataImportHandler.

  * Bug fixes from Apache Lucene 3.6.2.

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases.  It is possible that the mirror you
are using may not have replicated the release yet.  If that is the
case, please try another mirror.  This also goes for Maven access.

Happy holidays and happy searching,

Lucene/Solr developers




Re: Converting fq params to Filter object

2012-12-26 Thread Lance Norskog
A Solr facet query does a boolean query, caches the Lucene facet data 
structure, and uses it as a Lucene filter. After that until you do a 
full commit, using the same fq=string (you must match the string 
exactly) fetches the cached data structure and uses it again as a Lucene 
filter.


Have you benchmarked the DirectSpellChecker against 
IndexBasedSpellChecker? If you use the fq= filter query as the 
spellcheck.q= query it should use the cached filter.


Also, since you are checking all words against the same filter query, 
can you just do one large OR query with all of the words?
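
Something like this, roughly (field name and words invented):

q=body:(worda OR wordb OR wordc)&fq=category:books

The fq= part is cached after its first use, as long as the string is repeated
exactly.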


On 12/26/2012 03:10 PM, Nalini Kartha wrote:

Hi Otis,

Sorry, let me be more specific.

The end goal is for the DirectSpellChecker to make sure that the
corrections it is returning will return some results taking into account
the fq params included in the original query. This is a follow up question
to another question I had posted earlier -

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

Initially, the way I was thinking of implementing this was to call one of
the SolrIndexSearcher.getDocSet() methods for ever correction, passing in
the correction as the Query and a DocSet created from the fq queries. But I
didn't think that calling a SolrIndexSearcher method in Lucene code
(DirectSpellChecker) was a good idea. So I started looking at which method
on IndexSearcher would accomplish this. That's where I'm stuck trying to
figure out how to convert the fq params into a Filter object.

Does this approach make sense? Also I realize that this implementation is
probably non-performant but wanted to give it a try and measure how it
does. Any advice about what the perf overhead from issuing such queries for
say 50 corrections would be? Note that the filter from the fq params is the
same for every query - would that be cached and help speed things up?

Thanks,
Nalini


On Wed, Dec 26, 2012 at 3:34 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:


Hi,

The fq *is* for filtering.

What is your end goal, what are you trying to achieve?

Otis
Solr  ElasticSearch Support
http://sematext.com/
On Dec 26, 2012 11:22 AM, Nalini Kartha nalinikar...@gmail.com wrote:


Hi,

I'm trying to figure out how to convert the fq params that are being

passed

to Solr into something that can be used to filter the results of a query
that's being issued against the Lucene IndexSearcher (I'm modifying some
Lucene code to issue the query so calling through to one of the
SolrIndexSearcher methods would be ugly).

Looks like one of the IndexSearcher.search(Query query, Filter filter,

...)

  methods would do what I want but I'm wondering if there's any easy way

of

converting the fq params into a Filter? Or is there a better way of doing
all of this?

Thanks,
Nalini





Re: multi field query with selective results

2012-12-23 Thread Lance Norskog
A thousand pardons! Thunderbird displayed your email as a hijack. Now, 
it does not. I really wish everyone's code could be free of bugs, like 
my code is :)


On 12/23/2012 01:38 AM, J Mohamed Zahoor wrote:

I don't think I hijacked any thread.  it is a new thread. Can you please
enlighten me?

On Sunday, December 23, 2012, Lance Norskog wrote:


Please start a new thread.

Thanks!

On 12/22/2012 11:03 AM, J Mohamed Zahoor wrote:


Hi

I have a word completion requirement where i need to pick result from two
indexed fields.
The trick is i need to pick top 5 results from each field and display as
suggestions.

If i set fq as field1:XXX AND field2:XXX, the top result comes entirely
from field1 matches.
Is there any other way to get top 5 from field 1 matches and top 5 from
field 2 matched results?

./Zahoor







Re: multi field query with selective results

2012-12-22 Thread Lance Norskog

Please start a new thread.

Thanks!

On 12/22/2012 11:03 AM, J Mohamed Zahoor wrote:

Hi

I have a word completion requirement where i need to pick result from two 
indexed fields.
The trick is i need to pick top 5 results from each field and display as 
suggestions.

If i set fq as field1:XXX AND field2:XXX, the top result comes entirely from 
field1 matches.
Is there any other way to get top 5 from field 1 matches and top 5 from field 2 
matched results?

./Zahoor




Re: Finding the last committed record in SOLR 4

2012-12-21 Thread Lance Norskog
The only sure way to get the last searchable document is to use a 
timestamp or sequence number in the document. I do not think that using 
a timestamp with default=NOW will give a unique timestamp, so you need 
your own sequence number.
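
A minimal sketch (the field name 'seq' is invented; the client has to assign a
strictly increasing value at index time, and the schema needs a matching long
field type):

<field name="seq" type="long" indexed="true" stored="true" required="true"/>

Then the last searchable document is simply:

/select?q=*:*&fl=id,seq&rows=1&sort=seq desc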


On 12/19/2012 10:17 PM, Joe wrote:

I'm using SOLR 4 for an application, where I need to search the index soon
after inserting records.

I'm using the solrj code below to get the last ID in the index. However, I
noticed that the last id I see when I execute a query through the solr web
admin is often lagging behind this. And that my searches are not including
all documents up until the last ID I get from the code snippet below. I'm
guessing this is because of delays in hard commits. I don't need to switch
to soft commits yet. I just want to make sure that I get the ID of the last
searchable document. Is this possible to do?


 SolrQuery query = new SolrQuery();
 query.set(qt,/select);
 query.setQuery( *:* );
 query.setFields(id);
 query.set(rows,1);
 query.set(sort,id desc);

 QueryResponse rsp = m_Server.query( query );
 SolrDocumentList docs = rsp.getResults();
 SolrDocument doc = docs.get(0);
 long id = (Long) doc.getFieldValue(id);




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Finding-the-last-committed-record-in-SOLR-4-tp4028235.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Pause and resume indexing on SolR 4 for backups

2012-12-20 Thread Lance Norskog
To be clear: 1) is fine. Lucene index updates are carefully sequenced so 
that the index is never in a bogus state. All data files are written and 
flushed to disk, then the segments.* files are written that match the 
data files. You can capture the files with a set of hard links to create 
a backup.
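
For example, on Linux something like this (untested; GNU cp -l makes hard
links, and the index path depends on your layout):

cp -rl /var/solr/collection1/data/index /backups/index.$(date +%Y%m%d%H%M)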


The CheckIndex program will verify the index backup.
java -cp yourcopy/lucene-core-SOMETHING.jar 
org.apache.lucene.index.CheckIndex collection/data/index


lucene-core-SOMETHING.jar is usually in the solr-webapp directory where 
Solr is unpacked.


On 12/20/2012 02:16 AM, Andy D'Arcy Jewell wrote:

Hi all.

Can anyone advise me of a way to pause and resume SolR 4 so I can 
perform a backup? I need to be able to revert to a usable (though not 
necessarily complete) index after a crash or other disaster more 
quickly than a re-index operation would yield.


I can't yet afford the extravagance of a separate SolR replica just 
for backups, and I'm not sure if I'll ever have the luxury. I'm 
currently running with just one node, be we are not yet live.


I can think of the following ways to do this, each with various 
downsides:


1) Just backup the existing index files whilst indexing continues
+ Easy
+ Fast
- Incomplete
- Potential for corruption? (e.g. partial files)

2) Stop/Start Tomcat
+ Easy
- Very slow and I/O, CPU intensive
- Client gets errors when trying to connect

3) Block/unblock SolR port with IpTables
+ Fast
- Client gets errors when trying to connect
- Have to wait for existing transactions to complete (not sure 
how, maybe watch socket FD's in /proc)


4) Pause/Restart SolR service
+ Fast ? (hopefully)
- Client gets errors when trying to connect

In any event, the web app will have to gracefully handle 
unavailability of SolR, probably by displaying a down for 
maintenance message, but this should preferably be only a very short 
amount of time.


Can anyone comment on my proposed solutions above, or provide any 
additional ones?


Thanks for any input you can provide!

-Andy





Re: optimun precisionStep for DAY granularity in a TrieDateField

2012-12-14 Thread Lance Norskog
Do you use rounding in your dates? You can index a date rounded to the 
nearest minute, N minutes, hour or day. This way a range query has to 
look at such a small number of terms that you may not need to tune the 
precision step. Hunt for NOW/DAY or 5DAYS in the queries.
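
For example (the field name is invented): index the value already rounded,
e.g. 2012-12-14T00:00:00Z, and keep both ends of the range on day boundaries
with date math:

fq=mydate:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]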


http://wiki.apache.org/solr/SimpleFacetParameters

On 12/14/2012 10:16 AM, jmlucjav wrote:

Hi

I have a TrieDateField in my index, where I will index dates (range
2000-2020). I am only interested in the DAY granularity, that is , I dont
care about time (I'll index all based on the same Timezone).

Is there an optimun value for precisionStep that I can use so I don't index
info I will not ever use?? I have looked but have not found some info on
what values of precisionStep map to year/month/../day/hour... (not sure if
the mapping is straightforward anyway).

thanks for the help.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/optimun-precisionStep-for-DAY-granularity-in-a-TrieDateField-tp4027078.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Modeling openinghours using multipoints

2012-12-10 Thread Lance Norskog
Bit maps can be done with a separate term for each bit. You search for 
all of the terms in the bit range you want.
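
A small sketch of that idea (the multi-valued int field 'open_hours' is
invented): index one value per hour the business is open, then require every
hour of the wanted window:

fq=open_hours:(18 AND 19 AND 20)

which only matches businesses open for the whole 18:00-21:00 stretch.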


On 12/10/2012 06:34 AM, David Smiley (@MITRE.org) wrote:

Maybe it would? I don't completely get your drift.  But you're talking about a 
user writing a bunch of custom code to build, save, and query the bitmap 
whereas working on top of existing functionality seems to me a lot more 
maintainable on the user's part.
~ David


From: Lance Norskog-2 [via Lucene] [ml-node+s472066n4025579...@n3.nabble.com]
Sent: Sunday, December 09, 2012 6:35 PM
To: Smiley, David W.
Subject: Re: Modeling openinghours using multipoints

If these are not raw times, but quantized on-the-hour, would it be
faster to create a bit map of hours and then query across the bit
maps?

On Sun, Dec 9, 2012 at 8:06 AM, Erick Erickson [hidden 
email]UrlBlockedError.aspx wrote:


Thanks for the discussion, I've added this to my bag of tricks, way cool!

Erick


On Sat, Dec 8, 2012 at 10:52 PM, britske [hidden email]UrlBlockedError.aspx 
wrote:


Brilliant! Got some great ideas for this. Indeed all sorts of usecases
which use multiple temporal ranges could benefit..

Eg: Another Guy on stackoverflow asked me about this some days ago.. He
wants to model multiple temporary offers per product (free shopping for
christmas, 20% discount for Black friday , etc) .. All possible with this
out of the box. Factor in 'offer category' in  x and y as well for some
extra powerfull querying.

Yup im enthousiastic about it , which im sure you can tell :)

Thanks a lot David,

Cheers,
Geert-Jan



Sent from my iPhone

On 9 dec. 2012, at 05:35, David Smiley (@MITRE.org) [via Lucene] 
[hidden email]UrlBlockedError.aspx wrote:


britske wrote
That's seriously awesome!

Some change in the query though:
You described: To query for a business that is open during at least some
part of a given time duration
I want To query for a business that is open during at least the entire
given time duration.

Feels like a small difference but probably isn't (I'm still wrapping my
head on the intersect query I must admit)
So this would be a slightly different rectangle query.  Interestingly,

you simply swap the location in the rectangle where you put the start and
end time.  In summary:

Indexed span CONTAINS query span:
minX minY maxX maxY - 0 end start *

Indexed span INTERSECTS (i.e. OVERLAPS) query span:
minX minY maxX maxY - 0 start end *

Indexed span WITHIN query span:
minX minY maxX maxY - start 0 * end

I'm using '*' here to denote the max possible value.  At some point I

may add that as a feature.

That was a fun exercise!  I give you credit in prodding me in this

direction as I'm not sure if this use of spatial would have occurred to me
otherwise.

britske wrote
Moreover, any indication on performance? Should, say, 50.000 docs with
about 100-200 points each (1 a 2 open-close spans per day) be ok? ( I

know

'your mileage may very' etc. but just a guestimate :)
You should have absolutely no problem.  The real clincher in your favor

is the fact that you only need 9600 discrete time values (so you said), not
Long.MAX_VALUE.  Using Long.MAX_VALUE would simply not be possible with the
current implementation because it's using Doubles which has 52 bits of
precision not the 64 that would be required to be a complete substitute for
any time/date.  Even given the 52 bits, a quad SpatialPrefixTree with
maxLevels=52 would probably not perform well or might fail; not sure.
  Eventually when I have time to work on an implementation that can be based
on a configurable number of grid cells (not unlike how you can configure
precisionStep on the Trie numeric fields), 52 should be no problem.

I'll have to remember to refer back to this email on the approach if I

create a field type that wraps this functionality.

~ David

britske wrote
Again, this looks good!
Geert-Jan

2012/12/8 David Smiley (@MITRE.org) [via Lucene] 
[hidden email]


Hello again Geert-Jan!

What you're trying to do is indeed possible with Solr 4 out of the box.
  Other terminology people use for this is multi-value time duration.

  This

creative solution is a pure application of spatial without the

geospatial

notion -- we're not using an earth or other sphere model -- it's a flat
plane.  So no need to make reference to longitude  latitude, it's x 

y.

I would put opening time into x, and closing time into y.  To express a
point, use x y (x space y), and supply this as a string to your
SpatialRecursivePrefixTreeFieldType based field for indexing.  You can

give

it multiple values and it will work correctly; this is one of RPT's

main

features that set it apart from Solr 3 spatial.  To query for a

business

that is open during at least some part of a given time duration, say

6-8

o'clock, the query would look like openDuration:Intersects(minX minY

maxX

maxY)  and put 0 or minX (always), 6 for minY (start time), 8 for maxX
(end time), and the largest

Re: Modeling openinghours using multipoints

2012-12-09 Thread Lance Norskog
://www.packtpub.com/apache-solr-3-enterprise-search-server/book
 
 




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Modeling-openinghours-using-multipoints-tp4025336p4025454.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Downloading files from the solr replication Handler

2012-11-29 Thread Lance Norskog
Maybe these are text encoding markers?

- Original Message -
| From: Eva Lacy e...@lacy.ie
| To: solr-user@lucene.apache.org
| Sent: Thursday, November 29, 2012 3:53:07 AM
| Subject: Re: Downloading files from the solr replication Handler
| 
| I tried downloading them with my browser and also with a c#
| WebRequest.
| If I skip the first and last 4 bytes it seems work fine.
| 
| 
| On Thu, Nov 29, 2012 at 2:28 AM, Erick Erickson
| erickerick...@gmail.comwrote:
| 
|  How are you downloading them? I suspect the issue is
|  with the download process rather than Solr, but I'm just guessing.
| 
|  Best
|  Erick
| 
| 
|  On Wed, Nov 28, 2012 at 12:19 PM, Eva Lacy e...@lacy.ie wrote:
| 
|   Just to add to that, I'm using solr 3.6.1
|  
|  
|   On Wed, Nov 28, 2012 at 5:18 PM, Eva Lacy e...@lacy.ie wrote:
|  
|I downloaded some configuration and data files directly from
|solr in an
|attempt to develop a backup solution.
|I noticed there is some characters at the start and end of the
|file
|  that
|aren't in configuration files, I notice the same characters at
|the
|  start
|and end of the data files.
|Anyone with any idea how I can download these files without the
|extra
|characters or predict how many there are going to be so I can
|skip
|  them?
|   
|  
| 
| 


Re: User context based search in apache solr

2012-11-24 Thread Lance Norskog
sagarzond- you are trying to embed a recommendation system into search. 
Recommendations are inherently a matrix problem, where Solr and other search 
engines are one-dimensional databases. What you have is a sparse user-product 
matrix. This book has a good explanation of recommender systems:

Mahout In Action
http://manning.com/owen/



- Original Message -
| From: Otis Gospodnetic otis.gospodne...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, November 24, 2012 5:05:53 PM
| Subject: Re: User context based search in apache solr
| 
| Hi,
| 
| I don't have a full picture here, but why not just have userID =
| {list of
| clicked product IDs} stored somewhere (in memory, disk, DB...) and
| then, at
| search time, retrieve last N product IDs, run MLT query on those IDs,
| and
| then do whatever you desire to do... either take top N of those hits
| and
| slap them on top of regular results, or take top N of those and boost
| them
| in the main results, or ...  if you are into this, you may find
| http://sematext.com/search-analytics/index.html very useful, or at
| least
| interesting.
| 
| Otis
| --
| SOLR Performance Monitoring - http://sematext.com/spm/index.html
| 
| 
| 
| 
| On Fri, Nov 23, 2012 at 12:56 AM, sagarzond sagarz...@gmail.com
| wrote:
| 
|  In our application we are providing product master data search with
|  SOLR.
|  Now
|  our requirement want to provide user context based search(means we
|  are
|  providing top search result using user history).
| 
|  For that i have created one score table having following field
| 
|  1)product_id
| 
|  2)user_id
| 
|  3)score_value
| 
|  As soon as user clicked for any product that will create entry in
|  this
|  table
|  and also increase score_value if already present product for that
|  user. We
|  are planning to use boost field and eDisMax from SOLR to improve
|  search
|  result but for this i have to use one to many mapping between score
|  and
|  product table(Because we are having one product with different
|  score value
|  for different user) and solr not providing one to many mapping.
| 
|  We can solved this issue (one to many mapping handling) by
|  de-normalizing
|  structure as having multiple product entry with different score
|  value for
|  different user but it result huge amount of redundant data.
| 
|  Is this(de-normalized structure) currect way to handle or is there
|  any
|  other
|  way to handle such context based search.
| 
|  Plz help me
| 
| 
| 
|  --
|  View this message in context:
|  
http://lucene.472066.n3.nabble.com/User-context-based-search-in-apache-solr-tp4021964.html
|  Sent from the Solr - User mailing list archive at Nabble.com.
| 
| 


Re: configuring solr xml as a datasource

2012-11-24 Thread Lance Norskog
You don't need the transformers. 

I think the paths should be what is in the XML file. 
forEach=/add

And the paths need to use the syntax for name=fname and name=number. I 
think this is it, but you should make sure.

xpath=/add/doc/field[@name='fname']
xpath=/add/doc/field[@name='number']

Look at the end of this section:
http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
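
Putting that together, the entity might look roughly like this (untested
sketch based on the config quoted below):

<entity name="page"
        processor="XPathEntityProcessor"
        stream="true"
        forEach="/add"
        url="C:\solr\conf\test.xml">
  <field column="fname"  xpath="/add/doc/field[@name='fname']" />
  <field column="number" xpath="/add/doc/field[@name='number']" />
</entity>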

- Original Message -
| From: Leena Jawale leena.jaw...@lntinfotech.com
| To: solr-user@lucene.apache.org
| Sent: Monday, November 19, 2012 10:02:07 PM
| Subject: configuring solr xml as a datasource
| 
| Hi,
| 
| I am new to solr. I am trying to use solr xml data source for solr
| search engine.
| I have created test.xml file as
| -
| add
| doc
| field name=fnameleena1/field
| field name=number101/field
| /doc
| /add
| 
| I have created data-config.xml file
| 
| dataConfig
| dataSource type=FileDataSource encoding=UTF-8 /
| document
| entity name=page
| processor=XPathEntityProcessor
| stream=true
| forEach=/rootelement
| url=C:\solr\conf\test.xml
| transformer=RegexTransformer,DateFormatTransformer
| field column=namexpath=/rootelement/name /
| field column=number xpath=/rootelement/number /
| 
|/entity
| /document
| /dataConfig
| 
| And added below code in solrconfig.xml :
| requestHandler name=/dataimport
| class=org.apache.solr.handler.dataimport.DataImportHandler
| lst name=defaults
|   str name=configC:\solr\conf\data-config.xml/str
|   /lst
|   /requestHandler
| 
| But when I go to this link
|  http://localhost:8080/solr/dataimport?command=full-import
| Its showing Total Rows Fetched=0 , Total Documents Processed=0.
| How can I solve this problem? Please provide me the solution.
| 
| 
| Thanks  Regards,
| Leena Jawale
| Software Engineer Trainee
| BFS BU
| Phone No. - 9762658130
| Email -
| leena.jaw...@lntinfotech.commailto:leena.jaw...@lntinfotech.com
| 
| 
| 
| The contents of this e-mail and any attachment(s) may contain
| confidential or privileged information for the intended
| recipient(s). Unintended recipients are prohibited from taking
| action on the basis of information in this e-mail and using or
| disseminating the information, and must notify the sender and delete
| it from their system. LT Infotech will not accept responsibility or
| liability for the accuracy or completeness of, or the presence of
| any virus or disabling code in this e-mail
| 


Re: User context based search in apache solr

2012-11-24 Thread Lance Norskog
Right, he has talked about this in various ways. But the key is to take the 
user-item matrix in full and generate a new data model for recommendation. 
These approaches shove that datamodel into the search index. It is a batch 
process.

LucidWorks does this for search clicks.

- Original Message -
| From: Otis Gospodnetic otis.gospodne...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, November 24, 2012 7:39:04 PM
| Subject: Re: User context based search in apache solr
| 
| On the other hand, people have successfully built recommendation
| engines on
| top of Lucene or Solr before, and I think Ted Dunning just mentioned
| this
| over on the Mahout ML a few weeks ago. have a look at
| http://search-lucene.com/m/dbxtb1ykRkM though I think I recall a
| separate
| recent email where he was a bit more explicit about this.
| 
| Otis
| --
| SOLR Performance Monitoring - http://sematext.com/spm/index.html
| Search Analytics - http://sematext.com/search-analytics/index.html
| 
| 
| 
| 
| On Sat, Nov 24, 2012 at 9:30 PM, Lance Norskog goks...@gmail.com
| wrote:
| 
|  sagarzond- you are trying to embed a recommendation system into
|  search.
|  Recommendations are inherently a matrix problem, where Solr and
|  other
|  search engines are one-dimensional databases. What you have is a
|  sparse
|  user-product matrix. This book has a good explanation of
|  recommender
|  systems:
| 
|  Mahout In Action
|  http://manning.com/owen/
| 
| 
| 
|  - Original Message -
|  | From: Otis Gospodnetic otis.gospodne...@gmail.com
|  | To: solr-user@lucene.apache.org
|  | Sent: Saturday, November 24, 2012 5:05:53 PM
|  | Subject: Re: User context based search in apache solr
|  |
|  | Hi,
|  |
|  | I don't have a full picture here, but why not just have userID =
|  | {list of
|  | clicked product IDs} stored somewhere (in memory, disk, DB...)
|  | and
|  | then, at
|  | search time, retrieve last N product IDs, run MLT query on those
|  | IDs,
|  | and
|  | then do whatever you desire to do... either take top N of those
|  | hits
|  | and
|  | slap them on top of regular results, or take top N of those and
|  | boost
|  | them
|  | in the main results, or ...  if you are into this, you may find
|  | http://sematext.com/search-analytics/index.html very useful, or
|  | at
|  | least
|  | interesting.
|  |
|  | Otis
|  | --
|  | SOLR Performance Monitoring - http://sematext.com/spm/index.html
|  |
|  |
|  |
|  |
|  | On Fri, Nov 23, 2012 at 12:56 AM, sagarzond sagarz...@gmail.com
|  | wrote:
|  |
|  |  In our application we are providing product master data search
|  |  with
|  |  SOLR.
|  |  Now
|  |  our requirement want to provide user context based search(means
|  |  we
|  |  are
|  |  providing top search result using user history).
|  | 
|  |  For that i have created one score table having following field
|  | 
|  |  1)product_id
|  | 
|  |  2)user_id
|  | 
|  |  3)score_value
|  | 
|  |  As soon as user clicked for any product that will create entry
|  |  in
|  |  this
|  |  table
|  |  and also increase score_value if already present product for
|  |  that
|  |  user. We
|  |  are planning to use boost field and eDisMax from SOLR to
|  |  improve
|  |  search
|  |  result but for this i have to use one to many mapping between
|  |  score
|  |  and
|  |  product table(Because we are having one product with different
|  |  score value
|  |  for different user) and solr not providing one to many mapping.
|  | 
|  |  We can solved this issue (one to many mapping handling) by
|  |  de-normalizing
|  |  structure as having multiple product entry with different score
|  |  value for
|  |  different user but it result huge amount of redundant data.
|  | 
|  |  Is this(de-normalized structure) currect way to handle or is
|  |  there
|  |  any
|  |  other
|  |  way to handle such context based search.
|  | 
|  |  Plz help me
|  | 
|  | 
|  | 
|  |  --
|  |  View this message in context:
|  | 
|  
http://lucene.472066.n3.nabble.com/User-context-based-search-in-apache-solr-tp4021964.html
|  |  Sent from the Solr - User mailing list archive at Nabble.com.
|  | 
|  |
| 
| 


Re: Solr Delta Import Handler not working

2012-11-19 Thread Lance Norskog
|  dataSource=null

I think this should not be here. The datasource should default to the 
dataSource listing. And 'rootEntity=true' should be in the 
XPathEntityProcessor block, because you are adding each file as one document.
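
Roughly, with both changes applied to the config quoted below (untested):

<entity name="document"
        processor="FileListEntityProcessor"
        baseDir="/var/lib/employ"
        fileName="^.*\.xml$"
        recursive="false">
  <entity processor="XPathEntityProcessor"
          url="${document.fileAbsolutePath}"
          useSolrAddSchema="true"
          stream="true"
          rootEntity="true"/>
</entity>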

- Original Message -
| From: Spadez james_will...@hotmail.com
| To: solr-user@lucene.apache.org
| Sent: Sunday, November 18, 2012 7:34:34 AM
| Subject: Re: Solr Delta Import Handler not working
| 
| Update! Thank you to Lance for the help. Based on your suggestion I
| have
| fixed up a few things.
| 
| *My Dataconfig now has the filename pattern fixed and root
| entity=true*
| /dataConfig
|   dataSource type=FileDataSource /
|   document
| entity
|   name=document
|   processor=FileListEntityProcessor
|   baseDir=/var/lib/employ
|   fileName=^.*\.xml$
|   recursive=false
|   rootEntity=true
|   dataSource=null
|   entity
| processor=XPathEntityProcessor
| url=${document.fileAbsolutePath}
| useSolrAddSchema=true
| stream=true
|   /entity
| /entity
|   /document
| /dataConfig/
| 
| *My data.xml has a corrected date format with T:*
| /add
| doc
| field name=id123/field
|   field name=titleDelta Import 2/field
| field name=descriptionThis is my long description/field
|   field name=truncated_descriptionThis is/field
| 
| field name=companyGoogle/field
| field name=location_nameEngland/field
| field name=date2007-12-31T22:29:59/field
| field name=sourceGoogle/field
| field name=urlwww.google.com/field
| field name=latlng45.17614,45.17614/field
| /doc
| /add/
| 
| 
| 
| --
| View this message in context:
| 
http://lucene.472066.n3.nabble.com/Solr-Delta-Import-Handler-not-working-tp4020897p4020925.html
| Sent from the Solr - User mailing list archive at Nabble.com.
| 


Re: Solr Delta Import Handler not working

2012-11-17 Thread Lance Norskog
I think this means the pattern did not match any files:
<str name="Total Rows Fetched">0</str>

The wiki example includes a '^' at the beginning of the filename pattern. This 
matches a complete line. 
http://wiki.apache.org/solr/DataImportHandler#Transformers_Example

More:
Add rootEntity=true. It cannot hurt to be explicit.

The date format needs a 'T' instead of a space:
http://en.wikipedia.org/wiki/ISO_8601
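
For example, the date field below would become something like this (assuming
the usual Solr requirement of UTC with a trailing 'Z'):

<field name="date">2007-12-31T22:29:59Z</field>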

Cheers!

- Original Message -
| From: Spadez james_will...@hotmail.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, November 17, 2012 2:49:30 PM
| Subject: Solr Delta Import Handler not working
| 
| Hi,
| 
| These are the exact steps that I have taken to try and get delta
| import
| handler working. If I can provide any more information to help let me
| know.
| I have literally spent the entire friday night and today on this and
| I throw
| in the towel. Where have I gone wrong?
| 
| *Added this line to the solrconfig:*
| /requestHandler name=/dataimport
| class=org.apache.solr.handler.dataimport.DataImportHandler
| lst name=defaults
|   str name=config/home/solr/data-config.xml/str
| /lst
|   /requestHandler/
| 
| *Then my data-config.xml looks like this:*
| /dataConfig
|   dataSource type=FileDataSource /
|   document
| entity
|   name=document
|   processor=FileListEntityProcessor
|   baseDir=/var/lib/data
|   fileName=.*.xml$
|   recursive=false
|   rootEntity=false
|   dataSource=null
|   entity
| processor=XPathEntityProcessor
| url=${document.fileAbsolutePath}
| useSolrAddSchema=true
| stream=true
|   /entity
| /entity
|   /document
| /dataConfig/
| 
| *Then in my var/lib/data folder I have a data.xml file that looks
| like
| this:*
| /add
| doc
|   field name=id123/field
|   field name=descriptionThis is my long description/field
|   field name=companyGoogle/field
|   field name=location_nameEngland/field
|   field name=date2007-12-31 22:29:59/field
|   field name=sourceGoogle/field
|   field name=urlwww.google.com/field
|   field name=latlng45.17614,45.17614/field
| /doc
| /add/
| 
| *Finally I then ran this command:*
| /http://localhost:8080/solr/dataimport?command=delta-import&clean=false/
| 
| *And I get this result (failed):*
| /response
| lst name=responseHeader
| int name=status0/int
| int name=QTime1/int
| /lst
| lst name=initArgs
| lst name=defaults
| str name=config/opt/solr/example/solr/conf/data-config.xml/str
| /lst
| /lst
| str name=commanddelta-import/str
| str name=statusidle/str
| str name=importResponse/
| lst name=statusMessages
| str name=Time Elapsed0:15:9.543/str
| str name=Total Requests made to DataSource0/str
| str name=Total Rows Fetched0/str
| str name=Total Documents Processed0/str
| str name=Total Documents Skipped0/str
| str name=Delta Dump started2012-11-17 17:32:56/str
| str name=Identifying Delta2012-11-17 17:32:56/str
| str name=*Indexing failed*. Rolled back all changes./str
| str name=Rolledback2012-11-17 17:32:56/str
| /lst
| str name=WARNING
| This response format is experimental. It is likely to change in the
| future.
| /str
| /response/
| 
| 
| 
| 
| 
| --
| View this message in context:
| 
http://lucene.472066.n3.nabble.com/Solr-Delta-Import-Handler-not-working-tp4020897.html
| Sent from the Solr - User mailing list archive at Nabble.com.
| 


Re: More references for configuring Solr

2012-11-11 Thread Lance Norskog
LucidFind collects several sources of information in one searchable archive:

http://find.searchhub.org/?q=sort=#%2Fp%3Asolr

- Original Message -
| From: Dmitry Kan dmitry@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Sunday, November 11, 2012 2:24:21 AM
| Subject: Re: More references for configuring Solr
| 
| Hi,
| 
| here are some resources:
| http://wiki.apache.org/solr/ (Solr wiki)
| http://lucene.apache.org/solr/books.html (books published on Solr)
| 
| the goes googling on a specific topic. But before reading a book
| might not
| be a bad idea..
| 
| -- Dmitry
| 
| On Sat, Nov 10, 2012 at 1:15 PM, FARAHZADI, EMAD
| emad.farahz...@netapp.comwrote:
| 
|   Dear Sir or Madam,
| 
|  ** **
| 
|  I want to use to Solr for my final project in university in part of
|  searching and indexing.
| 
|  I’d be appreciated if you send me more resources or documentations
|  about
|  Solr.
| 
|  ** **
| 
|  Regards
| 
|  ** **
| 
|  ** **
| 
| 
|  *
| 
|  Emad Farahzadi [image: brand-site-home-telescope-160x95]*
| 
|  *Professional Services Consultant*
| 
|  *NetApp Middle-East*
| 
|  *
|  **Office: +971 4   4466203
|  Cell:+971 50 9197237*
| 
|  ** **
| 
|  *NetApp MEA (Middle East   Africa)
|  Office No. 214
|  Building 2, 2nd Floor
|  Dubai Internet City
|  P.O. Box 500199
|  Dubai, U.A.E. *
| 
|   [image: netapp-cloud-esig-dollar]
| 
|  ** **
| 
| 
| 
| 
| --
| Regards,
| 
| Dmitry Kan
| 


Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

2012-11-07 Thread Lance Norskog
You can debug this with the 'Analysis' page in the Solr UI. You pick 
'text_general' and then give words with umlauts in the text box for indexing 
and queries.

Lance

- Original Message -
| From: Daniel Brügge daniel.brue...@googlemail.com
| To: solr-user@lucene.apache.org
| Sent: Wednesday, November 7, 2012 8:45:45 AM
| Subject: SolrCloud, Zookeeper and Stopwords with Umlaute or other special 
characters
| 
| Hi,
| 
| i am running a SolrCloud cluster with the 4.0.0 version. I have a
| stopwords
| file
| which is in the correct encoding. It contains german Umlaute like
| e.g. 'ü'.
| I am
| also running a standalone Zookeeper which contains this stopwords
| file. In
| my schema
| i am using the stopwords file in the standard way:
| 
| 
|  <fieldType name="text_general" class="solr.TextField"
|             positionIncrementGap="100">
|    <analyzer type="index">
|      <tokenizer class="solr.StandardTokenizerFactory"/>
|      <filter class="solr.StopFilterFactory"
|              ignoreCase="true"
|              words="my_stopwords.txt"
|              enablePositionIncrements="true" />
| 
| 
| When I am indexing i recognized, that all stopwords without Umlaute
| are
| correctly removed, but the ones with
| Umlaute still exist.
| 
| Is this a problem with ZK or Solr?
| 
| Thanks  regards
| 
| Daniel
| 


Re: Where to get more documents or references about sold cloud?

2012-11-06 Thread Lance Norskog
LucidFind is a searchable archive of Solr documentation and email lists:

http://find.searchhub.org/?q=solrcloud

- Original Message -
| From: Jack Krupansky j...@basetechnology.com
| To: solr-user@lucene.apache.org
| Sent: Monday, November 5, 2012 4:44:46 AM
| Subject: Re: Where to get more documents or references about sold cloud?
| 
| Is most of the Web blocked in your location? When I Google
| SolrCloud,
| Google says that there are About 61,400 results with LOTS of
| informative
| links, including blogs, videos, slideshares, etc. just on the first
| two
| pages pf search results alone.
| 
| If you have specific questions, please ask them with specific detail,
| but
| try reading a few of the many sources of information available on the
| Web
| first.
| 
| -- Jack Krupansky
| 
| -Original Message-
| From: SuoNayi
| Sent: Monday, November 05, 2012 3:32 AM
| To: solr-user@lucene.apache.org
| Subject: Where to get more documents or references about sold cloud?
| 
| Hi all, there is only one entry about solr cloud on the
| wiki,http://wiki.apache.org/solr/SolrCloud.
| I have googled a lot and found no more details about solr cloud, or
| maybe I
| miss something?
| 
| 


Re: Does SolrCloud supports MoreLikeThis?

2012-11-06 Thread Lance Norskog
The question you meant to ask is: Does MoreLikeThis support Distributed 
Search? and the answer apparently is no. This is the issue to get it working:

https://issues.apache.org/jira/browse/SOLR-788

(Distributed Search is independent of SolrCloud.) If you want to make unit 
tests, that would really help- they won't work now but they will make it easier 
for someone to get the patch working again. Also, the patch will not get 
committed without unit tests.

Lance

- Original Message -
| From: Luis Cappa Banda luisca...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Monday, November 5, 2012 7:54:59 AM
| Subject: Re: Does SolrCloud supports MoreLikeThis?
| 
| Thanks for the answer, Darren! I still have the hope that MLT is
| supported
| in the current version. An important feature of the product that I´m
| developing depends on that, and even if I can emulate MLT with a
| Dismax or
| E-dismax component, the thing is that MLT fits and works perfectly...
| 
| Regards,
| 
| Luis Cappa.
| 
| 
| 2012/11/5 Darren Govoni dar...@ontrenet.com
| 
|  There is a ticket for that with some recent activity (sorry I don't
|  have
|  it handy right now), but I'm not sure if that work made it into the
|  trunk,
|  so probably solrcloud does not support MLT...yet. Would love an
|  update from
|  the dev team though!
| 
|  --- Original Message ---
|  On 11/5/2012 10:37 AM Luis Cappa Banda wrote:
|  That´s the question, :-)
|
|  Regards,
|
|  Luis Cappa.
| 
| 


Re: After adding field to schema, the field is not being returned in results.

2012-11-02 Thread Lance Norskog
If any value is in a bogus format, the entire document batch in that HTTP 
request fails. That is the right timestamp format.
The index may be corrupted somehow. Can you try removing all of the files in 
data/ and trying again?

- Original Message -
| From: Erick Erickson erickerick...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Friday, November 2, 2012 7:32:40 AM
| Subject: Re: After adding field to schema, the field is not being returned in 
results.
| 
| Well, I'm at my wits end. I tried your field definitions (using the
| exampledocs XML) and they work just fine. As far as if you mess up
| the date
| on the way in, you should be seeing stack traces in your log files.
| 
| The only way I see not getting the Sorry, no Term Info available :(
| message is if you don't have any values in the field. So, my guess is
| that
| you're not getting the format right and the docs aren't getting
| indexed,
| but that's just a guess. You can freely sort even if there are no
| values at
| all in a particular field. This can be indicated if you sort asc and
| desc
| and the order doesn't change. It just means the field is defined in
| the
| schema, not necessarily that there are any values in it.
| 
| So, I claim you have no date values in your index. The fact that you
| can
| sort is just an artifact of sortMissingFirst/Last doing something
| sensible.
| 
| Next question, are you absolutely sure that your indexing program and
| your
| searching program are pointing at the same server?
| 
| So what I'd do next is
| 1 create a simple XML doc that conforms to your schema and use the
| post.jar tool to send it to your server. Watch the output log for any
| date
| format exceptions.
| 2 Use the admin UI to insure that you can see terms in docs added
| this way.
| 3 from there back up and see what step in the indexing process isn't
| working (assuming that's the problem). Solr logs help here.
| 
| Note I'm completely PHP-ignorant, I have no clue whether the
| formatting
| you're doing is OK or not. You might try logging the value somewhere
| in
| your php so you an post that and/or include it in your sample XML
| file...
| 
| Best
| Erick
| 
| 
| On Fri, Nov 2, 2012 at 10:02 AM, Dotan Cohen dotanco...@gmail.com
| wrote:
| 
|  On Thu, Nov 1, 2012 at 9:28 PM, Lance Norskog goks...@gmail.com
|  wrote:
|   Have you uploaded data with that field populated? Solr is not
|   like a
|  relational database. It does not automatically populate a new field
|  when
|  you add it to the schema. If you sort on a field, a document with
|  no data
|  in that field comes first or last (I don't know which).
|  
| 
|  Thank you. In fact, I am being careful to try to pull up records
|  after
|  the date in which the application was updated to populate the
|  field.
| 
| 
|  --
|  Dotan Cohen
| 
|  http://gibberish.co.il
|  http://what-is-what.com
| 
| 


Re: After adding field to schema, the field is not being returned in results.

2012-11-01 Thread Lance Norskog
Have you uploaded data with that field populated? Solr is not like a relational 
database. It does not automatically populate a new field when you add it to the 
schema. If you sort on a field, a document with no data in that field comes 
first or last (I don't know which). 

- Original Message -
| From: Dotan Cohen dotanco...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Wednesday, October 31, 2012 6:54:47 PM
| Subject: Re: After adding field to schema, the field is not being returned in 
results.
| 
| On Thu, Nov 1, 2012 at 2:52 AM, Otis Gospodnetic
| otis.gospodne...@gmail.com wrote:
|  Hi,
| 
|  That should work just fine.  It;s either a bug or you are doing
|  something
|  you didn't mention.  Maybe you can provide a small, self-enclosed
|  unit test
|  and stick it in JIRA?
| 
| 
| I would assume that it's me doing something wrong! How does this
| look:
| 
| /solr/select?q=*&rows=1&sort=created_iso8601%20desc&fl=created_iso8601,created
| 
| <response>
|   <lst name="responseHeader">
|     <int name="status">0</int>
|     <int name="QTime">1</int>
|     <lst name="params">
|       <str name="q">*:*</str>
|       <str name="rows">1</str>
|       <str name="fl">created_iso8601,created</str>
|     </lst>
|   </lst>
|   <result name="response" numFound="1037937" start="0">
|     <doc>
|       <int name="created">1350854389</int>
|     </doc>
|   </result>
| </response>
| 
| Surely the sort parameter would throw an error if the
| created_iso8601field did not exist. That field is indexed and stored,
| with no parameters defined on handlers that may list the fields to
| return as Alexandre had mentioned.
| 
| 
| --
| Dotan Cohen
| 
| http://gibberish.co.il
| http://what-is-what.com
| 


Re: throttle segment merging

2012-10-28 Thread Lance Norskog
1) Do you use compound files (CFS)? This adds a lot of overhead to merging.
2) Does ES use the same merge policy code as Solr?

In solrconfig.xml, here are the lines that control segment merging. You can 
probably set mergeFactor to 20 and cut the amount of disk I/O.

<!-- Expert: Merge Policy
     The Merge Policy in Lucene controls how merging of segments is done.
     The default since Solr/Lucene 3.3 is TieredMergePolicy.
     The default since Lucene 2.3 was the LogByteSizeMergePolicy,
     Even older versions of Lucene used LogDocMergePolicy.
  -->
<!--
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>
  -->

<!-- Merge Factor
     The merge factor controls how many segments will get merged at a time.
     For TieredMergePolicy, mergeFactor is a convenience parameter which
     will set both MaxMergeAtOnce and SegmentsPerTier at once.
     For LogByteSizeMergePolicy, mergeFactor decides how many new segments
     will be allowed before they are merged into one.
     Default is 10 for both merge policies.
  -->
<!--
<mergeFactor>10</mergeFactor>
  -->

<!-- Expert: Merge Scheduler
     The Merge Scheduler in Lucene controls how merges are
     performed.  The ConcurrentMergeScheduler (Lucene 2.3 default)
     can perform merges in the background using separate threads.
     The SerialMergeScheduler (Lucene 2.2 default) does not.
  -->
<!--
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
  -->
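
So, applying the suggestion above would just be uncommenting the merge factor
and raising it (a sketch, not benchmarked here):

<mergeFactor>20</mergeFactor>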


- Original Message -
| From: Radim Kolar h...@filez.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, October 27, 2012 7:44:46 PM
| Subject: Re: throttle segment merging
| 
| Dne 26.10.2012 3:47, Tomás Fernández Löbbe napsal(a):
|  Is there way to set-up logging to output something when segment
|  merging
|  runs?
| 
|  I think segment merging is logged when you enable infoStream
|  logging (you
|  should see it commented in the solrconfig.xml)
| no, segment merging is not logged at info level. it needs customized
| log
| config.
| 
| 
|  Can be segment merges throttled?
|   You can change when and how segments are merged with the merge
| policy, maybe it's enough for you changing the initial settings
| (mergeFactor for example)?
| 
| I am now researching elasticsearch, it can do it, its lucene 3.6
| based
| 


Re: Get metadata for query

2012-10-27 Thread Lance Norskog
Nope! Each document comes back with its own list of stored fields. If you want 
to find all fields in an index, you have to fetch every last document and OR in 
the fields in that document. There is no Solr call to get a full list of static 
or dynamic fields.

If you use lots of dynamic fields I can see how this would be useful for 
pan-index tasks like assessing data quality.

- Original Message -
| From: Jack Krupansky j...@basetechnology.com
| To: solr-user@lucene.apache.org
| Sent: Friday, October 26, 2012 7:41:58 PM
| Subject: Re: Get metadata for query
| 
| I'm not sure I understand the real question here. What is the
| metadata.
| 
| I mean, q=xfl=* gives you all the (stored) fields for documents
| matching
| the query.
| 
| What else is there?
| 
| -- Jack Krupansky
| 
| -Original Message-
| From: Lance Norskog
| Sent: Friday, October 26, 2012 9:42 PM
| To: solr-user@lucene.apache.org
| Subject: Re: Get metadata for query
| 
| Ah, there's the problem- what is a fast way to fetch all fields in a
| collection, including dynamic fields?
| 
| - Original Message -
| | From: Otis Gospodnetic otis.gospodne...@gmail.com
| | To: solr-user@lucene.apache.org
| | Sent: Friday, October 26, 2012 3:05:04 PM
| | Subject: Re: Get metadata for query
| |
| | Hi,
| |
| | No... but you could simply query your index, get all the fields you
| | need and process them to get what you need.
| |
| | Otis
| | --
| | Search Analytics - http://sematext.com/search-analytics/index.html
| | Performance Monitoring - http://sematext.com/spm/index.html
| |
| |
| | On Fri, Oct 26, 2012 at 10:19 AM, Torben Honigbaum
| | torben.honigb...@neuland-bfi.de wrote:
| |  Hi everybody,
| | 
| |  with http://localhost:8983/solr/admin/luke it's possible to get
| |  metadata for all indices. But is there a way to get only the
| |  metadata for a special query? I want to query all documents which
| |  are in a special category. For the query I need the metadata
| |  containing a list of all fields of the documents.
| | 
| |  Thank you
| |  Torben
| | 
| 
| 


Re: Get metadata for query

2012-10-27 Thread Lance Norskog
Erk, haven't used /luke in years. Apologies.

About that JS: does distributed search do the right thing when the 
distributed part is not implemented? Or does every script have to explicitly 
include distributed search support?

- Original Message -
| From: Erik Hatcher erik.hatc...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Saturday, October 27, 2012 4:14:12 AM
| Subject: Re: Get metadata for query
| 
| Lance Lance Lance :)  As the OP said, you can use /admin/luke to
| get all the fields (static and dynamic) used in the index.  I've
| used that trick to get a list of all *_facet dynamic fields to then
| have my UI (Blacklight's first prototypes, aka Solr Flare) turn
| around and facet on them.  The request to /admin/luke was done once
| and cached.
| 
| But I think what Torben is going for is the
| FieldsUsedUpdateProcessor trick like
| https://issues.apache.org/jira/browse/SOLR-1280.
| 
| In Solr 4 there is a JavaScript update processor example, commented
| out, that will add a field to every document containing the names of
| the fields (constrained to the name pattern of attr_* in the
| example) for that document.  One can then use that to facet upon.
| 
| In Solr 4, it's here:
| 
https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_0_0/solr/example/solr/collection1/conf/update-script.js
| 
| Note, the field name in a comment in there is incorrect (I'll commit
| a fix), but if you used that update processor, you could then do a
| query and facet on field attribute_ss and across that result set see
| what fields are contained within it.  I've seen this trick employed
| at the Smithsonian first hand, where there are so many different
| attributes across the documents that it's hard to know what the best
| facets are for the result set.
| 
|   Erik
| 
| 
| On Oct 27, 2012, at 04:09 , Lance Norskog wrote:
| 
|  Nope! Each document comes back with its own list of stored fields.
|  If you want to find all fields in an index, you have to fetch
|  every last document and OR in the fields in that document. There
|  is no Solr call to get a full list of static or dynamic fields.
|  
|  If you use lots of dynamic fields I can see how this would be
|  useful for pan-index tasks like assessing data quality.
|  
|  - Original Message -
|  | From: Jack Krupansky j...@basetechnology.com
|  | To: solr-user@lucene.apache.org
|  | Sent: Friday, October 26, 2012 7:41:58 PM
|  | Subject: Re: Get metadata for query
|  | 
|  | I'm not sure I understand the real question here. What is the
|  | metadata.
|  | 
|  | I mean, q=xfl=* gives you all the (stored) fields for documents
|  | matching
|  | the query.
|  | 
|  | What else is there?
|  | 
|  | -- Jack Krupansky
|  | 
|  | -Original Message-
|  | From: Lance Norskog
|  | Sent: Friday, October 26, 2012 9:42 PM
|  | To: solr-user@lucene.apache.org
|  | Subject: Re: Get metadata for query
|  | 
|  | Ah, there's the problem- what is a fast way to fetch all fields
|  | in a
|  | collection, including dynamic fields?
|  | 
|  | - Original Message -
|  | | From: Otis Gospodnetic otis.gospodne...@gmail.com
|  | | To: solr-user@lucene.apache.org
|  | | Sent: Friday, October 26, 2012 3:05:04 PM
|  | | Subject: Re: Get metadata for query
|  | |
|  | | Hi,
|  | |
|  | | No... but you could simply query your index, get all the fields
|  | | you
|  | | need and process them to get what you need.
|  | |
|  | | Otis
|  | | --
|  | | Search Analytics -
|  | | http://sematext.com/search-analytics/index.html
|  | | Performance Monitoring - http://sematext.com/spm/index.html
|  | |
|  | |
|  | | On Fri, Oct 26, 2012 at 10:19 AM, Torben Honigbaum
|  | | torben.honigb...@neuland-bfi.de wrote:
|  | |  Hi everybody,
|  | | 
|  | |  with http://localhost:8983/solr/admin/luke it's possible to
|  | |  get
|  | |  metadata for all indices. But is there a way to get only the
|  | |  metadata for a special query? I want to query all documents
|  | |  which
|  | |  are in a special category. For the query I need the metadata
|  | |  containing a list of all fields of the documents.
|  | | 
|  | |  Thank you
|  | |  Torben
|  | | 
|  | 
|  | 
| 
| 


Re: lukeall.jar for Solr4r?

2012-10-27 Thread Lance Norskog
Aha! Andrzej has not built a 4.0 release version. You need to check out the 
source and compile your own.

http://code.google.com/p/luke/downloads/list

- Original Message -
| From: Carrie Coy c...@ssww.com
| To: solr-user@lucene.apache.org
| Sent: Friday, October 26, 2012 7:33:45 AM
| Subject: lukeall.jar for Solr4r?
| 
| Where can I get a copy of Luke capable of reading Solr4 indexes?  My
| lukeall-4.0.0-ALPHA.jar no longer works.
| 
| Thx,
| Carrie Coy
| 


Re: DIH throws NullPointerException when using dataimporter.functions.escapeSql with parent entities

2012-10-26 Thread Lance Norskog
Which database rows cause the problem? The bug report talks about fields with 
an empty string. Do your rows have empty string values?

- Original Message -
| From: Dominik Siebel m...@dsiebel.de
| To: solr-user@lucene.apache.org
| Sent: Monday, October 22, 2012 3:15:29 AM
| Subject: Re: DIH throws NullPointerException when using 
dataimporter.functions.escapeSql with parent entities
| 
| That's what I thought.
| I'm just curious that nobody else seems to have this problem although
| I found the exact same issue description in the issue tracker
| (https://issues.apache.org/jira/browse/SOLR-2141) which goes back to
| October 2010 and is flagged as Resolved: Cannot Reproduce.
| 
| 
| 2012/10/20 Lance Norskog goks...@gmail.com:
|  If it worked before and does not work now, I don't think you are
|  doing anything wrong :)
| 
|  Do you have a different version of your JDBC driver?
|  Can you make a unit test with a minimal DIH script and schema?
|  Or, scan through all of the JIRA issues against the DIH from your
|  old Solr capture date.
| 
| 
|  - Original Message -
|  | From: Dominik Siebel m...@dsiebel.de
|  | To: solr-user@lucene.apache.org
|  | Sent: Thursday, October 18, 2012 11:22:54 PM
|  | Subject: Fwd: DIH throws NullPointerException when using
|  | dataimporter.functions.escapeSql with parent entities
|  |
|  | Hi folks,
|  |
|  | I am currently migrating our Solr servers from a 4.0.0 nightly
|  | build
|  | (aprox. November 2011, which worked very well) to the newly
|  | released
|  | 4.0.0 and am running into some issues concerning the existing
|  | DataImportHandler configuratiions. Maybe you have an idea where I
|  | am
|  | going wrong here.
|  |
|  | The following lines are a highly simplified excerpt from one of
|  | the
|  | problematic imports:
|  |
|  | entity name=path rootEntity=false query=SELECT p.id,
|  | IF(p.name
|  | IS NULL, '', p.name) AS name FROM path p GROUP BY p.id
|  |
|  | entity name=item rootEntity=true query=
|  | SELECT
|  | i.*,
|  |
|  | CONVERT('${dataimporter.functions.escapeSql(path.name)}' USING
|  | utf8) AS path_name
|  | FROM items i
|  | WHERE i.path_id = ${path.id} /
|  |
|  | /entity
|  |
|  | While this configuration worked without any problem for over half
|  | a
|  | year now, when upgrading to 4.0.0-BETA AND 4.0.0 the Import
|  | throws
|  | the
|  | followeing Stacktrace and exits:
|  |
|  |  SEVERE: Exception while processing: path document :
|  | null:org.apache.solr.handler.dataimport.DataImportHandlerException:
|  | java.lang.NullPointerException
|  |
|  | which is caused by
|  |
|  | Caused by: java.lang.NullPointerException
|  | at
|  | 
org.apache.solr.handler.dataimport.EvaluatorBag$1.evaluate(EvaluatorBag.java:79)
|  |
|  | In other words: The EvaluatorBag doesn't seem to resolve the
|  | given
|  | path.name variable properly and returns null.
|  |
|  | Does anyone have any idea?
|  | Appreciate your input!
|  |
|  | Regards
|  | Dom
|  |
| 


Re: Get metadata for query

2012-10-26 Thread Lance Norskog
Ah, there's the problem- what is a fast way to fetch all fields in a 
collection, including dynamic fields?

- Original Message -
| From: Otis Gospodnetic otis.gospodne...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Friday, October 26, 2012 3:05:04 PM
| Subject: Re: Get metadata for query
| 
| Hi,
| 
| No... but you could simply query your index, get all the fields you
| need and process them to get what you need.
| 
| Otis
| --
| Search Analytics - http://sematext.com/search-analytics/index.html
| Performance Monitoring - http://sematext.com/spm/index.html
| 
| 
| On Fri, Oct 26, 2012 at 10:19 AM, Torben Honigbaum
| torben.honigb...@neuland-bfi.de wrote:
|  Hi everybody,
| 
|  with http://localhost:8983/solr/admin/luke it's possible to get
|  metadata for all indices. But is there a way to get only the
|  metadata for a special query? I want to query all documents which
|  are in a special category. For the query I need the metadata
|  containing a list of all fields of the documents.
| 
|  Thank you
|  Torben
| 


Re: Search and Entity structure

2012-10-26 Thread Lance Norskog
A side point: in fact, the connection between MBA and grade is not lost. The
values in a multi-valued field are stored in order. You can have separate
multi-valued fields with matching entries; the values are fetched back in the
same order, so you can match them up by position (by counting). This is not
database-ish, but it is a permanent feature.
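
For example, a minimal SolrJ sketch of that pair-by-position idea, assuming
'gradename' and 'grade' were made multi-valued and indexed in matching order
(illustrative, not tested against this setup):

// Pair up two parallel multi-valued fields by position.
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrDocument;

public class ParallelFields {
    public static List<String> qualifications(SolrDocument doc) {
        // getFieldValues() returns the stored values in index order.
        List<Object> names  = new ArrayList<Object>(doc.getFieldValues("gradename"));
        List<Object> grades = new ArrayList<Object>(doc.getFieldValues("grade"));
        List<String> pairs  = new ArrayList<String>();
        // The i-th gradename belongs with the i-th grade.
        for (int i = 0; i < names.size(); i++) {
            pairs.add(names.get(i) + "=" + grades.get(i));
        }
        return pairs;
    }
}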

Lance

- Original Message -
| From: v vijith vvij...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Friday, October 26, 2012 12:50:29 PM
| Subject: Re: Search and Entity structure
| 
| The schema content that I have put in is
| 
|   <field name="EMPID" type="integer" indexed="true" stored="true"
|          required="true" multiValued="false" />
|   <field name="empname" type="string" indexed="true" stored="true" />
|   <field name="gradeid" type="integer" indexed="true" stored="true" />
|   <field name="gradename" type="string" indexed="true" stored="true" />
|   <field name="grade" type="string" indexed="true" stored="true" />
|   <uniqueKey>EMPID</uniqueKey>
| 
| The dataconfig file is
| <document>
|   <entity name="employee" query="select * from employee">
|     <entity name="qualification" query="select * from
|             qualification where empid='${employee.EMPID}'" />
|   </entity>
| </document>
| 
| With this as well, when I try, I get the entity as below -
| <result name="response" numFound="3" start="0">
|   <doc>
|     <int name="EMPID">3</int><str name="empname">Viktor</str></doc>
|   <doc>
|     <int name="EMPID">2</int>
|     <str name="empname">George</str>
|     <str name="grade">C</str>
|     <int name="gradeid">4</int>
|     <str name="gradename">PM</str></doc>
|   <doc>
|     <int name="EMPID">1</int><str name="empname">John</str>
|     <str name="grade">B</str><int name="gradeid">2</int>
|     <str name="gradename">LEAD</str>
|   </doc>
| 
| The issue is that employee George has 2 qualifications but only one is
| shown in the result. I believe this is due to the unique id. Can you
| provide some help?
| 
| 
| 
| On Fri, Oct 26, 2012 at 8:46 PM, Gora Mohanty g...@mimirtech.com
| wrote:
|  On 25 October 2012 23:48, v vijith vvij...@gmail.com wrote:
|  Dear All,
| 
|  Apologize for lengthy email 
| 
|  SOLR Version: 4
| 
|  Im a newbie to SOLR and have gone through tutorial but could not
|  get a
|  solution. The below requirement doesnt seem to be impossible but I
|  think Im missing the obvious.
| 
|  In my RDBMS, there is a Qualification table and an Employee table.
|  An
|  employee can have many qualifications. The qualification can have
|  following attributes - GradeName and Grade. The search using sql
|  query
|  to achieve my requirement is as below
| 
|  select * from qualification a, employee b where a.empid= b.empid
|  and
|  a.gradename='MBA' and b.grade='A';
| 
|  This will return me the employee along with the dept who has the
|  grade
|  as MBA and has grade of A.
| 
|  Employee: 2 records
|  -
|  Empid: 1
|  Name: John
|  Location: California
| 
|  Qualifications:
|  Gradeid: 1
|  Empid: 1
|  Name: MBA
|  Grade: B
| 
|  Gradeid: 2
|  Empid: 1
|  Name: LEAD
|  Grade: A
|  
| 
|  Empid: 2
|  Name: George
|  Location: Nevada
| 
|  Qualifications:
|  Gradeid: 3
|  Empid: 2
|  Name: MBA
|  Grade: A
| 
|  Gradeid: 4
|  Empid: 2
|  Name: Graduate
|  Grade: C
| 
|  Stop thinking of Solr in terms of RDBMS. Instead, flatten out your
|  data. Thus, in your example, you could have a schema with the
|  following fields:
|  doc_id name location qualification grade
|  doc_id is a unique identifier for Solr. If you want to retain Empid
|  and Gradeid you could also add these.
| 
|  and the following entries
|  1 John California MBA B
|  2 John California Lead A
|  3 George Nevada MBA A
|  4 George Nevada Graduate C
| 
|  Searching for qualification:MBA and grade:A will then give you only
|  record 3.
| 
|  Regards,
|  Gora
| 


Re: Solr-4.0.0 DIH not indexing xml attributes

2012-10-19 Thread Lance Norskog
Do other fields get added?
Do these fields have type problems? I.e. is 'attr1' a number and you are adding 
a string?
There is a logging EP that I think shows the data found, but I don't know how
to use it.
Is it possible to post the whole DIH script?

- Original Message -
| From: Billy Newman newman...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Friday, October 19, 2012 9:06:08 AM
| Subject: Solr-4.0.0 DIH not indexing xml attributes
| 
| Hello all,
| 
| I am having problems indexing xml attributes using the DIH.
| 
| I have the following xml:
| 
| <root>
|   <Stuff attr1="some attr" attr2="another attr">
|   ...
|   </Stuff>
| </root>
| 
| I am using the following XPath for my fields:
| <field column="attr1" xpath="/root/Stuff/@attr1" />
| <field column="attr2" xpath="/root/Stuff/@attr2" />
| 
| 
| However nothing is getting inserted into my index.
| 
| I am pretty sure this should work so I have no idea what is wrong.
| 
| Can anyone else confirm that this is a problem?  Or is it just me?
| 
| Thanks,
| Billy
| 


Re: DIH throws NullPointerException when using dataimporter.functions.escapeSql with parent entities

2012-10-19 Thread Lance Norskog
If it worked before and does not work now, I don't think you are doing anything 
wrong :)

Do you have a different version of your JDBC driver?
Can you make a unit test with a minimal DIH script and schema?
Or, scan through all of the JIRA issues against the DIH from your old Solr 
capture date.


- Original Message -
| From: Dominik Siebel m...@dsiebel.de
| To: solr-user@lucene.apache.org
| Sent: Thursday, October 18, 2012 11:22:54 PM
| Subject: Fwd: DIH throws NullPointerException when using 
dataimporter.functions.escapeSql with parent entities
| 
| Hi folks,
| 
| I am currently migrating our Solr servers from a 4.0.0 nightly build
| (approx. November 2011, which worked very well) to the newly released
| 4.0.0 and am running into some issues concerning the existing
| DataImportHandler configurations. Maybe you have an idea where I am
| going wrong here.
| 
| The following lines are a highly simplified excerpt from one of the
| problematic imports:
| 
| <entity name="path" rootEntity="false" query="SELECT p.id, IF(p.name
|         IS NULL, '', p.name) AS name FROM path p GROUP BY p.id">
| 
|   <entity name="item" rootEntity="true" query="
|           SELECT
|             i.*,
|             CONVERT('${dataimporter.functions.escapeSql(path.name)}' USING utf8) AS path_name
|           FROM items i
|           WHERE i.path_id = ${path.id}" />
| 
| </entity>
| 
| While this configuration worked without any problem for over half a
| year now, when upgrading to 4.0.0-BETA and 4.0.0 the import throws
| the following stack trace and exits:
| 
|  SEVERE: Exception while processing: path document :
| null:org.apache.solr.handler.dataimport.DataImportHandlerException:
| java.lang.NullPointerException
| 
| which is caused by
| 
| Caused by: java.lang.NullPointerException
| at
| 
org.apache.solr.handler.dataimport.EvaluatorBag$1.evaluate(EvaluatorBag.java:79)
| 
| In other words: The EvaluatorBag doesn't seem to resolve the given
| path.name variable properly and returns null.
| 
| Does anyone have any idea?
| Appreciate your input!
| 
| Regards
| Dom
| 


Re: Flushing RAM to disk

2012-10-17 Thread Lance Norskog
There is no "RAMDirectory backed by disk" feature. MMapDirectory uses the
operating system to do almost exactly the same thing, in a much better way.
That is why it is the default.
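
If you do want to pin the choice explicitly, it is set in solrconfig.xml; a
minimal sketch (only the one element shown, the rest of the file is omitted):

<!-- solrconfig.xml: pick the Directory implementation.
     MMapDirectoryFactory memory-maps index files through the OS page cache;
     RAMDirectoryFactory is purely in-memory and is NOT persisted to disk. -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>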

- Original Message -
| From: deniz denizdurmu...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Tuesday, October 16, 2012 9:58:07 PM
| Subject: Flushing RAM to disk
| 
| Hi all, I have a question about Solr directories... Basically I will
| be using a RAM directory for the project, but I am curious if it is
| possible to flush (or copy from) RAM to disk, via a cronjob or a timer
| in Java code? If yes, could anyone give me some details about it?
| 
| thank you
| 
| 
| 
| -
| Zeki ama calismiyor... Calissa yapar...
| --
| View this message in context:
| http://lucene.472066.n3.nabble.com/Flushing-RAM-to-disk-tp4014128.html
| Sent from the Solr - User mailing list archive at Nabble.com.
| 


Re: Flushing RAM to disk

2012-10-17 Thread Lance Norskog
I do not know how to load an index from disk into a RAMDirectory in Solr.

- Original Message -
| From: deniz denizdurmu...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Wednesday, October 17, 2012 12:15:52 AM
| Subject: Re: Flushing RAM to disk
| 
| I heard about MMapDirectory - actually my test env is using that -
| but the question was just an idea... and how about using SolrCloud?
| I mean, can we set shards to use RAM and replicas to use
| MMapDirectory? Is this possible?
| 
| 
| 
| -
| Zeki ama calismiyor... Calissa yapar...
| --
| View this message in context:
| http://lucene.472066.n3.nabble.com/Flushing-RAM-to-disk-tp4014128p4014155.html
| Sent from the Solr - User mailing list archive at Nabble.com.
| 


Re: How many documents in each Lucene segment?

2012-10-16 Thread Lance Norskog
CheckIndex prints these stats.

java -cp lucene-core-WHATEVER.jar org.apache.lucene.index.CheckIndex
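
For example, pointing it at a core's index directory (the jar version and the
path are illustrative):

java -cp lucene-core-4.0.0.jar org.apache.lucene.index.CheckIndex /path/to/solr/collection1/data/index

Among other things it prints the document count (and deletion count) for
every segment.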

- Original Message -
| From: Shawn Heisey s...@elyograg.org
| To: solr-user@lucene.apache.org
| Sent: Monday, October 15, 2012 9:46:33 PM
| Subject: Re: How many documents in each Lucene segment?
| 
| On 10/15/2012 8:06 PM, Michael Ryan wrote:
|  Easiest way I know of without parsing any of the index files is to
|  take the size of the fdx file in bytes and divide by 8. This will
|  give you the exact number of documents before 4.0, and a close
|  approximation in 4.0.
| 
|  Though, the fdx file might not be on disk if you haven't committed.
| 
| When you are importing 12 million documents from a database, you get
| LOTS
| of completed segments even if there is no commit until the end.  The
| ramBuffer fills up pretty quick.
| 
| I intend to figure out how many documents are in the segments
| (ramBufferSizeMB=256) and try out an autoCommit setting a little bit
| lower than that.  I had trouble with autoCommit on previous versions,
| but with 4.0 I can turn off openSearcher, which may allow it to work
| right.
| 
| Thanks,
| Shawn
| 
| 


Re: Solr Autocomplete

2012-10-15 Thread Lance Norskog
http://find.searchhub.org/?q=autosuggest+OR+autocomplete

- Original Message -
| From: Rahul Paul rahul.p...@iiitb.org
| To: solr-user@lucene.apache.org
| Sent: Monday, October 15, 2012 9:01:14 PM
| Subject: Solr Autocomplete
| 
| Hi,
| I am using MySQL to index data into Solr. I have two fields: name
| and college. How can I add autosuggest based on these two fields?
| 
| 
| 
| --
| View this message in context:
| http://lucene.472066.n3.nabble.com/Solr-Autocomplete-tp4013859.html
| Sent from the Solr - User mailing list archive at Nabble.com.
| 


Re: Solr - db-data-config.xml general asking to entity

2012-10-14 Thread Lance Norskog
Two answers:
1) Do you have maybe user names or timestamps for the comments?
Usually people want those also.
2) You can store the comments as one long string, or as multiple
entries in a field. Your database should have a concatenate function
that will take field X from multiple documents in a join and make a
long string. I would concatenate the comments into one string with a
magic separator, and then just split them up in my application.

It is not simple to get multiple join results into one document.
Here's how it works:
a) Collect all of the results of 'comment' into one long string and
use a unique character to separate them. This is your one document
from your query.
b) Use the RegexTransformer in the DIH to split the long string into
several values.

http://lucidworks.lucidimagination.com/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
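
A minimal sketch of (a) and (b) in the DIH config - note GROUP_CONCAT is
MySQL syntax (SQL Server needs a different concatenation trick, e.g. FOR XML
PATH), and the '|' separator is only safe if it never occurs in a comment:

<entity name="blog" pk="id" transformer="RegexTransformer"
        query="SELECT b.id, b.title, b.message AS blogMessage,
                      GROUP_CONCAT(c.message SEPARATOR '|') AS commentMessage
               FROM blog b LEFT JOIN comment c ON b.id = c.source_id
               GROUP BY b.id">
  <!-- RegexTransformer splits the concatenated string back into a
       multi-valued field -->
  <field column="commentMessage" splitBy="\|" />
</entity>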



On Sat, Oct 13, 2012 at 6:21 AM, Marvin markus.pfeif...@ebcont-et.com wrote:
 Hi there!
 I have 2 tables 'blog' and 'comment'. A blog can contain n comments (blog
 --1:n-- comment). Currently I use the following select to insert the data
 into the Solr index:

 <entity name="blog" dataSource="mssqlDatasource" pk="id"
         transformer="ClobTransformer"
         query="SELECT b.id, b.market, b.title AS blogTitle, b.message AS blogMessage,
                       c.message AS commentMessage
                FROM blog b LEFT JOIN comment c ON b.id = c.source_id
                AND c.source_type = 'blog'">
   <field column="blogMessage" name="blogMessage" clob="true" />
   <field column="commentMessage" name="commentMessage" clob="true" />
 </entity>

 The index result looks like:

 <doc>
   <str name="id">1</str>
   <str name="market">12</str>
   <str name="title">blog of title 1</str>
   <str name="blogMessage">message of blog 1</str>
   <str name="commentMessage">message of comment</str>
 </doc>

 <doc>
   <str name="id">1</str>
   <str name="market">12</str>
   <str name="title">blog of title 1</str>
   <str name="blogMessage">message of blog 1</str>
   <str name="commentMessage">message of comment - Im the second comment</str>
 </doc>

 I would say this is wasteful because I get many index documents with the
 same blog data where just the comments are different. Is it possible to set
 'comment' as a sub-entity like the following:

 <entity name="blog" dataSource="mssqlDatasource" pk="id"
         transformer="ClobTransformer"
         query="SELECT b.id, b.market, b.title AS blogTitle, b.message AS blogMessage
                FROM blog b">
   <field column="blogMessage" name="blogMessage" clob="true" />

   <entity name="comment" dataSource="mssqlDatasource" pk="id"
           transformer="ClobTransformer"
           query="SELECT c.id, c.message AS commentMessage
                  FROM comment c
                  WHERE c.source_id = ${blog.id}">
     <field column="commentMessage" name="commentMessage" clob="true" />
   </entity>
 </entity>


 Is that possible?
 What would the result look like (can't test it until Monday)?
 All the examples I found have the sub-entity select just 1 column, but I
 need at least 2.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-db-data-config-xml-general-asking-to-entity-tp4013533.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: How to import a part of index from main Solr server(based on a query) to another Solr server and then do incremental import at intervals later(the updated index)?

2012-10-14 Thread Lance Norskog
Solr's Java Replication feature downloads changes to an index. It does
not need to pull the entire index.

I think what you need to do with the SolrEntityProcessor is this:
do a Solr sorted query on your last modified field and fetch the
timestamp from the first row. This would go in an outer 'entity'.
Inside this, you would have the 'root entity' for each document. This
might be your database query. You use the timestamp value from the
outer SolrEP query in your SQL.

I have not tried this; you will have to experiment.


On Sat, Oct 13, 2012 at 7:04 PM, jefferyyuan yuanyun...@gmail.com wrote:
 Thanks for the reply, but I think SolrReplication may not help in this case,
 as we don't want to replicate the whole index to solr2, just a part of the
 index (the docs created by me). It seems SolrReplication doesn't support
 replicating a part of the index (based on a query) to the slave.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-import-a-part-of-index-from-main-Solr-server-based-on-a-query-to-another-Solr-server-and-then-tp4013479p4013580.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: which api to use to manage solr ?

2012-10-12 Thread Lance Norskog
SolrJ is in Java, RSolr and ruby-solr are for Ruby, etc. These are for
low-level programming.
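
For example, a minimal SolrJ (4.x) query - the URL, core name, and field are
illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrJExample {
    public static void main(String[] args) throws Exception {
        // Point at a running Solr core and run a simple field query.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        QueryResponse rsp = server.query(new SolrQuery("title:solr"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}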

There are Solr plugins for WordPress, Django, Magento e-commerce, and
some other apps. Blacklight is a content manager for libraries.

What do you want to do with Solr?

On Fri, Oct 12, 2012 at 4:45 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Good evening,

 SolrJ lives in the same house as Solr itself, so...

 Otis
 --
 Performance Monitoring - http://sematext.com/spm
 On Oct 12, 2012 5:39 PM, autregalaxie yassine.el-ha...@esial.net wrote:

 Good morning everybody,

 I'm a new user of Solr, and I have to develop a new interface to manage
 Solr. I have found several APIs to do that (Blacklight, Sunspot, SolrJ,
 ruby-solr...) and I need your help to know which one is better and more
 reliable.

 Thank You



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/which-api-to-use-to-manage-solr-tp4013491.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Using

2012-10-12 Thread Lance Norskog
After that, remove your Ivy repository (~/.ivy2) and try again. Also
rename your Maven repository, just to rule it out.
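
Something like this, assuming the default cache locations:

rm -rf ~/.ivy2             # wipe the local Ivy repository/cache
mv ~/.m2 ~/.m2.disabled    # set the Maven repository aside instead of deleting it

Then re-run the resolve and see if the jetty orbit artifact downloads cleanly.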

 I have had weird problems with connectivity to different Ivy
repositories. I use a VPN service that pops out in different countries
(blackVPN) and some countries worked and other countries did not.

On Fri, Oct 12, 2012 at 3:52 PM, Erick Erickson erickerick...@gmail.com wrote:
 I've been building 4.x regularly. Have you tried ant clean-jars?

 Best
 Erick

 On Fri, Oct 12, 2012 at 6:32 PM, P Williams
 williams.tricia.l...@gmail.com wrote:
 Hi,

 Has anyone tried using <dependency org="org.apache.solr"
 name="solr-test-framework" rev="4.0.0" conf="test->default"/> with Apache
 IVY in their project?

 rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in:
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]  WARNINGS
 [ivy:resolve]   [FAILED ]
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit:
  (0ms)
 [ivy:resolve]    shared: tried
 [ivy:resolve]
 C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit
 [ivy:resolve]    public: tried
 [ivy:resolve]
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
 [ivy:resolve]   ::
 [ivy:resolve]   ::  FAILED DOWNLOADS::
 [ivy:resolve]   :: ^ see resolution messages for details  ^ ::
 [ivy:resolve]   ::
 [ivy:resolve]   ::
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
 [ivy:resolve]   ::
 [ivy:resolve]
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 Can anybody point me to the source of this error or a workaround?

 Thanks,
 Tricia



-- 
Lance Norskog
goks...@gmail.com


Re: Query foreign language synonyms / words of equivalent meaning?

2012-10-10 Thread Lance Norskog
I want an update processor that runs Translation Party.

http://translationparty.com/

http://downloadsquad.switched.com/2009/08/14/translation-party-achieves-hilarious-results-using-google-transl/

- Original Message -
| From: SUJIT PAL sujit@comcast.net
| To: solr-user@lucene.apache.org
| Sent: Wednesday, October 10, 2012 2:51:37 PM
| Subject: Re: Query foreign language synonyms / words of equivalent meaning?
| 
| Hi,
| 
| We are using google translate to do something like what you
| (onlinespending) want to do, so maybe it will help.
| 
| During indexing, we store the searchable fields from documents into a
| fields named _en, _fr, _es, etc. So assuming we capture title and
| body from each document, the fields are (title_en, body_en),
| (title_fr, body_fr), etc, with their own analyzer chains. These
| documents come from a controlled source (ie not the web), so we know
| the language they are authored in.
| 
| During searching, a custom component intercepts the client language
| and the query. The query is sent to google translate for language
| detection. The largest amount of docs in the corpus is english, so
| if the detected language is either english or the client language,
| then we call google translate again to find the translated query in
| the other (english or client) language. Another custom component
| constructs an OR query between the two languages one component of
| which is aimed at the _en field set and the other aimed at the _xx
| (client language) field set.
| 
| -sujit
| 
| On Oct 9, 2012, at 11:24 PM, Bernd Fehling wrote:
| 
|  
|  As far as I know, there is no built-in functionality for language
|  translation.
|  I would propose to write one, but there are many many pitfalls.
|  If you want to translate from one language to another you might
|  have to
|  know the starting language. Otherwise you get problems with
|  translation.
|  
|  Not (german) - distress (english), affliction (english)
|  
|  - you might have words in one language which are stopwords in the
|  other language, but not in the first
|  - you don't have a one to one mapping, it's more like 1 to n+x
|   toilette (french) - bathroom, rest room / restroom, powder room
|  
|  This are just two points which jump into my mind but there are tons
|  of pitfalls.
|  
|  We use the solution of a multilingual thesaurus as synonym
|  dictionary.
|  http://en.wikipedia.org/wiki/Eurovoc
|  It holds translations of 22 official languages of the European
|  Union.
|  
|  So a search for europäischer währungsfonds gives also results
|  with
|  european monetary fund, fonds monétaire européen, ...
|  
|  Regards
|  Bernd
|  
|  
|  
|  Am 10.10.2012 04:54, schrieb onlinespend...@gmail.com:
|  Hi,
|  
|  English is going to be the predominant language used in my
|  documents, but
|  there may be a spattering of words in other languages (such as
|  Spanish or
|  French). What I'd like is to initiate a query for something like
|  bathroom
|  for example and for Solr to return documents that not only contain
|  bathroom but also baño (Spanish). And the same goes when
|  searching for 
|  baño. I'd like Solr to return documents that contain either
|  bathroom or 
|  baño.
|  
|  One possibility is to pre-translate all indexed documents to a
|  common
|  language, in this case English. And if someone were to search
|  using a
|  foreign word, I'd need to translate that to English before issuing
|  a query
|  to Solr. This appears to be problematic, since I'd have to know
|  whether the
|  indexed words and the query are even in a foreign language, which
|  is not
|  trivial.
|  
|  Another possibility is to pre-build a list of foreign word
|  synonyms. So baño
|  would be listed as a synonym for bathroom. But I'd need to include
|  other
|  languages (such as toilette in French) and other words. This
|  requires that
|  I know in advance all possible words I'd need to include foreign
|  language
|  versions of (not to mention needing to know which languages to
|  include).
|  This isn't trivial either.
|  
|  I'm assuming there's no built-in functionality that supports the
|  foreign
|  language translation on the fly, so what do people propose?
|  
|  Thanks!
|  
|  
|  --
|  *
|  Bernd FehlingUniversitätsbibliothek Bielefeld
|  Dipl.-Inform. (FH)LibTec - Bibliothekstechnologie
|  Universitätsstr. 25 und Wissensmanagement
|  33615 Bielefeld
|  Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de
|  
|  BASE - Bielefeld Academic Search Engine - www.base-search.net
|  *
| 
| 


Re: Using additional dictionary with DirectSolrSpellChecker

2012-10-10 Thread Lance Norskog
Hapax legomena (terms with a DF of 1) are very often typos. You can automatically
build a stopword file from these. If you want to be picky, you can keep only the
words that are a very small edit distance from words with a much larger DF.
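
A minimal sketch of the first step with the Lucene 4.x API - the index path
and field name are illustrative, and you would still want to eyeball the
output before trusting it as a stopword file:

// Dump every term in field "text" whose docFreq is 1 (likely typos).
import java.io.File;
import java.io.PrintWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class HapaxDumper {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader =
            DirectoryReader.open(FSDirectory.open(new File("/path/to/solr/data/index")));
        Terms terms = MultiFields.getTerms(reader, "text");
        if (terms != null) {
            TermsEnum te = terms.iterator(null);
            PrintWriter out = new PrintWriter("hapax-stopwords.txt", "UTF-8");
            BytesRef term;
            while ((term = te.next()) != null) {
                if (te.docFreq() == 1) {      // term appears in exactly one document
                    out.println(term.utf8ToString());
                }
            }
            out.close();
        }
        reader.close();
    }
}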

- Original Message -
| From: Robert Muir rcm...@gmail.com
| To: solr-user@lucene.apache.org
| Sent: Wednesday, October 10, 2012 5:40:23 PM
| Subject: Re: Using additional dictionary with DirectSolrSpellChecker
| 
| On Wed, Oct 10, 2012 at 9:02 AM, O. Klein kl...@octoweb.nl wrote:
|  I don't want to tweak the threshold. For majority of cases it works
|  fine.
| 
|  It's for cases where term has low frequency but is spelled
|  correctly.
| 
|  If you lower the threshold you would also get incorrect spelled
|  terms as
|  suggestions.
| 
| 
| Yeah there is no real magic here when the corpus contains typos. this
| existing docFreq heuristic was just borrowed from the old index-based
| spellchecker.
| 
| I do wonder if using # of occurrences (totalTermFreq) instead of # of
| documents with the term (docFreq) would improve the heuristic.
| 
| In all cases I think if you want to also integrate a dictionary or
| something, it seems like this could somehow be done with the
| File-based spellchecker?
| 

