Entity with multiple datasources
Hello, I created a data-config.xml file where I define a dataSource and an entity with 12 fields. In my use case I have 2 databases with the same schema, so I want to combine the 2 databases in one index. I defined a second dataSource tag and duplicated the entity with its fields (changed the name and the dataSource). What I'm expecting is to get around 7k results (I have around 6k in the first db and 1k in the second). However I'm getting a total of 2k. Where could the problem be? Thanks
Re: 'foruns' don't match 'forum' with NGramFilterFactory (or EdgeNGramFilterFactory)
Hi, It's funny that if you try fóruns it matches: http://bhakta.casadomato.org:8982/solr/select/?q=f%C3%B3runs&version=2.2&start=0&rows=10&indent=on But when you try foruns, it does not. Check this out: http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3rum&qverbose=on&qval=foruns See that stemming does not work for the word foruns. Could it be because fórum is part of the PT dictionary but not forum? Regards, 2012/2/14 Bráulio Bhavamitra brauli...@gmail.com Hello all, I'm experimenting with NGramFilterFactory and EdgeNGramFilterFactory. Both of them show a match in my Solr admin analysis, but when I query 'foruns' it doesn't find any 'forum'. analysis http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3runs&qverbose=on&qval=f%C3%B3runs search http://bhakta.casadomato.org:8982/solr/select/?q=foruns&version=2.2&start=0&rows=10&indent=on Anybody know what the problem is? bráulio -- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Problem indexing a PDF directory
Hi all, I have a problem configuring PDF indexing from a directory in my Solr with DIH. With this data-config:

    <dataConfig>
      <dataSource type="BinFileDataSource" />
      <document>
        <entity name="tika-test" processor="FileListEntityProcessor"
                baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
                recursive="true" rootEntity="false" dataSource="null"/>
        <entity processor="FileListEntityProcessor"
                url="D:\gioconews_archivio\marzo2011" format="text">
          <field column="author" name="author" meta="true"/>
          <field column="title" name="title" meta="true"/>
          <field column="description" name="description" />
          <field column="comments" name="comments" />
          <field column="content_type" name="content_type" />
          <field column="last_modified" name="last_modified" />
        </entity>
      </document>
    </dataConfig>

I obtain this result:

    <str name="command">full-import</str>
    <str name="status">idle</str>
    <str name="importResponse" />
    <lst name="statusMessages">
      <str name="Time Elapsed">0:0:2.44</str>
      <str name="Total Requests made to DataSource">0</str>
      <str name="Total Rows Fetched">43</str>
      <str name="Total Documents Skipped">0</str>
      <str name="Full Dump Started">2012-02-12 19:06:00</str>
      <str name="">Indexing failed. Rolled back all changes.</str>
      <str name="Rolledback">2012-02-12 19:06:00</str>
    </lst>

Suggestions? thank you, alessio
Best requestHandler for typing error.
Hello. Which request handler do you use to catch typing errors like goolge = did you mean google? I want to combine my EdgeNGram autosuggestion with a clever autocorrection! What do you use? --- System: One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 1 Core with 45 Million Documents, other Cores 200.000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
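A common building block for did-you-mean (an aside, not from the thread) is the SpellCheckComponent hooked into a SearchHandler via last-components; collation then yields a single corrected query to offer the user. A sketch for solrconfig.xml, with the source field name spell and the handler name /didyoumean assumed:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>

    <requestHandler name="/didyoumean" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.count">5</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

Note that buildOnCommit on a 45-million-document core can be expensive; with a commit every minute you would probably rebuild the dictionary on a schedule instead.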
Re: Do we need reindexing from solr 1.4.1 to 3.5.0?
I kept the old schema and solrconfig files, but there were some errors due to which Solr was not loading. I don't know what those were. We have a few of our own custom plugins developed against 1.4.1.
Re: Do we need reindexing from solr 1.4.1 to 3.5.0?
We have both stored=true and stored=false fields in the schema, so we can't reindex the way you suggested; we tried that earlier.
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
Please find my replies inline. On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky alexey.verkhov...@gmail.com wrote:

Hi, all, I'm new here. Used Solr on a couple of projects before, but didn't need to dive deep into anything until now. These days, I'm doing a spike for a yellow-pages-type search server with the following technical requirements: ~10 mln listings in the database. A listing has a name, address, description, coordinates and a number of tags / filtering fields; no more than a kilobyte all told; i.e., theoretically the whole thing should fit in RAM without sharding. A typical query is either all text matches on name and/or description within a bounded box, or some combination of tag matches within a bounded box. Bounded boxes are 1 to 50 km wide, and contain up to 10^5 unfiltered listings (the average is more like 10^3). More than 50% of all the listings are in the frequently requested bounding boxes; however, a vast majority of listings are almost never displayed (because they don't match the other filters). Data never changes (i.e., a daily batch update; rebuild of the entire index and restart of all search servers is feasible, as long as it takes minutes, not hours).

Everybody starts with a daily bounce, but ends up with an UPDATED_AT column and delta updates; just consider the urgent-content-fix use case. I don't think it's worth relying on a daily bounce as a cornerstone of the architecture.

This thing ideally should serve up to 10^3 requests per second on a small (as in, less than 10 commodity boxes) cluster. In other words, a typical request should be CPU bound and take ~100-200 msec to process. Because of coordinates (that are almost never the same), caching of queries makes no sense;

You can use a grid of coordinates to reduce their entropy. If you filter by bounding box, the argument is a bounding box, not coordinates. Anyway, use postfiltering and cache=false for such filters: http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/

from what little I understand about Lucene internals, caching of filters probably doesn't make sense either.

But Solr does it: http://wiki.apache.org/solr/SolrCaching#filterCache

After perusing documentation and some googling (but almost no source code exploring yet), I understand how the schema and the queries will look, and now have to figure out a specific configuration that fits the performance/scalability requirements. Here is what I'm thinking: 1. Search server is an internal service that uses embedded Solr for the indexing part. RAMDirectoryFactory as index storage.

Bad idea. It's purposed mostly for tests; the closest production-purposed analogue is org.apache.lucene.store.instantiated.InstantiatedIndex.

2. All data is in some sort of persistent storage on a file system, and is loaded into memory when a search server starts up.

AFAIK the state of the art is to use a file directory (MMap or whatever) and rely on the Linux file system RAM cache. Also, Solr (and partially Lucene) caches some stuff on the heap itself: http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration. So this is mostly done already.

3. Data updates are handled as: update the persistent storage, start another cluster, load the world into RAM, flip the load balancer, kill the old cluster.

No again. Lucene has a pretty cool model of segments and generations purposed for incremental updates. And Solr does a lot to search in the old generation and warm up the new one simultaneously (it just takes some memory, you know, two times). I don't think a manual A/B scheme is applicable. Anyway, you can (but don't really need to) play around with the replication facilities, e.g. disable traffic for half of the nodes, push the new index onto them, let them warm up, enable traffic (such machinery never works smoothly due to the number of moving parts).

4. Solr returns IDs with relevance scores; actual presentations of listings (as JSON documents) are constructed outside of Solr and cached in Memcached, as mostly static content with a few templated bits, like distance <%= DISTANCE_TO(-123.0123, 45.6789) %>.

Using separate nodes to do the search and other nodes to stream the content sounds good (mentioned in every book). It looks like besides the score you can also return the distance to the user, i.e. there is no need for <%= DISTANCE_TO(-123.0123, 45.6789) %>, just <%= doc.DISTANCE %>; see http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance

5. All Solr caching is switched off.

But why?

Obviously, we are not the first people to do something like this with Solr, so I'm hoping for some collective wisdom on the following: Does this sound like a feasible set of requirements in terms of performance and scalability for Solr? Are we on the right path to solving this problem well? If not, what should we be doing instead? What nasty technical/architectural gotchas are we probably missing at this stage? One particular advice I'd be really happy to hear is you may not need RAMDirectoryFactory if
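A sketch of what the postfiltering advice above adds up to in a single request — assuming a field named geopoint and trunk-era spatial syntax; illustrative, not a tested recipe:

    q=name:pizza
    &fq={!bbox cache=false cost=100 sfield=geopoint pt=45.6789,-123.0123 d=50}
    &fl=id,score
    &rows=20

Here cache=false keeps the one-off bounding box out of the filter cache, and a cost of 100 or more asks Solr to apply it as a post filter, per the advanced-filter-caching post linked above.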
Re: Entity with multiple datasources
1. Do you see any errors / exceptions in the logs? 2. Could you have duplicates? On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com wrote: Hello, I created a data-config.xml file where I define a dataSource and an entity with 12 fields. In my use case I have 2 databases with the same schema, so I want to combine the 2 databases in one index. I defined a second dataSource tag and duplicated the entity with its fields (changed the name and the dataSource). What I'm expecting is to get around 7k results (I have around 6k in the first db and 1k in the second). However I'm getting a total of 2k. Where could the problem be? Thanks -- Regards, Dmitry Kan
Re: Spatial Search and faceting
Hi William, Thanks for the feedback. I will try the group query and see how the performance with 2 queries is. Best Regards, Ericz

On Thu, Feb 16, 2012 at 4:06 AM, William Bell billnb...@gmail.com wrote: One way to do it is to group by city and then sort=geodist() asc:

    select?group=true&group.field=city&sort=geodist() desc&rows=10&fl=city

It might require 2 calls to SOLR to get it the way you want.

On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler impalah...@googlemail.com wrote: Hi Solr community, I am doing a spatial search and then do a facet by city. Is it possible to then sort the faceted cities by distance? We would like to display the hits per city, but sort them by distance. Thanks Regards Ericz

    q=iphone
    fq={!bbox} sfield=geopoint pt=49.594857,8.468614 d=50
    fl=id,description,city,geopoint
    facet=true
    facet.field=city
    f.city.facet.limit=10
    f.city.facet.sort=count //geodist() asc

-- Bill Bell billnb...@gmail.com cell 720-256-8076
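Putting William's two-call suggestion together — a sketch, not verified against a live index, reusing the field names from Eric's query:

Call 1, one group per city, ordered by distance:

    select?q=iphone&sfield=geopoint&pt=49.594857,8.468614
          &group=true&group.field=city&sort=geodist() asc&fl=city&rows=10

Call 2, hit counts per city within the radius:

    select?q=iphone&fq={!bbox sfield=geopoint pt=49.594857,8.468614 d=50}
          &facet=true&facet.field=city&f.city.facet.limit=-1

The client then reads the city ordering from call 1 and looks up each city's hit count in call 2's facet results.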
Realtime search with multi clients updating index simultaneously.
I have a helpdesk application developed in PHP/MySQL. I want to implement real-time full-text search and I have shortlisted Solr. The MySQL database will store all the tickets and their updates, and that data will be imported to build the Solr index. All search requests will be handled by Solr. What I want is real-time search: the moment someone updates a ticket, it should be available for search. As per my understanding of Solr, this is how I think the system will work: a user updates a ticket - the database record is modified - a request is sent to the Solr server to modify the corresponding document in the index. I have read a book on Solr and the questions below are troubling me. 1. The book mentions that commits are slow in Solr: Depending on the index size, Solr's auto-warming configuration, and Solr's cache state prior to committing, a commit can take a non-trivial amount of time. Typically, it takes a few seconds, but it can take some number of minutes in extreme cases. If this is true, then how will I know when the data will be available for search, and how can I implement real-time search? Also, I don't want the ticket update operation to be slowed down (by adding the extra step of updating the Solr index). 2. It is also mentioned that there is no transaction isolation: This means that if more than one Solr client were to submit modifications and commit them at overlapping times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit. This applies to rollback as well. If this is a problem for your architecture then consider using one client process responsible for updating Solr. Does it mean that, due to the lack of transactional commits, Solr can mess up the updates when multiple people update tickets simultaneously? Now the question before me is: does Solr fit my case? If yes, how?
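One knob worth knowing about here (not raised in the thread): since Solr 1.4, an update request can carry a commitWithin attribute, which asks Solr to commit within the given number of milliseconds without the client issuing an explicit, blocking commit. A sketch of the XML update message, with hypothetical field names:

    <add commitWithin="5000">
      <doc>
        <field name="id">ticket-1042</field>
        <field name="subject">Printer offline after firmware update</field>
      </doc>
    </add>

This bounds the staleness window (here, 5 seconds) instead of making each ticket update wait on a commit, and it also sidesteps the overlapping-commit concern, since clients stop issuing commits themselves.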
Re: Entity with multiple datasources
1. Nothing in the logs. 2. No. On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com wrote: 1. Do you see any errors / exceptions in the logs? 2. Could you have duplicates? [...]
Re: Entity with multiple datasources
It sounds a bit as if SOLR stopped processing data once it had queried everything from the smaller dataset; that's why you have 2000. If you just have a handler pointed at the bigger data set (6k), do you manage to get all 6k db entries into Solr? On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com wrote: 1. Nothing in the logs. 2. No. [...] -- Regards, Dmitry Kan
Re: Entity with multiple datasources
I tried running with just one datasource (the one that has 6k entries) and it indexes them OK. The same if I run the 1k database separately: it indexes OK. On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com wrote: It sounds a bit as if SOLR stopped processing data once it had queried everything from the smaller dataset; that's why you have 2000. If you just have a handler pointed at the bigger data set (6k), do you manage to get all 6k db entries into Solr? [...]
Re: Entity with multiple datasources
OK, maybe you can show the db-data-config.xml just in case? Also, in schema.xml, does your uniqueKey correspond to the unique field in the db? On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev radut...@gmail.com wrote: I tried running with just one datasource (the one that has 6k entries) and it indexes them OK. The same if I run the 1k database separately: it indexes OK. [...] -- Regards, Dmitry Kan
Re: Entity with multiple datasources
    <dataConfig>
      <dataSource name="s" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="" user="" password=""/>
      <dataSource name="p" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="" user="" password=""/>
      <document>
        <entity name="ms" datasource="s"
                query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as m_machine_ivk,
                       m.sitename as m_sitename, m.deliveryDate as m_delivery_date, m.hotsite as m_hotsite,
                       m.guardian as m_guardian, m.warranty as m_warranty, m.contract as m_contract,
                       st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
                       sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
                       c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as m_c_code
                       FROM Machine AS m
                       LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
                       LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
                       LEFT JOIN Platform AS p ON m.fk_platform = p.id
                       LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
                       LEFT JOIN Country AS c ON fk_country = c.id"
                readOnly="true" transformer="DateFormatTransformer">
          <field column="id" />
          <field column="m_machine_serial"/>
          <field column="m_machine_ivk"/>
          <field column="m_sitename"/>
          <filed column="m_delivery_date" dateTimeFormat="yyyy-MM-dd"/>
          <field column="m_hotsite"/>
          <field column="m_guardian"/>
          <field column="m_warranty"/>
          <field column="m_contract"/>
          <field column="m_st_name"/>
          <field column="m_pm_name"/>
          <field column="m_p_name"/>
          <field column="m_sv_name"/>
          <field column="m_c_cluster_major"/>
          <field column="m_c_cluster_minor"/>
          <field column="m_c_country"/>
          <field column="m_c_code"/>
        </entity>
        <entity name="mp" datasource="p"
                query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as m_machine_ivk,
                       m.sitename as m_sitename, m.deliveryDate as m_delivery_date, m.hotsite as m_hotsite,
                       m.guardian as m_guardian, m.warranty as m_warranty, m.contract as m_contract,
                       st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
                       sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
                       c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as m_c_code
                       FROM Machine AS m
                       LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
                       LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
                       LEFT JOIN Platform AS p ON m.fk_platform = p.id
                       LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
                       LEFT JOIN Country AS c ON fk_country = c.id"
                readOnly="true" transformer="DateFormatTransformer">
          <field column="id" />
          <field column="m_machine_serial"/>
          <field column="m_machine_ivk"/>
          <field column="m_sitename"/>
          <filed column="m_delivery_date" dateTimeFormat="yyyy-MM-dd"/>
          <field column="m_hotsite"/>
          <field column="m_guardian"/>
          <field column="m_warranty"/>
          <field column="m_contract"/>
          <field column="m_st_name"/>
          <field column="m_pm_name"/>
          <field column="m_p_name"/>
          <field column="m_sv_name"/>
          <field column="m_c_cluster_major"/>
          <field column="m_c_cluster_minor"/>
          <field column="m_c_country"/>
          <field column="m_c_code"/>
        </entity>
      </document>
    </dataConfig>

I've removed the connection params. The unique key is id. On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan dmitry@gmail.com wrote: OK, maybe you can show the db-data-config.xml just in case? Also, in schema.xml, does your uniqueKey correspond to the unique field in the db? [...]
How to loop through the DataImportHandler query results?
Hi Solr community, I'm new to Solr and DataImportHandler. I have a requirement to fetch data from a database table and pass it to Solr. Parts of the existing data-config.xml and Solr schema.xml are given below.

data-config.xml:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://demo22122011.com" user="" password="1234" />
      <document name="sample">
        <entity name="adaptation" pk="sample_id" query="Select * from adap"
                transformer="TemplateTransformer,DateFormatTransformer">
          <field column="field_mrmid_value" name="mrm_id_camp_s_t" />
          <field column="field_sample_scope_value" name="sample_scope_camp_s_s" />
          <field column="field_quarterly_plan_value" name="quarterly_plan_camp_s_s" />
          <field column="field_business_unit_value" name="field_business_unit_value_camp_s_t" />
          <field column="field_sub_business_value" name="field_sub_business_value_camp_s_t" />
          <field column="field_cdescription_value" name="sample_description_camp_s_t" />
          <field column="field_sample_owner_value" name="sample_owner_camp_s_s" />
        </entity>
      </document>
    </dataConfig>

schema.xml:

    <schema name="example" version="1.2">
      <fields>
        <dynamicField name="*_camp_m_i" type="int"    indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_i" type="int"    indexed="true" stored="true" multiValued="false"/>
        <dynamicField name="*_camp_m_t" type="text"   indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_t" type="text"   indexed="true" stored="true" multiValued="false"/>
        <dynamicField name="*_camp_m_s" type="string" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_s" type="string" indexed="true" stored="true" multiValued="false"/>
        <dynamicField name="*_camp_m_l" type="long"   indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_l" type="long"   indexed="true" stored="true" multiValued="false"/>
      </fields>
    </schema>

The table used in the query (adap) is often modified; the number of columns in this table changes frequently. Hence we have to change data-config.xml whenever a column is added or deleted. To avoid that, we don't want to name the columns in the field tags, but want to map all the fields in the table to Solr fields even when we don't know how many columns the table has. I need a kind of loop which runs through all the query results and maps them to Solr fields. Please help me. Regards, Baranee
Re: SolrCloud Replication Question
On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote: Not sure if this is expected or not. Nope - should be already resolved or will be today though. - Mark Miller lucidimagination.com
Re: SolrCloud Replication Question
Ok, great. Just wanted to make sure someone was aware. Thanks for looking into this. On Thu, Feb 16, 2012 at 8:26 AM, Mark Miller markrmil...@gmail.com wrote: On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote: Not sure if this is expected or not. Nope - should be already resolved or will be today though. - Mark Miller lucidimagination.com
PatternReplaceFilterFactory group
PatternReplaceFilterFactory has no option to select the group to replace. Is there a reason for this, or could this be a nice feature?
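For what it's worth, the replacement string already supports capture-group references ($1, $2, ...), which covers many keep-only-this-group cases. A sketch that keeps just the digits captured by group 1:

    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^.*?(\d+).*$"
            replacement="$1"
            replace="all"/>

Selecting which group the pattern matches against, though, indeed has no option, so the feature request stands.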
custom scoring
Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e., a float field containing the app-specific score). In particular, we'd like to calculate the final score by doing some operations with both numbers (e.g., product, sqrt, ...). According to what we know, there are two ways to do this in SOLR: A) Sort by function [1]: We've tested an expression like sort=product(score, query_score) in the SOLR query, where score is the common SOLR IR score and query_score is our own precalculated score, but it seems that SOLR can only do this with stored/indexed fields (and obviously score is not stored/indexed). B) Function queries: We've used _val_ and function queries like max, sqrt and query, and we've obtained the desired results from a functional point of view. However, our index is quite large (400M documents) and the performance degrades heavily, given that function queries AFAIK match all the documents. I have two questions: 1) Apart from the two options I mentioned, is there any other (simple) way to achieve this that we're not aware of? 2) If we have to choose the function-queries path, would it be very difficult to modify the actual implementation so that it doesn't match all the documents, that is, to pass a query so that it only operates over the documents matching that query? Looking at the FunctionQuery.java source code, there's a comment that says // instead of matching all docs, we could also embed a query. the score could either ignore the subscore, or boost it, which is giving us some hope that maybe it's possible and even desirable to go in this direction. If you can give us some directions about how to go about this, we may be able to do the actual implementation. BTW, we're using Lucene/SOLR trunk. Thanks a lot for your help. Carlos [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
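One option the question doesn't list: the {!boost} query parser wraps a query and multiplies its score by a function value, and the function is evaluated only for documents that match the wrapped query, so it avoids the match-all cost of a bare function query. A sketch, assuming the stored float field is called query_score and the user query is passed via $qq:

    q={!boost b=sqrt(query_score) v=$qq}&qq=hotels in barcelona

Whether this composes the two scores exactly the way you want (it multiplies) is the main thing to check.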
Re: Problem indexing a PDF directory
On 16 February 2012 14:33, alessio crisantemi alessio.crisant...@gmail.com wrote: Hi all, I have a problem configuring PDF indexing from a directory in my Solr with DIH: [...] You should look in your Solr logs for more details about the exception, but as things stand, the above setup will not work for indexing PDF files. You need Tika. Searching Google for solr tika index pdf turns up many possibilities, e.g., http://www.abcseo.com/tech/search/integrating-solr-and-tika http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ Regards, Gora
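For readers landing here from a search: the usual shape of this setup is an outer FileListEntityProcessor that walks the directory and an inner TikaEntityProcessor that parses each file. A sketch along those lines — untested, reusing the paths and field names from the original post; the schema must actually define the target fields, and content is an assumed field name:

    <dataConfig>
      <dataSource type="BinFileDataSource" name="bin"/>
      <document>
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
                recursive="true" rootEntity="false" dataSource="null">
          <entity name="tika-test" processor="TikaEntityProcessor"
                  url="${files.fileAbsolutePath}" format="text" dataSource="bin">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>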
Re: How to loop through the DataImportHandler query results?
Hi Baranee, Some time ago I played with http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was pretty good stuff. Regards On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan baraneethara...@hp.com wrote: To avoid that, we don't want to name the columns in the field tags, but want to map all the fields in the table to Solr fields even when we don't know how many columns the table has. I need a kind of loop which runs through all the query results and maps them to Solr fields. -- Sincerely yours, Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
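A sketch of how a ScriptTransformer could do that column-agnostic mapping — not tried, and the _camp_s_t suffix is just one of the dynamic-field suffixes from Baranee's schema:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://demo22122011.com" user="" password="1234"/>
      <script><![CDATA[
        function mapAllColumns(row) {
          var cols = row.keySet().toArray();   // snapshot the keys first; we mutate the map below
          for (var i = 0; i < cols.length; i++) {
            var col = cols[i];
            row.put(col + '_camp_s_t', row.get(col));  // route every column to a dynamic field
          }
          return row;
        }
      ]]></script>
      <document name="sample">
        <entity name="adaptation" query="Select * from adap"
                transformer="script:mapAllColumns"/>
      </document>
    </dataConfig>

With this, adding or dropping columns in adap needs no data-config change; only columns that should map to a different type (int, long, multiValued) would need explicit handling.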
Re: Solr soft commit feature
The slaves will be able to replicate from the master as before, but not in NRT, depending on your commit interval. The commit interval can be set higher, as it is not needed for searches except for consolidating the index changes on the master, and can be an hour or even more. It may be easier to update the slaves directly, as the update/query performance is high (replication in the cloud in 4.0 also follows a similar paradigm, with the docs sent across as a whole to be replicated; so for now you may have to do this manually). - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 2/15/2012 8:35 AM, Dipti Srivastava wrote: Hi Nagendra, Certainly interesting! Would this work in a master/slave setup where the reads are from the slaves and all writes are to the master? Regards, Dipti Srivastava

On 2/15/12 5:40 AM, Nagendra Nagarajayya nagaraja...@transaxtions.com wrote: If you are looking for NRT functionality with Solr 3.5, you may want to take a look at Solr 3.5 with RankingAlgorithm. This allows you to add/update documents without a commit while being able to search concurrently. The add/update performance to add 1m docs is about 5000 docs in about 498 ms with one concurrent searcher. You can get more information about Solr 3.5 with RankingAlgorithm from here: http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x Regards, - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote: Hi All, Is there a way to soft commit in the current released version of Solr 3.5? Regards, Dipti Srivastava
Re: Can I rebuild an index and remove some fields?
I will test it with my big production indexes first; if it works I will port it to Java and add it to contrib, I think. On Wed, Feb 15, 2012 at 10:03 PM, Li Li fancye...@gmail.com wrote: Great. I think you could make it a public tool; maybe others also need such functionality. On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart bstewart...@gmail.com wrote: I implemented an index shrinker and it works. I reduced my test index from 6.6 GB to 3.6 GB by removing a single shingled field I did not need anymore. I'm actually using Lucene.Net for this project, so the code is C# using the Lucene.Net 2.9.2 API. But the basic idea is: Create an IndexReader wrapper that only enumerates the terms you want to keep, and that removes terms from documents when returning documents. Use the SegmentMerger to re-write each segment (where each segment is wrapped by the wrapper class), writing the new segment to a new directory. Collect the SegmentInfos and do a commit in order to create a new segments file in the new index directory. Done - you now have a shrunk index with the specified terms removed. The implementation uses a separate thread for each segment, so it re-writes them in parallel. Took about 15 minutes to do a 770,000 doc index on my macbook. On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote: I have roughly read the code of the 4.0 trunk; maybe it's feasible. SegmentMerger.add(IndexReader) will add the to-be-merged readers; merge() will call mergeTerms(segmentWriteState) and mergePerDoc(segmentWriteState); mergeTerms() will construct fields from the IndexReaders:

    for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
      final Fields f = r.reader.fields();
      final int maxDoc = r.reader.maxDoc();
      if (f != null) {
        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
        fields.add(f);
      }
      docBase += maxDoc;
    }

So if you wrap your IndexReader and override its fields() method, maybe it will work for merging terms. For DocValues, it can also override AtomicReader.docValues() - just return null for the fields you want to remove. Maybe it should traverse CompositeReader's getSequentialSubReaders() and wrap each AtomicReader. Other things like term vectors and norms are similar. On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com wrote: I was thinking if I make a wrapper class that aggregates another IndexReader and filters out the terms I don't want anymore it might work. And then pass that wrapper into SegmentMerger. I think if I filter out terms on GetFieldNames(...) and Terms(...) it might work. Something like:

    HashSet<string> ignoredTerms = ...;
    FilteringIndexReader wrapper = new FilteringIndexReader(reader);
    SegmentMerger merger = new SegmentMerger(writer);
    merger.add(wrapper);
    merger.Merge();

On Feb 14, 2012, at 1:49 AM, Li Li wrote: For method 2, delete is wrong - we can't delete terms. You also would have to hack the tii and tis files. On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote: Method 1, dumping data: for stored fields, you can traverse the whole index and save it somewhere else. For indexed but not stored fields, it may be more difficult. If the indexed, not-stored field is not analyzed (fields such as id), it's easy to get from FieldCache.StringIndex. But for analyzed fields, though theoretically they can be restored from term vectors and term positions, it's hard to recover from the index. Method 2, hacking the metadata: 1. indexed fields: delete by query, e.g. field:* 2. stored fields: because all fields are stored sequentially, it's not easy to delete some fields. This will not affect search speed, but if you want to get stored fields, and the useless fields are very long, then it will slow down. It's also possible to hack with it, but it needs more effort to understand the index file format and traverse the fdt/fdx files. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html will give you some insight. On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com wrote: Let's say I have a large index (100M docs, 1TB, split up between 10 indexes). And a bunch of the stored and indexed fields are not used in search at all. In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild the entire index with (don't have the full text anymore, etc.). Is there some way to rebuild or optimize an existing index with only a subset of the existing indexed fields? Or alternatively is there a way to avoid loading some indexed fields at all (to avoid loading term infos and
Payload and exact search - 2
Hello, I already posted this question but for some reason it was attached to a thread with a different topic. Is there the possibility of performing an 'exact search' in a payload field? I have to index text with auxiliary info for each word. In particular, each word is associated with the bounding box containing it in the original PDF page (it is used for highlighting the search terms in the PDF). I used the payload to store that information. In schema.xml, the fieldType definition is:

    <fieldtype name="wppayloads" stored="false" indexed="true" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
      </analyzer>
    </fieldtype>

while the field definition is:

    <field name="words" type="wppayloads" indexed="true" stored="true" required="true" multiValued="true"/>

When indexing, the field 'words' contains a list of word|box pairs, as in the following example:

    doc_id=example
    words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
           di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25}

Such a solution works well except in the case of an exact search. For example, assuming the only indexed doc is the 'example' doc shown above, the query words:"Comune di Bologna" returns no results. Does someone know if there is the possibility of performing an 'exact search' in a payload field? Thanks in advance, Leonardo
Re: Entity with multiple datasources
I think the problem here is that you are trying to create separate documents from two different tables, while your config is aiming to create only one document. Here is one solution (not tried by me) -- you can have multiple documents generated by the same data-config:

    <dataConfig>
      <dataSource name="ds1" .../>
      <dataSource name="ds2" .../>
      <dataSource name="ds3" .../>
      <document>
        <entity blah blah rootEntity="false">
          <entity blah blah/>  <!-- this is a document; entity sets unique id -->
        </entity>
      </document>
      <document>
        blah blah  <!-- this is another document; entity sets unique id -->
      </document>
    </dataConfig>

It's the rootEntity="false" that makes the child entity a document. -- Dmitry On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com wrote: [...] I've removed the connection params. The unique key is id. [...]
Re: Entity with multiple datasources
I'm not sure I follow. The idea is to have only one document. Do the multiple documents have the same structure then (different datasources), and if so, how are they actually indexed? Thanks. On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan dmitry@gmail.com wrote: I think the problem here is that you are trying to create separate documents from two different tables, while your config is aiming to create only one document. [...] It's the rootEntity="false" that makes the child entity a document. -- Dmitry [...]
Re: Problem indexing a PDF directory
Yes, but if I use TikaEntityProcessor the result of my full-import is:

    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">1</str>
    <str name="Total Documents Skipped">0</str>
    <str name="">Indexing failed. Rolled back all changes.</str>

2012/2/16 alessio crisantemi alessio.crisant...@gmail.com: Hi all, I have a problem configuring PDF indexing from a directory in my Solr with DIH: [...]
Re: Entity with multiple datasources
Each document in SOLR will correspond to one db record, and since both databases have the same schema, you can't index two records from two databases into the same SOLR document. So after indexing, you should have 7k different documents, each of which holds data from one db record. Also, one problem I see here is that since the record id in each table is unique only within the table and (most probably) not globally, there will be collisions. To avoid this, I would prepend the record id with some static value, like: concat('t1', CONVERT(id, CHAR(8))). Dmitry On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev radut...@gmail.com wrote: I'm not sure I follow. The idea is to have only one document. Do the multiple documents have the same structure then (different datasources), and if so, how are they actually indexed? Thanks. [...]
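A sketch of that id fix applied to the posted config — SQL Server flavored (string concatenation via +), with made-up db1_/db2_ prefixes and the long SELECT elided:

    <entity name="ms" dataSource="s"
            query="SELECT 'db1_' + CAST(m.id AS VARCHAR(16)) AS id, m.serial AS m_machine_serial, ... FROM Machine AS m ...">
    ...
    <entity name="mp" dataSource="p"
            query="SELECT 'db2_' + CAST(m.id AS VARCHAR(16)) AS id, m.serial AS m_machine_serial, ... FROM Machine AS m ...">

Two other details worth double-checking in the posted config: DIH's entity attribute is spelled dataSource (capital S), not datasource, and the delivery-date line reads filed where it presumably means field.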
Re: custom scoring
Hello Carlos, could you show us what your Solr call looks like? Regards, Em On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote: Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e., a float field containing the app-specific score). [...]
Re: Entity with multiple datasources
Really good point on the ids, I completely overlooked that matter. I will give it a try. Thanks again. On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan dmitry@gmail.com wrote: Each document in SOLR will correspond to one db record, and since both databases have the same schema, you can't index two records from two databases into the same SOLR document. So after indexing, you should have 7k different documents, each of which holds data from one db record. Also, one problem I see here is that since the record id in each table is unique only within the table and (most probably) not globally, there will be collisions. To avoid this, I would prepend the record id with some static value, like: concat('t1', CONVERT(id, CHAR(8))). Dmitry [...]
Re: Entity with multiple datasources
no problem, hope it helps, you're welcome. On Thu, Feb 16, 2012 at 5:03 PM, Radu Toev radut...@gmail.com wrote: Really good point on the ids, I completely overlooked that matter. I will give it a try. Thanks again. On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan dmitry@gmail.com wrote: [rest of quoted thread snipped; identical to the message above]
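For readers who hit the same duplicate-id collision: Dmitry's prefixing trick can also be done on the Solr side with DIH's TemplateTransformer instead of SQL concat. A minimal, untested sketch (connection details and most columns omitted; the db1-/db2- prefixes are made up for the example):

<dataConfig>
  <dataSource name="s" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="..." user="..." password="..."/>
  <dataSource name="p" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="..." user="..." password="..."/>
  <document>
    <!-- two root entities: each row from each database becomes its own document -->
    <entity name="ms" dataSource="s" transformer="TemplateTransformer" query="SELECT id, serial AS m_machine_serial FROM Machine">
      <!-- static prefix keeps ids from the two databases from colliding -->
      <field column="id" template="db1-${ms.id}"/>
      <field column="m_machine_serial"/>
    </entity>
    <entity name="mp" dataSource="p" transformer="TemplateTransformer" query="SELECT id, serial AS m_machine_serial FROM Machine">
      <field column="id" template="db2-${mp.id}"/>
      <field column="m_machine_serial"/>
    </entity>
  </document>
</dataConfig>

With that in place the index should end up with the expected ~7k documents, since no document from the second database can overwrite one from the first.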
Frequent garbage collections after a day of operation
Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3x the average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute-force solution. The thing is, we don't know what's causing this, and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to SOLR involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far. Thanks in advance, -Matthias -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com
RE: PatternReplaceFilterFactory group
Hi O., PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(), both of which take in a string that can include any or all groups using the syntax $n, where n is the group number. See the Matcher.appendReplacement() javadocs for an explanation of the functionality and syntax: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement%28java.lang.StringBuffer,%20java.lang.String%29 Steve -----Original Message----- From: O. Klein [mailto:kl...@octoweb.nl] Sent: Thursday, February 16, 2012 8:34 AM To: solr-user@lucene.apache.org Subject: PatternReplaceFilterFactory group PatternReplaceFilterFactory has no option to select the group to replace. Is there a reason for this, or could this be a nice feature?
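To make the $n syntax concrete, here is a sketch of a filter that keeps only the first capture group; the pattern and the field type name are invented for illustration:

<fieldType name="text_grouped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "abc-123" becomes "abc": $1 refers to the first parenthesized group -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\w+)-(\d+)" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

Setting replacement="$2" would keep the digits instead, so the group to keep is selected through the replacement string rather than a dedicated option.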
Re: custom scoring
Hello Em: The URL is quite large (w/ shards, ...), maybe it's best if I paste the relevant parts. Our q parameter is: q=_val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))" The subqueries q8, q7, q4 and q3 are regular queries, for example: q7=stopword_phrase:colomba~1 AND stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR (stopword_phrase:las AND stopword_phrase:de) We've executed the subqueries q3-q8 independently and they're very fast, but when we introduce the function queries as described below, it all goes 10X slower. Let me know if you need anything else. Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 4:02 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, could you show us how your Solr call looks? Regards, Em Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas: Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e. a float field containing the app-specific score). In particular, we'd like to calculate the final score doing some operations with both numbers (i.e. product, sqrt, ...) According to what we know, there are two ways to do this in SOLR: A) Sort by function [1]: We've tested an expression like sort=product(score, query_score) in the SOLR query, where score is the common SOLR IR score and query_score is our own precalculated score, but it seems that SOLR can only do this with stored/indexed fields (and obviously score is not stored/indexed). B) Function queries: We've used _val_ and function queries like max, sqrt and query, and we've obtained the desired results from a functional point of view. However, our index is quite large (400M documents) and the performance degrades heavily, given that function queries are AFAIK matching all the documents. I have two questions: 1) Apart from the two options I mentioned, is there any other (simple) way to achieve this that we're not aware of? 2) If we have to choose the function queries path, would it be very difficult to modify the actual implementation so that it doesn't match all the documents, that is, to pass a query so that it only operates over the documents matching the query? Looking at the FunctionQuery.java source code, there's a comment that says // instead of matching all docs, we could also embed a query. the score could either ignore the subscore, or boost it, which is giving us some hope that maybe it's possible and even desirable to go in this direction. If you can give us some directions about how to go about this, we may be able to do the actual implementation. BTW, we're using Lucene/SOLR trunk. Thanks a lot for your help. Carlos [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
Re: How to loop through the DataImportHandler query results?
If your script turns out too complex to maintain, and you are developing in Java anyway, you could extend EntityProcessor and handle the data in a custom way. I've done that to transform a datamart-like data structure back into a row-based one. Basically you override the method that gets the data in a Map and transform it into a different Map which contains the fields as understood by your schema. Chantal On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote: Hi Baranee, Some time ago I played with http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was pretty good stuff. Regards On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan baraneethara...@hp.com wrote: To avoid that, we don't want to mention the column names in the field tag, but want to write a query to map all the fields in the table to Solr fields even if we don't know how many columns are there in the table. I need a kind of loop which runs through all the query results and maps them to Solr fields.
Re: Frequent garbage collections after a day of operation
Make sure your Tomcat instances are started each with a max heap size that adds up to something a lot lower than the complete RAM of your system. Frequent garbage collection means that your applications request more RAM but your Java VM has no more resources, so it requires the garbage collector to free memory so that the requested new objects can be created. It's not indicating a memory leak, unless you are running a custom EntityProcessor in DIH that runs into an infinite loop and creates huge amounts of schema fields. ;-) Also, if you are doing hot deploys on Tomcat, you will have to restart the Tomcat instance on a regular basis, as hot deploys DO leak memory after a while. (You might be seeing class undeploy messages in catalina.out and later on OutOfMemory error messages.) If this is not of any help, you will probably have to provide a bit more information on your Tomcat and SOLR configuration setup. Chantal On Thu, 2012-02-16 at 16:22 +0100, Matthias Käppler wrote: Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3x the average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute-force solution. [rest of quoted message snipped]
RE: PatternReplaceFilterFactory group
steve_rowe wrote: [quoted reply snipped; see Steve's message above] Thanks. I should get it working then.
Re: is it possible to run deltaimport command without delta query?
On 2/15/2012 11:26 PM, nagarjuna wrote: hi all.. i am new to solr. can anybody explain to me the delta-import and delta query? also i have the below questions: 1. is it possible to run deltaimport without deltaQuery? 2. is it possible to write a delta query without having a last_modified column in the database? if yes, please explain Assuming I understand what you're asking: Define deltaImportQuery to be the same as query, then set deltaQuery to something that always returns some kind of value in the field you have designated as your primary key. The data doesn't have to be relevant to anything at all, it just needs to return something for the primary key field. Here's what I have in mine, my pk is did: deltaQuery="SELECT 1 AS did" If you wish, you can completely ignore lastModified and track your own information about what data is new, then pass parameters via the dataimport handler URL to be used in your queries. This is what both my query and deltaImportQuery are set to: SELECT * FROM ${dataimporter.request.dataView} WHERE ( ( did > ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid} ) ${dataimporter.request.extraWhere} ) AND (crc32(did) % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal}) Thanks, Shawn
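Reduced to its essentials, the arrangement Shawn describes looks roughly like this in data-config.xml (table and request parameter names are illustrative, not a drop-in config):

<entity name="item" pk="did"
        query="SELECT * FROM ${dataimporter.request.dataView} WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid}"
        deltaQuery="SELECT 1 AS did"
        deltaImportQuery="SELECT * FROM ${dataimporter.request.dataView} WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid}"/>

It would then be invoked with something like /solr/dataimport?command=delta-import&dataView=itemView&minDid=1000&maxDid=2000, so the delta is driven entirely by the request parameters rather than by a last_modified column.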
Re: problem to indexing pdf directory
here the log:

org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
feb 12, 2012 7:06:02 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={clean=false&commit=true&command=full-import&qt=/dataimport} status=0 QTime=16
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
[identical stack trace and rollback as above snipped]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler [http-bio-8983]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler [ajp-bio-8009]
feb 12, 2012 7:06:42 PM org.apache.catalina.core.StandardService stopInternal
INFO: Stopping service Catalina
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore close
INFO: [] CLOSING SolrCore org.apache.solr.core.SolrCore@7d1217
[remaining Tomcat shutdown log with searcher close and cache statistics snipped]
Re: problem to indexing pdf directory
On 16 February 2012 21:37, alessio crisantemi alessio.crisant...@gmail.com wrote: here the log: org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1 [...] The exception message above is pretty clear. You need to define a baseDir attribute for the second entity. However, even if you fix this, the setup will *not* work for indexing PDFs. Did you read the URLs that I sent earlier? Regards, Gora
Re: Setting solrj server connection timeout
On 2/3/2012 1:12 PM, Shawn Heisey wrote: Is the following a reasonable approach to setting a connection timeout with SolrJ? queryCore.getHttpClient().getHttpConnectionManager().getParams().setConnectionTimeout(15000); Right now I have all my solr server objects sharing a single HttpClient that gets created using the multithreaded connection manager, where I set the timeout for all of them. Now I will be letting each server object create its own HttpClient object, and using the above statement to set the timeout on each one individually. It'll use up a bunch more memory, as there are 56 server objects, but maybe it'll work better. The total of 56 objects comes about from 7 shards, a build core and a live core per shard, two complete index chains, and for each of those, one server object for access to CoreAdmin and another for the index. The impetus for this, as it's possible I'm stating an XY problem: currently I have an occasional problem where SolrJ connections throw an exception. When it happens, nothing is logged in Solr. My code is smart enough to notice the problem, send an email alert, and simply try again at the top of the next minute. The simple explanation is that this is a Linux networking problem, but I never had any problem like this when I was using Perl with LWP to keep my index up to date. I sent a message to the list some time ago on this exception, but I never got a response that helped me figure it out.

Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
... 3 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
... 7 more

No response in quite some time, so I'm bringing it up again. I brought up the Exception issue before, and though I did get some responses, I didn't feel that I got an answer.
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201112.mbox/%3c4eeaf6e5.9030...@elyograg.org%3E Thanks, Shawn
Re: problem to indexing pdf directory
Yes, I read it. But I don't know the cause. What's more, I work on Windows, so I configured Tika and Solr manually because I don't have Maven... 2012/2/16 Gora Mohanty g...@mimirtech.com On 16 February 2012 21:37, alessio crisantemi alessio.crisant...@gmail.com wrote: here the log: org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1 [...] The exception message above is pretty clear. You need to define a baseDir attribute for the second entity. However, even if you fix this, the setup will *not* work for indexing PDFs. Did you read the URLs that I sent earlier? Regards, Gora
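For reference, the shape of a config that usually works for this job pairs FileListEntityProcessor (which only lists files) with a nested TikaEntityProcessor (from the dataimporthandler-extras contrib) that actually parses each PDF. A sketch only, with the poster's paths kept and everything else to be checked against your Solr/Tika version:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- the inner entity receives each file path and extracts text and metadata via Tika -->
      <entity name="doc" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The key differences from the failing config: the inner entity uses TikaEntityProcessor rather than a second FileListEntityProcessor, it points at the outer entity's ${files.fileAbsolutePath} rather than a directory, and only the outer entity needs baseDir.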
Re: Do we need reindexing from solr 1.4.1 to 3.5.0?
There may be issues with your solrconfig. Kindly post the exception that you are receiving.
RE: is it possible to run deltaimport command without delta query?
There is a good example on the wiki of how to do a delta update using command=full-import&clean=false, here: http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta This can be advantageous if you are updating a ton of data at once and do not want it executing as many queries against the database. It also can be easier to maintain just one set of queries for both full and delta imports. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -----Original Message----- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Thursday, February 16, 2012 10:04 AM To: solr-user@lucene.apache.org Subject: Re: is it possible to run deltaimport command without delta query? [quoted reply snipped; see Shawn's message above]
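Putting the two replies together, a full import that behaves like a delta would be triggered with a request along these lines (host and parameter values hypothetical):

http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true&minDid=1000&maxDid=2000

clean=false keeps the existing documents instead of wiping the index, and the extra parameters feed the ${dataimporter.request.*} placeholders in queries like Shawn's earlier in the thread.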
Re: Best requestHandler for typing error.
You can enable the spellcheck component and add it to your default request handler. This might be of use: http://wiki.apache.org/solr/SpellCheckComponent It can be used both during autosuggest and for did you mean.
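A minimal solrconfig.xml sketch of that setup for the did-you-mean case; the component wiring follows the wiki page, but the field name spell is an assumption that must exist in your schema:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- field the dictionary is built from -->
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

A query for goolge would then come back with google as a suggestion (assuming it occurs in the spell field), and spellcheck.collate produces a ready-to-run corrected query.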
Solr edismax clarification
Hi All, I am using the edismax SearchHandler in my search and I have some issues with the search results. As I understand it, if the defaultOperator is set to OR, the search query The quick brown fox will be passed implicitly as The OR quick OR brown OR fox. However, if I search for The quick brown fox, I get fewer results than when explicitly adding the OR. Another issue is that if I search for The quick brown fox, other documents that contain the word fox are not in the search results. Thanks.
copyField: multivalued field to joined singlevalue field
Hello, I want to copy all data from a multivalued field, joined together, into a single-valued field. Is there any way to do this using standard Solr features? kind regards
Re: copyField: multivalued field to joined singlevalue field
On Thu, Feb 16, 2012 at 11:35 AM, flyingeagle-de flyingeagle...@yahoo.de wrote: Hello, I want to copy all data from a multivalued field joined together in a single valued field. Is there any opportunity to do this by using solr-standards? There is not currently, but it certainly makes sense. Anyone know of an open issue for this yet? If not, we should create one! -Yonik lucidimagination.com
Distributed Faceting Bug?
I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 When doing this I get the following exception:

null java.lang.ArrayIndexOutOfBoundsException request: http://hostname:8983/solr/select
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

if I play with some of the parameters the query works as expected, i.e. q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=0 or q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=count rows=10 or q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.sort=index rows=10 I am running on an old snapshot of Solr, but will try this on a new version relatively soon. Unfortunately I can't duplicate locally so I'm a bit baffled by the error. All of the shards have the field which we are faceting on.
Re: custom scoring
Hello Carlos, well, you must take into account that you are executing up to 8 queries per request instead of one query per request. I am not totally sure about the details of the implementation of the max function query, but I guess it first iterates over the results of the first max-query, afterwards over the results of the second max-query, and so on. This is a much higher complexity than in the case of a normal query. I would suggest optimizing your request. I don't think that this particular function query is matching *all* docs. Instead I think it just matches those docs specified by your inner query (although I might be wrong about that). What are you trying to achieve with your request? Regards, Em Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas: Hello Em: The URL is quite large (w/ shards, ...), maybe it's best if I paste the relevant parts. Our q parameter is: q=_val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))" The subqueries q8, q7, q4 and q3 are regular queries, for example: q7=stopword_phrase:colomba~1 AND stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR (stopword_phrase:las AND stopword_phrase:de) We've executed the subqueries q3-q8 independently and they're very fast, but when we introduce the function queries as described below, it all goes 10X slower. Let me know if you need anything else. Thanks Carlos [rest of quoted thread snipped]
Re: Distributed Faceting Bug?
please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote: [original message quoted in full, snipped]
Re: Distributed Faceting Bug?
Hi Jamie, what version of Solr/SolrJ are you using? Regards, Em Am 16.02.2012 18:42, schrieb Jamie Johnson: I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 [rest of quoted message snipped]
Re: Distributed Faceting Bug?
Hi Jamie, nice to hear. Maybe you can share what kind of bug you ran into, so that other developers with similarly buggy components can benefit from your experience. :) Regards, Em Am 16.02.2012 19:23, schrieb Jamie Johnson: please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. [rest of quoted thread snipped]
Re: How to loop through the DataImportHandler query results?
Chantal, if you prefer Java, here is http://wiki.apache.org/solr/DIHCustomTransform On Thu, Feb 16, 2012 at 7:24 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: If your script turns out too complex to maintain, and you are developing in Java anyway, you could extend EntityProcessor and handle the data in a custom way. I've done that to transform a datamart-like data structure back into a row-based one. Basically you override the method that gets the data in a Map and transform it into a different Map which contains the fields as understood by your schema. Chantal [earlier quoted thread snipped] -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
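For the looping part of the original question, the ScriptTransformer route looks roughly like this: the function receives each row as a java.util.Map, so the columns can be iterated without being declared. A sketch under the assumption that a catch-all dynamicField exists in schema.xml to receive the generated names:

<dataConfig>
  <script><![CDATA[
    function mapAllColumns(row) {
      // snapshot the column names, then rewrite each one to a lowercased field name
      var keys = row.keySet().toArray();
      for (var i = 0; i < keys.length; i++) {
        row.put(keys[i].toLowerCase(), row.get(keys[i]));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="t" query="SELECT * FROM some_table" transformer="script:mapAllColumns"/>
  </document>
</dataConfig>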
Re: custom scoring
Hello Em: Thanks for your answer. Yes, we initially also thought that the excessive increase in response time was caused by the several queries being executed, and we did another test. We executed one of the subqueries that I've shown to you directly in the q parameter, and then we tested this same subquery (only this one, without the others) with the function query query($q1) in the q parameter. Theoretically the times for these two queries should be more or less the same, but the second one is several times slower than the first one. After this observation we learned more about function queries, and we learned from the code and from some comments in the forums [1] that FunctionQueries are expected to match all documents. We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good: we're getting very good response times and high-quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), the times move from 10-20 msec to 200-300 msec. Note that we're using early termination of queries (via a custom collector), and therefore (as shown by the numbers I included above) even if the query is very complex, we're getting very fast answers. The only situation where the response time explodes is when we include a FunctionQuery. Re: your question of what we're trying to achieve: we're implementing a powerful query autocomplete system, and we use several fields to a) improve performance on wildcard queries and b) have very precise control over the score. Thanks a lot for your help, Carlos [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0 Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 7:09 PM, Em mailformailingli...@yahoo.de wrote: [quoted reply snipped]
Re: custom scoring
Hello Carlos, We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good, we're getting very good response times and high quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), then the times move from 10-20 msec to 200-300 msec. I reviewed the sourcecode and yes, the FunctionQuery iterates over the whole index, however... let's see! In relation to the DisMaxQuery you create within your parser: What kind of clause is the FunctionQuery and what kind of clause are your other queries (MUST, SHOULD, MUST_NOT...)? *I* would expect that with a shrinking set of matching documents to the overall query, the function query only checks those documents that are guaranteed to be within the result set. Note that we're using early termination of queries (via a custom collector), and therefore (as shown by the numbers I included above) even if the query is very complex, we're getting very fast answers. The only situation where the response time explodes is when we include a FunctionQuery. Could you give us some details about how/where you plugged in the Collector, please? Kind regards, Em Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas: [quoted message snipped]
Re: custom scoring
Hello Em: 1) Here's a printout of an example DisMax query (as you can see mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords) * * *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short ened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw ord_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))* * * 2)* *The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the SOLR interface by passing the timeAllowed parameter. We know this is a hack but AFAIK there's no out-of-the-box way to specify custom collectors by now ( https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem. 3) Re: your sentence: * * **I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set.* * * Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise: // instead of matching all docs, we could also embed a query. // the score could either ignore the subscore, or boost it. // Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f) // Boost:foo:myTerm^floatline(myFloatField,1.0,0.0f) @Override public int nextDoc() throws IOException { for(;;) { ++doc; if (doc=maxDoc) { return doc=NO_MORE_DOCS; } if (acceptDocs != null !acceptDocs.get(doc)) continue; return doc; } } It seems that the author also thought of maybe embedding a query in order to restrict matches, but this doesn't seem to be in place as of now (or maybe I'm not understanding how the whole thing works :) ). Thanks Carlos * * Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good, we're getting very good response times and high quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), then the times move from 10-20 msec to 200-300msec. I reviewed the sourcecode and yes, the FunctionQuery iterates over the whole index, however... let's see! In relation to the DisMaxQuery you create within your parser: What kind of clause is the FunctionQuery and what kind of clause are your other queries (MUST, SHOULD, MUST_NOT...)? 
*I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set. Note that we're using early termination of queries (via a custom collector), and therefore (as shown by the numbers I included above) even if the query is very complex, we're getting very fast answers. The only situation where the response time explodes is when we include a FunctionQuery. Could you give us some details about how/where you plugged in the Collector, please? Kind regards, Em Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas: Hello Em: Thanks for your answer. Yes, we initially also thought that the excessive increase in response time was caused by the several queries being executed, and we did another test. We executed one of the subqueries that I've shown to you directly in the q parameter, and then we tested this same subquery (only this one, without the others) with the function query query($q1) in the q parameter. Theoretically the times for these two queries should be more or less the same, but the second one is several times slower than the first one. After this observation we learned more about function queries, and we learned from the code and from some comments in the forums [1] that FunctionQueries are expected to match all documents. We
Re: copyField: multivalued field to joined singlevalue field
: I want to copy all data from a multivalued field joined together in a single : valued field. : : Is there any opportunity to do this by using solr-standards? : : There is not currently, but it certainly makes sense. Part of it has just recently been committed to trunk, actually... https://issues.apache.org/jira/browse/SOLR-2802 https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html ...with that, it's easy to say "anytime multiple values are found for a single-valued string field, join them together with a comma." The only piece that's missing is to copy from a source field in an (earlier) UpdateProcessor. There's a patch for this in SOLR-2599 but I haven't had a chance to look at it yet. -Hoss
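For illustration, a minimal sketch of how that trunk-only processor might be wired up in solrconfig.xml (the chain name and target field are placeholders; the delimiter param is the one documented in the javadoc linked above):

<updateRequestProcessorChain name="concat-single-valued">
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">description</str>
    <str name="delimiter">, </str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As Hoss notes, the copy-from-a-source-field step would still have to happen in an earlier processor once SOLR-2599 lands.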
Re: Specify a cores roles through core add command
https://issues.apache.org/jira/browse/SOLR-3138 On Feb 9, 2012, at 4:16 PM, Jamie Johnson wrote: Per SOLR-2765 we can add roles to specific cores such that it's possible to give custom roles to Solr instances. Is it possible to specify this when adding a core through curl 'http://host:port/solr/admin/cores...'? https://issues.apache.org/jira/browse/SOLR-2765 - Mark Miller lucidimagination.com
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 3:37 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Everybody starts with a daily bounce, but ends up with an UPDATED_AT column and delta updates; just consider the urgent content fix use case. I don't think it's worth relying on a daily bounce as a cornerstone of the architecture. I'd be happy to avoid it, for all the obvious reasons. I do know that performance of this type of service tends to be not that great (as in 700 to 5000 msec), and there should be ways to do it several times faster than this. you can use a grid of coordinates to reduce their entropy I don't understand this statement. Can you elaborate, please? Since my bounding boxes are small, one [premature optimization] idea could be to divide Earth into 2x2 degree overlapping tiles at a 1 degree step in both directions (such that any bounding box fits within at least one of them, and any location belongs to 4 of them), then use tileId=X as a cached filter and geofilt as a post-filter. Is that along the lines of what you are talking about? http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/ from what little I understand about Lucene internals, caching of filters probably doesn't make sense either. But Solr does it: http://wiki.apache.org/solr/SolrCaching#filterCache I didn't realize that multiple fq's in the same query were applied in parallel as set intersections. In that case, the non-geography filters should be cached (and added to the prewarming routine, I guess) even when they are usually far less specific than the bounding box. Makes sense. 1. Search server is an internal service that uses embedded Solr for the indexing part. RAMDirectoryFactory as index storage. Bad idea. It's intended mostly for tests; the closest production analogue is org.apache.lucene.store.instantiated.InstantiatedIndex ... AFAIK the state of the art is to use a file-based directory (MMap or whatever) and rely on the Linux file system cache. OK, I may as well start the spike from this angle, too. By the way, this is precisely the kind of advice I was hoping for. Thanks a lot. 5. All Solr caching is switched off. But why? Because (a) I shouldn't need to cache documents if they are all in memory anyway; (b) query caching will have an abysmal hit/miss ratio because of the spatial component; and (c) I misunderstood how query filters work. So now I'm thinking of a FastLFU query filter cache for non-geo filters. Btw, if you need a multivalued geo field, please vote for SOLR-2155 Our data has one lon/lat pair per entity... so no, I don't need it. Or at least haven't figured out that I do yet. :) -- Alexey Verkhovsky http://alex-verkhovsky.blogspot.com/ CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]
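For concreteness, the tiling idea above might translate into filters like these (field names are hypothetical; the cache/cost post-filter syntax is the one from the blog post linked above):

fq=tileId:41_2
fq={!geofilt sfield=coords pt=41.39,2.16 d=5 cache=false cost=200}

The tileId filter is cheap and cacheable, while cache=false with cost=200 makes geofilt run as a post-filter over only the documents that survive the other clauses.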
Re: Distributed Faceting Bug?
still digging ;) Once I figure it out I'll be happy to share. On Thu, Feb 16, 2012 at 1:32 PM, Em mailformailingli...@yahoo.de wrote: Hi Jamie, nice to hear. Maybe you can share what kind of bug you ran into, so that other developers with similar buggy components can benefit from your experience. :) Regards, Em Am 16.02.2012 19:23, schrieb Jamie Johnson: please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote: I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 When doing this I get the following exception:

null java.lang.ArrayIndexOutOfBoundsException request: http://hostname:8983/solr/select
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

If I play with some of the parameters the query works as expected, i.e.:
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=0
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=count rows=10
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.sort=index rows=10
I am running on an old snapshot of Solr, but will try this on a new version relatively soon. Unfortunately I can't duplicate it locally, so I'm a bit baffled by the error. All of the shards have the field which we are faceting on
Re: SolrCloud - issues running with embedded zookeeper ensemble
I have the same problem. It seems that there is a bug in the SolrZkServer class (parseProperties method) that doesn't work well when you have an external zookeeper ensemble. Thanks, arin
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 3:03 PM, Alexey Verkhovsky alexey.verkhov...@gmail.com wrote: 5. All Solr caching is switched off. But why? Because (a) I shouldn't need to cache documents, if they are all in memory anyway; You're making many assumptions about how Solr works internally. One example of many: Solr streams documents (requests the stored fields right before they are written to the response stream) to support returning any number of documents. If you highlight documents, the stored fields need to be retrieved first. When streaming those same documents later, Solr will retrieve the stored fields again, relying on the fact that they should be cached by the document cache since they were just used. There are tons of examples of how things are architected to take advantage of the caches - it pretty much never makes sense to outright disable them. If they take up too much memory, then just reduce the size. -Yonik lucidimagination.com
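In solrconfig.xml that reduction is a one-line change; a sketch with illustrative sizes (a few hundred entries is enough to avoid re-fetching stored fields within a single request, per the streaming behavior described above):

<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>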
Re: custom scoring
: We'd like to score the matching documents using a combination of SOLR's IR : score with another application-specific score that we store within the : documents themselves (i.e. a float field containing the app-specific : score). In particular, we'd like to calculate the final score doing some : operations with both numbers (i.e. product, sqrt, ...) let's back up a minute. if your ultimate goal is to have the final score of all documents be a simple multiplication of an indexed field (query_score) against the score of your base query, that's a fairly trivial use of the BoostQParser... q={!boost f=query_score}your base query ...or to split it out using param dereferencing... q={!boost f=query_score v=$qq}&qq=your base query : A) Sort by function [1]: We've tested an expression like : sort=product(score, query_score) in the SOLR query, where score is the : common SOLR IR score and query_score is our own precalculated score, but it : seems that SOLR can only do this with stored/indexed fields (and obviously : score is not stored/indexed). you could do this by replacing score with the query whose score you want, which could be a ref back to $q -- but that's really only needed if you want the scores returned for each document to be different than the value used for sorting (ie: score comes from solr, sort value includes your query_score and the score from the main query -- or some completely different query). based on what you've said, you don't need that and it would be unnecessary overhead. : B) Function queries: We've used _val_ and function queries like max, sqrt : and query, and we've obtained the desired results from a functional point : of view. However, our index is quite large (400M documents) and the : performance degrades heavily, given that function queries are AFAIK : matching all the documents. based on the examples you've given in your subsequent queries, it's not hard to see why... q=_val_:"product(query_score,max(query($q8),max(query($q7),... wrapping queries in functions in queries can have that effect, because functions ultimately match all documents -- even when that function wraps a query -- so your outermost query is still scoring every document in the index. you want to do as much pruning with the query as possible, and only multiply by your boost function on matching docs, hence the purpose of the BoostQParser. -Hoss
Re: Distributed Faceting Bug?
The issue appears to be that I put an empty array into the doc scores instead of null in DocSlice. DocSlice then just checks whether scores is null when hasScores is called, which caused a further issue down the line. I'll follow up with anything else that I find along the way. On Thu, Feb 16, 2012 at 3:05 PM, Jamie Johnson jej2...@gmail.com wrote: still digging ;) Once I figure it out I'll be happy to share. On Thu, Feb 16, 2012 at 1:32 PM, Em mailformailingli...@yahoo.de wrote: Hi Jamie, nice to hear. Maybe you can share what kind of bug you ran into, so that other developers with similar buggy components can benefit from your experience. :) Regards, Em Am 16.02.2012 19:23, schrieb Jamie Johnson: please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote: I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 When doing this I get the following exception:

null java.lang.ArrayIndexOutOfBoundsException request: http://hostname:8983/solr/select
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

If I play with some of the parameters the query works as expected, i.e.:
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=0
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=count rows=10
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.sort=index rows=10
I am running on an old snapshot of Solr, but will try this on a new version relatively soon. Unfortunately I can't duplicate it locally, so I'm a bit baffled by the error. All of the shards have the field which we are faceting on
Re: custom scoring
On Thu, Feb 16, 2012 at 8:34 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e. a float field containing the app-specific score). In particular, we'd like to calculate the final score doing some operations with both numbers (i.e. product, sqrt, ...) ... 1) Apart from the two options I mentioned, is there any other (simple) way to achieve this that we're not aware of? In general there is always a third option, that may or may not fit, depending really upon how you are trying to model relevance and how you want to integrate with scoring, and that's to tie your factors directly into Similarity (Lucene's term weighting API). For example, some people use index-time boosting, but in Lucene an index-time boost really just means 'make the document appear shorter'. You might, for example, have other boosts that modify term frequency before normalization, or however you want to do it. Similarity is pluggable into Solr via schema.xml. Since you are using trunk, this is a lot more flexible than in previous releases; e.g. you can access things from FieldCache, DocValues, or even your own rapidly-changing float[] or whatever you want :) There are also a lot more predefined models than just the vector space model to work with, if you find you can easily imagine your notion of relevance in terms of an existing model. -- lucidimagination.com
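For reference, plugging a custom Similarity into Solr is a single element at the top level of schema.xml (the class name here is a made-up placeholder):

<similarity class="com.example.MyAppSimilarity"/>

The class itself would extend DefaultSimilarity (or, on trunk, one of the other scoring models Robert mentions) and override the factors into which the application score should be folded.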
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 1:32 PM, Yonik Seeley yo...@lucidimagination.comwrote: Your're making many assumptions about how Solr works internally. True that. If this spike turns into a project, digging through the source code will come. Meantime, we have to start somewhere, and the default configuration may not be the greatest starting point for this problem. We don't need highlighting, and only need ids, scores and total number of results out of Solr. Presentation of selected entities will have to include some write-heavy data (from RDBMS and/or memcached), therefore won't be Solr's business anyway. From what you said, I guess it won't hurt to give it a small document cache, just big enough to prevent streaming the same document twice within the same query. Still don't have a reason to have a query cache - because of lon/lat coming from the mobile devices, there are virtually no repeated queries in our production logs. Or am I making a bad assumption here, too? -- Alexey Verkhovsky http://alex-verkhovsky.blogspot.com/ CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 4:06 PM, Alexey Verkhovsky alexey.verkhov...@gmail.com wrote: only need ids, scores and total number of results out of Solr. Presentation of selected entities will have to include some write-heavy data (from RDBMS and/or memcached), therefore won't be Solr's business anyway. It depends on whether you're going to be doing distributed search - there may be some scenarios there where it's used, but in general the query cache is the least useful. The filterCache is useful in a ton of ways if you're doing faceting too. -Yonik lucidimagination.com
Re: SolrCloud - issues running with embedded zookeeper ensemble
On Feb 16, 2012, at 2:53 PM, arin g wrote: i have the same problem, it seems that there is a bug in SolrZkServer class (parseProperties method), that doesn't work well when you have an external zookeeper ensemble. This issue was around using an embedded ensemble - an external ensemble makes SolrZkServer irrelevant. What issue are you having? I just tried a basic test against an external ensemble. What version are you using? Thanks, arin - Mark Miller lucidimagination.com
Re: custom scoring
Hello Carlos, I think we misunderstood each other. As an example:

BooleanQuery (
  clauses: (
    MustMatch(
      DisjunctionMaxQuery(
        TermQuery(stopword_field, barcelona),
        TermQuery(stopword_field, hoteles)
      )
    ),
    ShouldMatch(
      FunctionQuery( *please insert your function here* )
    )
  )
)

Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. However, the interesting part of the given example is that we specify the user's query as a MustMatch condition of the BooleanQuery and the FunctionQuery just as a ShouldMatch. Constructed that way, I expect the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect that the scorer will use it to score only matching documents (however, I did not search where and how it might get called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all function query. Can you validate whether your QueryParser constructs a query in the form I drew above? Regards, Em Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas: Hello Em: 1) Here's a printout of an example DisMax query (as you can see, mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords): ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)) 2) The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the SOLR interface by passing the timeAllowed parameter. We know this is a hack, but AFAIK there's no out-of-the-box way to specify custom collectors for now (https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem. 3) Re: your sentence: "*I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set." Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
// Boost: foo:myTerm^floatline(myFloatField,1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for (;;) {
    ++doc;
    if (doc >= maxDoc) {
      return doc = NO_MORE_DOCS;
    }
    if (acceptDocs != null && !acceptDocs.get(doc)) continue;
    return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order to restrict matches, but this doesn't seem to be in place as of now (or maybe I'm not understanding how the whole thing works :) ). Thanks, Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (which internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good; we're getting very good response times and high quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), the times move from 10-20 msec to 200-300 msec. I reviewed the
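In code, the query shape Em describes might be built roughly like this (a sketch against the Lucene 3.x-era API; the field names and the query_score value source are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.FloatFieldSource;
import org.apache.lucene.search.function.FunctionQuery;

public class MustPlusFunctionExample {
    public static Query build() {
        // The user's query is the MUST clause, so it prunes the candidate set;
        // the FunctionQuery is only a SHOULD clause contributing to the score.
        BooleanQuery bq = new BooleanQuery();
        DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
        userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
        userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));
        bq.add(userQuery, Occur.MUST);
        bq.add(new FunctionQuery(new FloatFieldSource("query_score")), Occur.SHOULD);
        return bq;
    }
}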
files left open?
Hi all, I was loading a big (60 million docs) CSV in Solr 4 when something odd happened. I got a Solr error in the log saying that it could not write the file. du -s indicated I had used 30GB of the 50GB available, but df -k indicated that the disk was 100% used. du and df giving different results could be an indication that there are file descriptors left open. After a Solr bounce, df -k came down and agreed with du. Has anyone seen anything like that? Thanks, Paulo. Environment: Linux 2.6.18-238.19.1.el5.centos.plus #1 SMP Mon Jul 18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode) apache-solr-4.0-2012-02-10_09-58-50 Solr config is the one in the distribution package; I had my own schema.
Re: custom scoring
I just modified some TestCases a little bit to see how the FunctionQuery behaves. Given an index containing 14 docs, where 13 of them contain the term batman and two contain the term superman, a search for q=+text:superman _val_:query($qq)&qq=text:superman leads to two hits, and the FunctionQuery has two iterations. If you remove that little plus symbol before text:superman, it isn't a mustMatch condition anymore and the whole query results in 14 hits (the default operator is OR): q=text:superman _val_:query($qq)&qq=text:superman If both queries, the TermQuery and the FunctionQuery, must match, it also results in two hits: q=text:superman AND _val_:query($qq)&qq=text:superman There is some behaviour that I currently don't understand (if 14 docs match, the FunctionQuery's AllScorer re-iterates twice over the 0th and the 1st doc, and the reason for that seems to be the construction of two AllScorers), but as far as I can see the performance of your queries *should* increase if you construct your query as I explained in my last eMail. Kind regards, Em Am 16.02.2012 23:43, schrieb Em: Hello Carlos, I think we misunderstood each other. As an example: BooleanQuery ( clauses: ( MustMatch( DisjunctionMaxQuery( TermQuery(stopword_field, barcelona), TermQuery(stopword_field, hoteles) ) ), ShouldMatch( FunctionQuery( *please insert your function here* ) ) ) ) Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. However, the interesting part of the given example is that we specify the user's query as a MustMatch condition of the BooleanQuery and the FunctionQuery just as a ShouldMatch. Constructed that way, I expect the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect that the scorer will use it to score only matching documents (however, I did not search where and how it might get called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all function query. Can you validate whether your QueryParser constructs a query in the form I drew above? Regards, Em Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas: Hello Em: 1) Here's a printout of an example DisMax query (as you can see, mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords): ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)) 2) The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the SOLR interface by passing the timeAllowed parameter. We know this is a hack, but AFAIK there's no out-of-the-box way to specify custom collectors for now (https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem. 3) Re: your sentence: "*I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set." Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise: // instead of matching all docs, we could also embed a query. // the score could either ignore the subscore, or boost it. // Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f) // Boost: foo:myTerm^floatline(myFloatField,1.0,0.0f) @Override public int nextDoc() throws IOException { for (;;) { ++doc; if (doc >= maxDoc) { return doc = NO_MORE_DOCS; } if (acceptDocs != null
Re: files left open?
On Thu, Feb 16, 2012 at 5:56 PM, Paulo Magalhaes paulo.magalh...@gmail.com wrote: I was loading a big (60 million docs) CSV in Solr 4 when something odd happened. I got a Solr error in the log saying that it could not write the file. du -s indicated I had used 30GB of the 50GB available, but df -k indicated that the disk was 100% used. You probably hit a big segment merge, which does require more disk space temporarily. The difference between du and df probably just reflects how they work internally (du may just look at file sizes, and files that have not yet been closed can register as smaller, or even zero, relative to the amount of disk space they actually take up). -Yonik lucidimagination.com
Re: Setting solrj server connection timeout
I'm not sure that timeout will help you here - I believe it's the timeout on 'creating' the connection. Try setting the socket timeout (setSoTimeout) - that should let you try sooner. It looks like perhaps the server is timing out and closing the connection. I guess all you can do is time out reasonably (if it takes too long to wait for the exception) and retry. On Fri, Feb 3, 2012 at 3:12 PM, Shawn Heisey s...@elyograg.org wrote: Is the following a reasonable approach to setting a connection timeout with SolrJ?

queryCore.getHttpClient().getHttpConnectionManager().getParams().setConnectionTimeout(15000);

Right now I have all my solr server objects sharing a single HttpClient that gets created using the multithreaded connection manager, where I set the timeout for all of them. Now I will be letting each server object create its own HttpClient object, and using the above statement to set the timeout on each one individually. It'll use up a bunch more memory, as there are 56 server objects, but maybe it'll work better. The total of 56 objects comes about from 7 shards, a build core and a live core per shard, two complete index chains, and for each of those, one server object for access to CoreAdmin and another for the index. The impetus for this, as it's possible I'm stating an XY problem: Currently I have an occasional problem where SolrJ connections throw an exception. When it happens, nothing is logged in Solr. My code is smart enough to notice the problem, send an email alert, and simply try again at the top of the next minute. The simple explanation is that this is a Linux networking problem, but I never had any problem like this when I was using Perl with LWP to keep my index up to date. I sent a message to the list some time ago on this exception, but I never got a response that helped me figure it out.

Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
  at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
  at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
  at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
  ... 3 more
Caused by: java.net.SocketException: Connection reset
  at java.net.SocketInputStream.read(SocketInputStream.java:168)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
  at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
  at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
  at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
  at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
  at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
  at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
  at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
  at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
  at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
  at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
  at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
  ... 7 more

Thanks, Shawn -- - Mark http://www.lucidimagination.com
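For what it's worth, Mark's suggestion translates to something like this against the commons-httpclient 3.x API that CommonsHttpSolrServer uses (a sketch; the URL is a placeholder):

import java.net.MalformedURLException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TimeoutSetup {
    public static CommonsHttpSolrServer create() throws MalformedURLException {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://host:8983/solr");
        HttpClient client = server.getHttpClient();
        // timeout for establishing the TCP connection
        client.getHttpConnectionManager().getParams().setConnectionTimeout(15000);
        // timeout for waiting on data once the connection is established
        client.getHttpConnectionManager().getParams().setSoTimeout(15000);
        return server;
    }
}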
how to delta index linked entities in 3.5.0
The delta instructions from https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command work for me in Solr 1.4 but crash in 3.5.0 (error: "deltaQuery has no column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'"; issue: https://issues.apache.org/jira/browse/SOLR-2907). Is there anyone out there who can confirm my bug? Because I am new to Solr, hopefully I am just doing something wrong based on a misunderstanding of the wiki. Is anyone out there successfully indexing the join of items and multiple item_categories just like the wiki example, and willing to share their workaround or suggest one? Thanks, Adam
Re: Setting solrj server connection timeout
On 2/16/2012 6:28 PM, Mark Miller wrote: Im not sure that timeout will help you here - I believe it's the timeout on 'creating' the connection. Try setting the socket timeout (setSoTimeout) - that should let you try sooner. It looks like perhaps the server is timing out and closing the connection. I guess all you can do is timeout reasonably (if it takes too long to we for the exception) and retry. When the timeout exception happens, it is happening within the same second as the beginning of the update cycle, which involves a lot of other things happening (such as talking to a database) before it even gets around to talking to Solr. I do not have millisecond timestamps, but from what little I can tell, it's a handful of milliseconds from when SolrJ starts the request until the exception is logged. It happens relatively rarely - no more than once every few days, usually less often than that. I cannot reproduce it at will. Nobody is doing any work on either Solr or the network when it happens. Nothing is logged in the Solr server log or syslog at the OS level, the only mention of anything bad going on is in the log of my SolrJ application. I never had this problem when my build system was written in Perl, using LWP to make HTTP requests with URLs that I constructed myself. The perl system ran on CentOS 5 with Xen virtualization, now I'm running CentOS 6 on the bare metal. I'm using a bonded interface (for failover, not load balancing) comprised of two NICs plugged into separate switches. When it was virtualized, the Xen host was also using an identically configured bonded interface, bridged to the guests, which used eth0. The last time the error happened, which was on Feb 15th at 2:04 PM MST, the query that failed was 'did:(289800299 OR 289800157)', a very simple query against a tlong field. The application tests for the existence of the did values that it is trying to delete before it issues the delete request. I'm willing to look deeper into possible networking issues, but I am skeptical about that being the problem, and because there are no log messages to investigate, I have no idea how to proceed. The application runs on one of four Solr servers, sometimes the error even happens when connecting to Solr on the same server it's running on, which takes the gigabit switches out of the equation. If it's an actual networking problem, it's either in the hardware (Dell PowerEdge 2950 III, built-in NICs) or the CentOS 6 kernel. At this point, I am thinking it's one of the following problems, in order of decreasing probability: 1) I am using SolrJ incorrectly. 2) There is a SolrJ problem that only appears under specific circumstances that happen to exist in my setup. 3) My hardware or OS software has an extremely intermittent problem. What other info can I provide? Thanks, Shawn
Sort by the number of matching terms (coord value)
Hi, I'm looking for a way to sort results by the number of matching terms. Being able to sort by the coord() value or by the overlap value that gets passed into the coord() function would do the trick. Is there a way I can expose those values to the sort function? I'd appreciate any help that points me in the right direction. I'm OK with making basic code modifications. Thanks! -Nick
Re: how to delta index linked entities in 3.5.0
On 2/16/2012 6:31 PM, AdamLane wrote: The delta instructions from https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command works for me in solr 1.4 but crashes in 3.5.0 (error: deltaQuery has no column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID' issue: https://issues.apache.org/jira/browse/SOLR-2907) Is there anyone out there that can confirm my bug? Because I am new to solr and hopefully I am just doing something wrong based on a misunderstanding of the wiki. Anyone successfully indexing the join of items and multiple item_categories just like the wiki example that would be willing to share their workaround or suggest a workaround? I ran into something like this, possibly even this exact problem. Things have been tightened up in 3.x. All query results now need to have a field corresponding to what you've defined as pk, or it's considered an error. I was not using the results from my deltaQuery, but I still had to adjust it so that it returned a field with the same name as my primary key. You have defined more than one field for your pk, so I don't really know exactly what you'll have to do - perhaps you need to have both ITEM_ID and CATEGORY_ID fields in your query results. Thanks, Shawn
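As an illustration, a single-pk variant of the wiki's delta setup that satisfies the 3.x check might look like this in data-config.xml (table and column names are assumptions; the key point is that deltaQuery returns a column with the same name as pk; variable syntax per the DIH wiki):

<entity name="item" pk="ITEM_ID"
        query="SELECT * FROM item"
        deltaQuery="SELECT ITEM_ID FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE ITEM_ID = '${dih.delta.ITEM_ID}'">
  <!-- field mappings as usual -->
</entity>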
Re: Sort by the number of matching terms (coord value)
You can fool the Lucene scoring function: override each factor, such as idf, queryNorm and lengthNorm, and let them simply return 1.0f. I think Lucene 4 will expose more details, but for 2.x/3.x, Lucene can only score by the vector space model and the formula can't be replaced by users. On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com wrote: Hi, I'm looking for a way to sort results by the number of matching terms. Being able to sort by the coord() value or by the overlap value that gets passed into the coord() function would do the trick. Is there a way I can expose those values to the sort function? I'd appreciate any help that points me in the right direction. I'm OK with making basic code modifications. Thanks! -Nick
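A minimal sketch of that trick (against the 3.x DefaultSimilarity API; recent 3.x releases moved length normalization from lengthNorm into computeNorm, so override whichever your version exposes):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

// Flatten every factor to 1.0 and saturate tf, so the final score is
// dominated by coord(), i.e. the fraction of query terms that matched.
public class CoordOnlySimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // ignore how often a term occurs
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f; // ignore term rarity
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f; // don't normalize across queries
    }

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1.0f; // ignore field length and index-time boosts
    }
}

Registered via a <similarity> element in schema.xml, this leaves coord() untouched, so the default score ordering approximates "number of matching terms" - though it changes scoring globally rather than exposing the overlap value to the sort function.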
Improving proximity search performance
Here’s my use case. I expect to set up a Solr index that is approximately 1.4GB (this is a real number from the proof-of-concept using the real data, which consists of about 10 million documents, many of significant size, and making use of the FastVectorHighlighter to do highlighting on the body text field, which is of course stored, and with termVectors, termPositions, and termOffsets on). I no longer have the proof-of-concept Solr core available (our live site uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical answer to this question: Will storing that extra information about the location of terms help the performance of proximity searches? A significant and important subset of my users make extensive use of proximity searches. These sophisticated users have found that they are best able to locate what they want by doing searches about THISWORD within 5 words of THATWORD, or much more sophisticated variants on that theme, including plenty of booleans and wildcards. The problem I’m facing is performance. Some of these searches, when common words are used, can take many minutes, even with the index on an SSD. The question is, how to improve the performance. It occurred to me as possible that all of that term vector information, stored for the benefit of the FastVectorHighlighter, might be a significant aid to the performance of these searches. First question: is that already the case? Will storing this extra information automatically improve my proximity search performance? Second question: If not, I’m very willing to dive into the code and come up with a patch that would do this. Can someone with knowledge of the internals comment on whether this is a plausible strategy for improving performance, and, if so, give tips about the outlines of what a successful approach to the problem might look like? Third question: Any tips in general for improving the performance of these proximity searches? I have explored the question of whether the customers might be weaned off of them, and that does not appear to be an option. Thanks, -- Bryan Loofbourrow
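For concreteness, the searches in question are Lucene sloppy phrase queries, e.g.:

body:"THISWORD THATWORD"~5

(the wildcard-and-boolean variants would typically need a span-capable parser, such as the contrib surround parser, but the performance question is the same).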
RE: Frequent garbage collections after a day of operation
A couple of thoughts: We wound up doing a bunch of tuning on the Java garbage collection. However, the pattern we were seeing was periodic very extreme slowdowns, because we were then using the default garbage collector, which blocks when it has to do a major collection. This doesn't sound like your problem, but it's something to be aware of. One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. For example, if you have large documents, and have defined a large document cache, that might do it. I found it useful to point jconsole (free with the JDK) at my JVM, and watch the pattern of memory usage. If the troughs at the bottom of the GC cycles keep rising, you know you've got something that is continuing to grab more memory and not let go of it. Now that our JVM is running smoothly, we just see a sawtooth pattern, with the troughs approximately level. When the system is under load, the frequency of the wave rises. Try it and see what sort of pattern you're getting. -- Bryan -Original Message- From: Matthias Käppler [mailto:matth...@qype.com] Sent: Thursday, February 16, 2012 7:23 AM To: solr-user@lucene.apache.org Subject: Frequent garbage collections after a day of operation Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3 times increases on average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute force solution. The thing is, we don't know what's causing this and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to SOLR involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far. Thanks in advance, -Matthias -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com Managing Director: Ian Brotherston Amtsgericht Hamburg HRB 95913 This e-mail and its attachments may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail and its attachments. Any unauthorized copying, disclosure or distribution of this e-mail and its attachments is strictly forbidden. This notice also applies to future messages.
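For reference, the collector switch described above is a JVM-flag change; a sketch for the Sun/Oracle JVMs of that era (heap sizes illustrative):

java -Xmx4g -Xms4g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
     -jar start.jar

-XX:+UseConcMarkSweepGC avoids the long stop-the-world major collections of the default collector, and the GC log gives the same rising-trough signal that jconsole shows.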
Re: distributed deletes working?
Yup - deletes are fine. On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson jej2...@gmail.com wrote: With SOLR-2358 being committed to trunk, do deletes and updates get distributed/routed like adds do? Also, when a down shard comes back up, are the deletes/updates forwarded as well? Reading the JIRA I believe the answer is yes, I just want to verify before bringing the latest into my environment. -- - Mark http://www.lucidimagination.com
Re: Sort by the number of matching terms (coord value)
I want to leave the score intact so I can sort by matching term frequency and then by score. I don't think I can do that if I modify all the similarity functions, but I think your solution would have worked otherwise. It would be great if there was a way I could expose this information through a function query (similar to the new relevance functions in version 4.0). I'll have to see if I can figure out how those functions work. -Nick On Thu, Feb 16, 2012 at 6:58 PM, Li Li fancye...@gmail.com wrote: you can fool the lucene scoring fuction. override each function such as idf queryNorm lengthNorm and let them simply return 1.0f. I don't lucene 4 will expose more details. but for 2.x/3.x, lucene can only score by vector space model and the formula can't be replaced by users. On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com wrote: Hi, I'm looking for a way to sort results by the number of matching terms. Being able to sort by the coord() value or by the overlap value that gets passed into the coord() function would do the trick. Is there a way I can expose those values to the sort function? I'd appreciate any help that points me in the right direction. I'm OK with making basic code modifications. Thanks! -Nick
Ranking based on number of matches in a multivalued field?
So suppose I have a multivalued field for categories. Let's say we have 3 items with these categories: Item 1: category ids [1,2,5,7,9] Item 2: category ids [4,8,9] Item 3: category ids [1,4,9] I now run a filter query for any of the following category ids: [1,4,9]. I should get all of them back as results because they all include at least one category which I'm querying. Now, how do I order them based on the number of matching categories? In this case, I would like Item 3 (matched all of [1,4,9]) to be ranked highest, followed by Item 2 (matched [4,9]) and Item 1 (matched [1,9]). Is there a way I can boost documents based on the number of matches? I don't want an absolute rank where Item 3 is definitely the first result, but rather a way to boost Item 3's score higher than that of Item 1 and 2 so that it's more likely to show up higher (depending on the query string). Thanks! -- Steven Ou | 歐偉凡 *ravn.com* | Chief Technology Officer steve...@gmail.com | +1 909-569-9880
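One way to get that effect with stock Solr is to keep the filter query for matching but repeat the category clause as the scoring query (or as a dismax bq); with an assumed category_ids field:

q=category_ids:(1 OR 4 OR 9)&fq=category_ids:(1 OR 4 OR 9)

Documents matching more of the OR'd clauses score higher (via the coord factor), though idf differences between the individual category terms can still skew the order.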
UpdateRequestHandler coding
If I want to write a complex UpdateRequestHandler, should I do it on trunk or the 3.x branch? The criteria are a stable, debugged, full-featured environment. -- Lance Norskog goks...@gmail.com
Re: Size of suggest dictionary
Thanks Em! What if we use a threshold value in the suggest configuration, like

<float name="threshold">0.005</float>

I assume the dictionary size will then be smaller than the total number of distinct terms; is there any way to determine what that size is? Thanks, Mike On Wednesday, February 15, 2012 at 4:39 PM, Em wrote: Hello Mike, have a look at Solr's Schema Browser. Click on FIELDS, select label and have a look at the number of distinct (term-)values. Regards, Em Am 15.02.2012 23:07, schrieb Mike Hugo: Hello, We're building an auto suggest component based on the label field of documents. Is there a way to see how many terms are in the dictionary, or how much memory it's taking up? I looked on the statistics page but didn't find anything obvious. Thanks in advance, Mike ps- here's the config:

<searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggestlabel</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">label</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="suggestlabel" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggestlabel</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggestlabel</str>
  </arr>
</requestHandler>
Re: Frequent garbage collections after a day of operation
One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. This [uncommitted] issue would solve that problem by allowing the GC to collect caches that become too large, though in practice, the cache setting would need to be fairly large for an OOM to occur from them: https://issues.apache.org/jira/browse/SOLR-1513 On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: A couple of thoughts: We wound up doing a bunch of tuning on the Java garbage collection. However, the pattern we were seeing was periodic very extreme slowdowns, because we were then using the default garbage collector, which blocks when it has to do a major collection. This doesn't sound like your problem, but it's something to be aware of. One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. For example, if you have large documents, and have defined a large document cache, that might do it. I found it useful to point jconsole (free with the JDK) at my JVM, and watch the pattern of memory usage. If the troughs at the bottom of the GC cycles keep rising, you know you've got something that is continuing to grab more memory and not let go of it. Now that our JVM is running smoothly, we just see a sawtooth pattern, with the troughs approximately level. When the system is under load, the frequency of the wave rises. Try it and see what sort of pattern you're getting. -- Bryan -Original Message- From: Matthias Käppler [mailto:matth...@qype.com] Sent: Thursday, February 16, 2012 7:23 AM To: solr-user@lucene.apache.org Subject: Frequent garbage collections after a day of operation Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3 times increases on average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute force solution. The thing is, we don't know what's causing this and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to SOLR involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far. Thanks in advance, -Matthias -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com Managing Director: Ian Brotherston Amtsgericht Hamburg HRB 95913 This e-mail and its attachments may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail and its attachments. Any unauthorized copying, disclosure or distribution of this e-mail and its attachments is strictly forbidden. This notice also applies to future messages.