Re: Faceting Question

2012-11-15 Thread Alexey Serba
Seems like pivot faceting is what you're looking for (
http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting
)

Note: it currently does not work in distributed mode - see
https://issues.apache.org/jira/browse/SOLR-2894
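
For reference, a pivot facet request for this case could look roughly like
the following (field names "source" and "date" are taken from the question;
the date field would need discrete values, e.g. day buckets, since pivots
count distinct field values rather than ranges):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=source,date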

On Thu, Nov 15, 2012 at 7:46 AM, Jamie Johnson  wrote:
> Sorry some more info. I have a field to store source and another for date.
>  I currently use faceting to get a temporal distribution across all
> sources.  What is the best way to get a temporal distribution per source?
>  Is the only thing I can do to execute 1 query for the list of sources and
> then another query for each source?
>
> On Wednesday, November 14, 2012, Jamie Johnson  wrote:
>> I've recently been asked to be able to display a temporal facet broken
> down by source, so source1 has the following temporal distribution, source
> 2 has the following temporal distribution etc.  I was wondering what the
> best way to accomplish this is?  My current thoughts were that I'd need to
> execute a completely separate query for each, is this right?  Could field
> aliasing some how be used to execute this in a single request to solr?  Any
> thoughts would really be appreciated.


Re: Faceting Facets

2012-09-03 Thread Alexey Serba
http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting
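
For this particular case a pivot facet request could look roughly like the
one below, assuming the hour is indexed as its own discrete field (e.g. a
string such as 2012-09-03T14), since pivots count distinct values rather
than ranges; field names are illustrative:

http://localhost:8983/solr/select?q=timestamp:[NOW-24HOURS TO NOW]&rows=0&facet=true&facet.pivot=user,hour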

On Mon, Sep 3, 2012 at 6:38 PM, Dotan Cohen  wrote:
> Is there any way to nest facet searches in Solr? Specifically, I have
> a User field and a DateTime field. I need to know how many Documents
> match each User for each one-hour period in the past 24 hours. That
> is, 16 Users * 24 time periods = 384 values to return.
>
> I could run 16 queries and facet on DateTime, or 24 queries and facet
> on User. However, if there is a way to facet the facets, then I would
> love to know. Thanks!
>
> --
> Dotan Cohen
>
> http://gibberish.co.il
> http://what-is-what.com


Re: Java class "[B" has no public instance field or method named "split".

2012-08-31 Thread Alexey Serba
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5
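
If I remember right, the fix from that FAQ boils down to letting the JDBC
data source convert values to the schema field type; roughly (attribute
values are illustrative):

<dataSource driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/db"
            user="user" password="pass"
            convertType="true"/>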

On Sat, Sep 1, 2012 at 2:17 AM, Cirelli, Stephen J.
 wrote:
> Anyone know why I'm getting this exception? I'm following the example
> here < http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer>
> but I get the below error. The field type in my schema.xml is string,
> text doesn't work either. Why would I get an error that there's no split
> method on a string?
>
> Caused by: sun.org.mozilla.javascript.internal.EvaluatorException: Java
> class "[B" has no public instance field or method named "split".
> (#52)
>
> Here's the JS
>
> function parseAttachments(row){
> var mainDelim = '(|)', subDelim = '-|-',
> attRow = [//This must be in the order that it was
> concatinated in the query.
> { index:0, field:'attachmentFileName',
> arr: new java.util.ArrayList()},
> { index:1, field:'attachmentSize',
> arr: new java.util.ArrayList()},
> { index:2, field:'attachmentMIMEType',
> arr: new java.util.ArrayList()},
> { index:3,
> field:'attachmentExtractedText', arr: new java.util.ArrayList()},
> { index:4, field:'attachmentLink',
> arr: new java.util.ArrayList()}
> ]
>
> var allAttachments =
> row.get('attachments').split(mainDelim);
> for(var i=0,l=allAttachments.length; i<l; i++){
>     var attachment = allAttachments[i].split(subDelim);
>
> for(var j=0,jl=attRow.length; j<jl; j++){
>     var itm = attachment[j],
> arr = attRow[j].arr;
> arr.add(itm);
> }
> }
> for(var j=0,jl=attRow.length; j<jl; j++){
>     var itm = attRow[j];
> row.put(itm.field, itm.arr);
> }
> row.remove('attachments');
> return row;
> }


Re: Query Time problem on Big Index Solr 3.5

2012-08-31 Thread Alexey Serba
1. Use filter queries

> Here a example of query, there are any incorrect o anything that can I
> change?
> http://xxx:8893/solr/candidate/select/?q=+(IdCandidateStatus:2)+(IdCobranded:3)+(IdLocation1:12))+(LastLoginDate:[2011-08-26T00:00:00Z
> TO 2012-08-28T00:00:00Z])

What is the logic here? Are you AND-ing these boolean clauses? If yes,
then I would change queries to

http://xxx:8893/solr/candidate/select/?q=*:*&fq=IdCandidateStatus:2&fq=IdCobranded:3&fq=IdLocation1:12&fq=LastLoginDate:[2011-08-26T00:00:00Z
TO 2012-08-28T00:00:00Z]

I.e. move the clauses into fq (filter query) parameters.
* It should be faster, as it seems you don't need scoring here; sort by
id/date instead.
* fq-s are cached separately, thus increasing the cache hit rate.

2. Do not optimize your index

> I have a master, and 6 slaves, they are been syncronized every 10 minutes. 
> And the index always is optimized.
DO NOT optimize your index! (unless you re-create the whole index
completely every 10 mins). It basically kills the idea of replication
(after every optimize command slaves download the whole index).


Re: Injest pauses

2012-08-29 Thread Alexey Serba
Could you take a jstack dump when it's happening and post it here?

> Interestingly it is not pausing during every commit so at least a portion of 
> the time the async commit code is working.  Trying to track down the case 
> where a wait would still be issued.
> 
> -Original Message-
> From: Voth, Brad (GE Corporate) 
> Sent: Wednesday, August 29, 2012 12:32 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Injest pauses
> 
> Thanks, I'll continue with my testing and tracking down the block.
> 
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Wednesday, August 29, 2012 12:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Injest pauses
> 
> On Wed, Aug 29, 2012 at 11:58 AM, Voth, Brad (GE Corporate) 
>  wrote:
>> Anyone know the actual status of SOLR-2565, it looks to be marked as 
>> resolved in 4.* but I am still seeing long pauses during commits using
>> 4.*
> 
> SOLR-2565 is definitely committed - adds are no longer blocked by commits (at 
> least at the Solr level).
> 
> -Yonik
> http://lucidworks.com


Re: LateBinding

2012-08-29 Thread Alexey Serba
http://searchhub.org/dev/2012/02/22/custom-security-filtering-in-solr/

See section about PostFilter.

On Wed, Aug 29, 2012 at 4:43 PM,   wrote:
> Hello,
>
> Has anyone ever implementet the security feature called late-binding?
>
> I am trying this but I am very new to solr and I would be very glad if I
> would get some hints to this.
>
> Regards,
> Johannes


Re: Injest pauses

2012-08-29 Thread Alexey Serba
Hey Brad,

> This leads me to believe that a single merge thread is blocking indexing from 
> occuring.
> When this happens our producers, which distribute their updates amongst all 
> the shards, pile up on this shard and wait.
Which version of Solr are you using? Have you tried the 4.0 beta?

* 
http://searchhub.org/dev/2011/04/09/solr-dev-diary-solr-and-near-real-time-search/
* https://issues.apache.org/jira/browse/SOLR-2565

Alexey


Re: Sharing and performance testing question.

2012-08-29 Thread Alexey Serba
> Any tips on load testing solr? Ideally we would like caching to not effect
> the result as much as possible.

1. Siege tool
This is probably the simplest option. You can generate a urls.txt file
and pass it to the tool. You should also capture server-side performance
(CPU, memory, QPS, etc.) using tools like New Relic, Zabbix, etc.
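
For example, something along these lines (the numbers are arbitrary):

siege -c 20 -t 10M -f urls.txt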

2. SolrMeter
http://code.google.com/p/solrmeter/

3. Solr benchmark module (not committed yet)
It lets you run complex benchmarks using different algorithms:
* https://issues.apache.org/jira/browse/SOLR-2646
* 
http://searchhub.org/dev/2011/07/11/benchmarking-the-new-solr-near-realtime-improvements/


Re: Indexing and querying BLOBS stored in Mysql

2012-08-24 Thread Alexey Serba
I would recommend creating a simple data import handler config just to
test Tika parsing of large BLOBs, i.e. remove unrelated entities, remove
all the configuration for delta imports, and keep only the entity that
retrieves the blobs plus the entity that parses the binary content
(fieldReader/TikaEntityProcessor). See the sketch after the comments below.

Some comments:
1. Maybe you are running a delta import and there are no new records in the database?
2. deltaQuery should only return ids and not other columns/data,
because you don't use them in deltaImportQuery (see
dataimporter.delta.id).
3. Not all entities list HTMLStripTransformer in their transformer
attribute, yet use stripHTML on fields. TemplateTransformer is declared
but not used at all.
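
To be concrete, a stripped-down config of the kind I mean might look
roughly like the sketch below (table/column names are copied from your
config; depending on the Solr version the binary-field data source is
FieldStreamDataSource or FieldReaderDataSource, so treat this as a sketch,
not a drop-in config):

<dataConfig>
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://host/db" user="user" password="pass"/>
  <dataSource name="fieldSource" type="FieldStreamDataSource"/>
  <document>
    <!-- single root entity: just the id and the blob column -->
    <entity name="bin_docs" dataSource="db"
            query="select id, bin_con from aitiologikes_ektheseis where type = 'bin'">
      <field column="id" name="id"/>
      <!-- nested entity that runs Tika over the blob -->
      <entity name="tika" dataSource="fieldSource"
              processor="TikaEntityProcessor"
              dataField="bin_docs.bin_con" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>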

>dataSource="db"
> transformer="HTMLStripTransformer"
> query="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
> body,' ',title)  AS content from aitiologikes_ektheseis where type = 'text'"
> deltaImportQuery="select id, title, title AS grid_title, model, type, 
> url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
> body,' ',title)  AS content from aitiologikes_ektheseis where type = 'text'
> and id='${dataimporter.delta.id}'"
> deltaQuery="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
> body,' ',title)  AS content from aitiologikes_ektheseis where type = 'text'
> and last_modified > '${dataimporter.last_index_time}'">
> 
> 
> 
>  />
> 
> 
> 
>  stripHTML="true"  />
>  />
> 
> 
>
>query="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
> text from aitiologikes_ektheseis where type = 'bin'"
>   deltaImportQuery="select id, title, title AS grid_title, model, 
> type,
> url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con
> AS text from aitiologikes_ektheseis where type = 'bin' and
> id='${dataimporter.delta.id}'"
>   deltaQuery="select id, title, title AS grid_title, model, type, url,
> last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
> text from aitiologikes_ektheseis where type = 'bin' and last_modified >
> '${dataimporter.last_index_time}'"
>   transformer="TemplateTransformer"
>   dataSource="db">
>
>   
> 
>   
>stripHTML="true" />
>   
>   
>   
>stripHTML="true"  />
>stripHTML="true" />
>
>  processor="TikaEntityProcessor"
> dataField="aitiologikes_ektheseis_bin.text" format="text">
>   
> 
>
> 
>
> ...
> ...
> 
>
> 
>
> *A portion from schema.xml (the fieldTypes and filed definition):*
>
>  positionIncrementGap="100">
>
>   
> 
>  words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
> 
> 
>  words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
>   
> 
>  ignoreCase="true" expand="true"/>
>  words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
>  words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
> 
> 
> 
> 
>  protected="protwords.txt"/>
> 
>   
> 
>
>
>
>
> 
> 
> 
> 
> 
> 
>  words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
> 
> 
>  dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff"
> ignoreCase="true" />
> 
>
> 
> 
> 
> 
> 
> 
>  words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
> 
> 
>  dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff"
> ignoreCase="true" />
> 
> 
>
>
> 
>multiValued="false"/>
>multiValued="false"/>
>stored="true"/>
>stored="true"/>
>multiValued="false"/>
>   
>   
>   
>   
>multiValued="true"/>
>stored="true" multiValued="true"/>
> 
>
> I really need help on this!
>
> With respect,
>
> Tom
>
> Greece
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Indexing-and-querying-BLOBS-stored-in-Mysql-tp4002940.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom Geocoder with Solr and Autosuggest

2012-08-16 Thread Alexey Serba
> My first decision was to divide SOLR into two cores, since I am already
> using SOLR as my search server. One core would be for the main search of the
> site and one for the geocoding.
Correct. And you can even use that location index/collection to extract
locations from unstructured documents - i.e. if you don't have a separate
field with geographical names in your corpus (or the location data is just
not good enough compared to what can be mined from the documents).

> My second decision is to store the name data in a normalised state, some
> examples are shown below:
> London, England
> England
> Swindon, Wiltshire, England
Yes, you can add postcodes/outcodes there as well. And I would add an
additional "type" field with values like region/county/town/postcode/outcode.
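
For illustration, a normalised location document could then look something
like this (field names are just an example):

<doc>
  <field name="id">loc-1234</field>
  <field name="name">Swindon, Wiltshire, England</field>
  <field name="type">town</field>
  <field name="latlon">51.56,-1.78</field>
</doc>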

> The third decision was to return “autosuggest” results, for example when the
> user types “Lond” I would like to suggest “London, England”. For this to
> work I think it makes sense to return up to 5 results via JSON based on
> relevancy and have these displayed under the search box.
Yeah, you might want to boost cities more than towns (I'm sure there
are plenty of ambiguous terms), use some kind of GeoIP service, add
additional scoring factors, etc.

> My fourth decision is that when the user actually hits the “search” button
> on the location field, SOLR is again queries and returns the most relevant
> result, including the co-ordinates which are stored.
You can also have special logic to decide whether you want to use spatial
search or whether a simple textual match would be better. E.g. you have
"England" in your example; it doesn't sound practical to return
coordinates and use spatial search for that case, right?

HTH,
Alexey


Re: MySQL Exception: Communications link failure WITH DataImportHandler

2012-08-16 Thread Alexey Serba
My memory is vague, but I think I've seen something similar with older
versions of Solr.

Is it possible that you have a large database import and a big segment
merge happens in the middle, blocking the DIH indexing process (and the
reading of records from the database as well)? That would mean a long
period of inactivity on the database connection and a timeout as a
result. If this is the case then you can either increase the timeout limit
on the db server (I don't remember the actual parameter) or upgrade Solr
to a newer version that doesn't have such long pauses (4.0 beta?).
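
If you go the timeout route, the MySQL-side settings are most likely
wait_timeout and net_write_timeout (naming them from memory, so please
double-check), e.g.:

SET GLOBAL wait_timeout = 28800;
SET GLOBAL net_write_timeout = 600;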

On Thu, Aug 16, 2012 at 12:37 PM, Jienan Duan  wrote:
> Hi all:
> I have resolved this problem by configuring a jndi datasource in tomcat.
> But I still want to find out why it throw an exception in DIH when I
> configure datasource in data-configure.xml but a jndi resource.
>
> Regards.
>
> 2012/8/16 Jienan Duan 
>
>> Hi all:
>> I'm using DataImportHandler load data from MySQL.
>> It works fine on my develop machine and online environment.
>> But I got an exception on test environment:
>>
>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
 Communications link failure
>>>
>>>
 The last packet sent successfully to the server was 0 milliseconds ago.
 The driver has not received any packets from the server.
>>>
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
>>>
>>> at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>
>>> at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>
>>> at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
>>>
>>> at
 com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
>>>
>>> at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:343)
>>>
>>> at
 com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2132)
>>>
>>> ... 26 more
>>>
>>> Caused by: java.net.ConnectException: Connection timed out
>>>
>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>
>>> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>>>
>>> at
 java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>>>
>>> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>>>
>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>>
>>> at java.net.Socket.connect(Socket.java:529)
>>>
>>> at java.net.Socket.connect(Socket.java:478)
>>>
>>> at java.net.Socket.<init>(Socket.java:375)
>>>
>>> at java.net.Socket.<init>(Socket.java:218)
>>>
>>> at
 com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:253)
>>>
>>> at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:292)
>>>
>>> ... 27 more
>>>
>>> This make me confused,because the test env and online env almost
>> same:Tomcat runs on a Linux Server with JDK6,MySql5 runs on another.
>> Even I wrote a simple JDBC test class it works,a jsp file with JDBC code
>> also works.Only DataImportHandler failed.
>> I'm trying to read Solr source code and found that it seems Solr has it's
>> own ClassLoader.I'm not sure if it goes wrong with Tomcat on some specific
>> configuration.
>> Dose anyone know how to fix this problem? Thank you very much.
>>
>> Best Regards.
>>
>> Jienan Duan
>>
>> --
>> --
>> 不走弯路,就是捷径。
>> http://www.jnan.org/
>>
>>
>
>
> --
> --
> 不走弯路,就是捷径。
> http://www.jnan.org/


Re: Solr Index linear growth - Performance degradation.

2012-08-14 Thread Alexey Serba
>10K queries
How do you generate these queries? I.e. is this a single or multi
threaded application?

Can you provide full queries you send to Solr servers and solrconfig
request handler configuration? Do you use function queries, grouping,
faceting, etc?


On Tue, Aug 14, 2012 at 10:31 AM, feroz_kh  wrote:
> Its 7,200,000 hits == number of documents found by all 10K queries.
> We have RHEL tikanga version.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Index-linear-growth-Performance-degradation-tp4000934p4001069.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Running out of memory

2012-08-12 Thread Alexey Serba
> It would be vastly preferable if Solr could just exit when it gets a memory
> error, because we have it running under daemontools, and that would cause
> an automatic restart.
-XX:OnOutOfMemoryError="<cmd args>; <cmd args>"
Run user-defined commands when an OutOfMemoryError is first thrown.
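
For example, to make the JVM kill itself so that daemontools restarts it
(%p expands to the process id):

-XX:OnOutOfMemoryError="kill -9 %p"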

> Does Solr require the entire index to fit in memory at all times?
No.

But it's hard to say anything about your particular problem without more
information. How often do you commit? Do you use faceting? Do you sort by
Solr fields, and if so, what are those fields? You should also check the
caches.


Re: Is this too much time for full Data Import?

2012-08-08 Thread Alexey Serba
9M documents * 15 SQL queries each - that's a lot of queries (>400 queries per second over 80 hours).

I would try reduce the number of queries:

1. Rewrite your main (root) query to select all possible data
* use SQL joins instead of DIH nested entities
* select data from 1-N related tables (tags, authors, etc) in the main
query using the GROUP_CONCAT aggregate function (that's a MySQL-specific
function, but there are similar functions for other RDBMS-es) and then
split the concatenated data in a DIH transformer (see the sketch after this list).

2. Identify small tables in nested entities and cache them completely
in CachedSqlEntityProcessor.
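
A rough sketch of the GROUP_CONCAT approach (table/column names are made up):

select d.id, d.title,
       GROUP_CONCAT(t.name SEPARATOR '|') as tags
from documents d
left join tags t on t.doc_id = d.id
group by d.id, d.title

and then split the concatenated column back into a multivalued field with
the RegexTransformer:

<entity name="doc" transformer="RegexTransformer"
        query="select d.id, d.title, GROUP_CONCAT(t.name SEPARATOR '|') as tags from documents d left join tags t on t.doc_id = d.id group by d.id, d.title">
  <field column="tags" splitBy="\|"/>
</entity>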



On Wed, Aug 8, 2012 at 10:35 AM, Mikhail Khludnev
 wrote:
> Hello,
>
> Does your indexer utilize CPU/IO? - check it by iostat/vmstat.
> If it doesn't, take several thread dumps by jvisualvm sampler or jstack,
> try to understand what blocks your threads from progress.
> It might happen you need to speedup your SQL data consumption, to do this,
> you can enable threads in DIH (only in 3.6.1), move from N+1 SQL queries to
> select all/cache approach
> http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor and
> https://issues.apache.org/jira/browse/SOLR-2382
>
> Good luck
>
> On Wed, Aug 8, 2012 at 9:16 AM, Pranav Prakash  wrote:
>
>> Folks,
>>
>> My full data import takes ~80hrs. It has around ~9m documents and ~15 SQL
>> queries for each document. The database servers are different from Solr
>> Servers. Each document has an update processor chain which (a) calculates
>> signature of the document using SignatureUpdateProcessorFactory and (b)
>> Finds out terms which have term frequency > 2; using a custom processor.
>> The index size is ~ 480GiB
>>
>> I want to know if the amount of time taken is too large compared to the
>> document count? How do I benchmark the stats and what are some of the ways
>> I can improve this? I believe there are some optimizations that I could do
>> at Update Processor Factory level as well. What would be a good way to get
>> dirty on this?
>>
>> *Pranav Prakash*
>>
>> "temet nosce"
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> 
>  


Re: Large RDBMS dataset

2011-12-29 Thread Alexey Serba
> The problem is that for each record in "fd", Solr makes three distinct SELECT 
> on the other three tables. Of course, this is absolutely inefficient.

You can also try to use GROUP_CONCAT (it's a MySQL function, but maybe
there's something similar in MS SQL) to select all the nested 1-N
entities in a single result set as strings joined with some separator,
and then split them into multivalued fields in a post-processing phase
(using the RegexTransformer or similar).
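
In MS SQL the usual workaround is the FOR XML PATH trick; roughly (table
and column names are invented):

SELECT fd.id,
       STUFF((SELECT '|' + t.tag
              FROM tags t
              WHERE t.fd_id = fd.id
              FOR XML PATH('')), 1, 1, '') AS tags
FROM fd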


Re: Decimal Mapping problem

2011-12-29 Thread Alexey Serba
Try to cast MySQL decimal data type to string, i.e.

CAST( IF(drt.discount IS NULL,'0',(drt.discount/100)) AS CHAR) as discount
(or CAST AS TEXT)

On Mon, Dec 19, 2011 at 1:24 PM, Niels Stevens  wrote:
> Hey everybody,
>
> I'm having an issue importing Decimal numbers from my Mysql DB to Solr.
> Is there anybody with some advise, I will start and try to explain my
> problem.
>
> According to my findings, I think the lack of a explicit mapping of a
> Decimal value in the schema.xml
> is causing some issues I'm experiencing.
>
> The decimal numbers I'm trying to import look like this :
>
> 0.075000
> 7.50
> 2.25
>
>
> but after the import statement the results for the equivalent Solr field
> are returned as this:
>
> [B@1413d20
> [B@11c86ff
> [B@1e2fd0d
>
>
> The import statement for this particular field looks like:
>
>  IF(drt.discount IS NULL,'0',(drt.discount/100)) ...
>
>
> Now I thought that using the Round functions from mysql to 3 numbers after
> the dot.
> In conjunction with a explicite mapping field in the schema.xml could solve
> this issue.
> Is there someone with some similar problems with decimal fields or anybody
> with an expert view on this?
>
> Thanks a lot in advance.
>
> Regards,
>
> Niels Stevens


Re: a question on jmx solr exposure

2011-12-29 Thread Alexey Serba
Which Solr version do you use? Maybe it has something to do with
default collection?

I do see separate jmx domain for every collection, i.e.

solr/collection1
solr/collection2
solr/collection3
...

On Wed, Dec 21, 2011 at 1:56 PM, Dmitry Kan  wrote:
> Hello list,
>
> This might be not the right place to ask the jmx specific questions, but I
> decided to try, as we are polling SOLR statistics through jmx.
>
> We currently have two solr cores with different schemas A and B being run
> under the same tomcat instance. Question is: which stat is jconsole going
> to see under solr/ ?
>
> From the numbers (e.g. numDocs of searcher), jconsole see the stats of A.
> Where do stats of B go? Or is firstly activated core will capture the jmx
> "pipe" and won't let B's stats to go through?
>
> --
> Regards,
>
> Dmitry Kan


Re: Solr 3.3: DIH configuration for Oracle

2011-08-17 Thread Alexey Serba
Why do you need to collect both primary keys T1_ID_RECORD and
T2_ID_RECORD in your delta query? Isn't the T2_ID_RECORD primary key value
enough to get all the data from both tables? (You have a 1-N relation
between table1 and table2, right?)

On Thu, Aug 11, 2011 at 12:52 AM, Eugeny Balakhonov  wrote:
> Hello, all!
>
>
>
> I want to create a good DIH configuration for my Oracle database with deltas
> support. Unfortunately I am not able to do it well as DIH has the strange
> restrictions.
>
> I want to explain a problem on a simple example. In a reality my database
> has very difficult structure.
>
>
>
> Initial conditions: Two tables with following easy structure:
>
>
>
> Table1
>
> -          ID_RECORD    (Primary key)
>
> -          DATA_FIELD1
>
> -          ..
>
> -          DATA_FIELD2
>
> -          LAST_CHANGE_TIME
>
> Table2
>
> -          ID_RECORD    (Primary key)
>
> -          PARENT_ID_RECORD (Foreign key to Table1.ID_RECORD)
>
> -          DATA_FIELD1
>
> -          ..
>
> -          DATA_FIELD2
>
> -          LAST_CHANGE_TIME
>
>
>
> In performance reasons it is necessary to do selection of the given tables
> by means of one request (via inner join).
>
>
>
> My db-data-config.xml file:
>
>
>
> 
>
> 
>
>     password=""/>
>
>    
>
>        
>            query="select * from TABLE1 t1 inner join TABLE2 t2 on
> t1.ID_RECORD = t2.PARENT_ID_RECORD"
>
>            deltaQuery="select t1.ID_RECORD T1_ID_RECORD, t1.ID_RECORD
> T2_ID_RECORD
>
>                               from TABLE1 t1 inner join TABLE2 t2 on
> t1.ID_RECORD = t2.PARENT_ID_RECORD
>
>                               where TABLE1.LAST_CHANGE_TIME >
> to_date('${dataimporter.last_index_time}', '-MM-DD HH24:MI:SS')
>
>                               or TABLE2.LAST_CHANGE_TIME >
> to_date('${dataimporter.last_index_time}', '-MM-DD HH24:MI:SS')"
>
>            deltaImportQuery="select * from TABLE1 t1 inner join TABLE2 t2
> on t1.ID_RECORD = t2.PARENT_ID_RECORD
>
>            where t1.ID_RECORD = ${dataimporter.delta.T1_ID_RECORD} and
> t2.ID_RECORD = ${dataimporter.delta.T2_ID_RECORD}"
>
>        />
>
>    
>
> 
>
>
>
> In result I have following error:
>
>
>
> java.lang.IllegalArgumentException: deltaQuery has no column to resolve to
> declared primary key pk='T1_ID_RECORD, T2_ID_RECORD'
>
>
>
> I have analyzed the source code of DIH. I found that in the DocBuilder class
> collectDelta() method works with value of entity attribute "pk" as with
> simple string. But in my case this is array with two values: T1_ID_RECORD,
> T2_ID_RECORD
>
>
>
> What do I do wrong?
>
>
>
> Thanks,
>
> Eugeny
>
>
>
>


Re: Weird issue with solr and jconsole/jmx

2011-06-24 Thread Alexey Serba
I just encountered the same bug - JMX registered beans don't survive
Solr core reloads.

I believe the reason is that when you reload a core:
* when the new core is created, it overwrites/re-registers beans in the
registry (in the MBean server)
* when the new core is ready, in the core registration phase CoreContainer
closes the old core, which results in unregistering the JMX beans

As a result, only one bean
("id=org.apache.solr.search.SolrIndexSearcher,type=Searcher@33099cc
main") is left in the registry after the core reload. That is because it is
the only new (dynamically named) bean that is created by the new core and
not un-registered in oldCore.close(). I'll try to reproduce that in a test
and file a bug in Jira.


On Tue, Mar 16, 2010 at 4:25 AM, Andrew Greenburg  wrote:
> On Tue, Mar 9, 2010 at 7:44 PM, Chris Hostetter
>  wrote:
>>
>> : I connected to one of my solr instances with Jconsole today and
>> : noticed that most of the mbeans under the solr hierarchy are missing.
>> : The only thing there was a Searcher, which I had no trouble seeing
>> : attributes for, but the rest of the statistics beans were missing.
>> : They all show up just fine on the stats.jsp page.
>> :
>> : In the past this always worked fine. I did have the core reload due to
>> : config file changes this morning. Could that have caused this?
>>
>> possibly... reloading the core actually causes a whole new SolrCore
>> object (with it's own registry of SOlrInfoMBeans) to be created and then
>> swapped in place of hte previous core ... so perhaps you are still looking
>> at the "stats" of the old core which is no longer in use (and hasn't been
>> garbage collected because the JMX Manager still had a refrence to it for
>> you? ... i'm guessing at this point)
>>
>> did disconnecting from jconsole and reconnecting show you the correct
>> stats?
>
> Disconnecting and reconnecting didn't help. The queryCache and
> documentCache and some others started showing up after I did a commit
> and opened a new searcher, but the whole tree never did fill in.
>
> I'm guessing that the request handler stats stayed associated with the
> old, no longer visible core in JMX since new instances weren't created
> when the core reloaded. Does that make sense? The stats on the web
> stats page continued to be fresh.
>


Re: Solr and Tag Cloud

2011-06-19 Thread Alexey Serba
Say you have a multivalued field _tag_ attached to every document in
your corpus. Then you can build a tag cloud for the whole data set, or for
a specific query, by retrieving facets on the _tag_ field for "*:*" or any
other query. You'll get a list of popular _tag_ values relevant to that
query, with occurrence counts.

If you want to build a tag cloud from general analyzed text fields you
can still do it the same way, but note that you can hit
performance/memory problems if you have a significant data set and
huge text fields. You should probably use stop words to filter out overly
common terms.
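
In request terms that's just a plain facet query, e.g. (field name and
limits are illustrative):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=tag&facet.limit=50&facet.mincount=1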

On Sat, Jun 18, 2011 at 8:12 AM, Jamie Johnson  wrote:
> Does anyone have details of how to generate a tag cloud of popular terms
> across an entire data set and then also across a query?
>


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-17 Thread Alexey Serba
> Do you mean that we  have current Index as it is and have a separate core
> which  has only the user-id ,product-id relation and at while querying ,do a
> join between the two cores based on the user-id.
Exactly. You can index the user-id / product-id relation either into the same
core or into a different core on the same Solr instance.

> This would involve us to Index/delete the product  as and when the user
> subscription for a product changes ,This would involve some amount of
> latency if the Indexing (we have a queue system for Indexing across the
> various instances) or deletion is delayed
Right, but I'm not sure it's possible to achieve good performance while
requiring zero latency.

> IF we want to go ahead with this solution ,We currently are using solr 1.3
> , so  is this functionality available as a patch for solr 1.3?
No. AFAIK it's in trunk only.

> Would it be
> possible to  do with a separate Index  instead of a core ,then I can create
> only one  Index common for all our instances and then use this instance to
> do the join.
No, I don't think that's possible with the join feature. I guess that
would require a network request per search request, and the number of mapped
ids could be huge, so it could affect performance significantly.

> You'll need to be a bit careful using joins, as the performance hit
> can be significant if you have lots of cross-referencing to do, which
> I believe you would given your scenario.
As far as I understand, the join query builds a bitset filter which can
be cached in the filterCache, etc. The only performance issue I can think
of is that the user-product relation table could be too big to fit into a
single instance.


Re: Updating only one indexed field for all documents quickly.

2011-06-16 Thread Alexey Serba
>> with the integer field. If you just want to influence the
>> score, then just plain external field fields should work for
>> you.
>
> Is this an appropriate solution, give our use case?
>
Yes, check out ExternalFileField

* http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.4
* 
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
* http://www.slideshare.net/greggdonovan/solr-lucene-etsy-by-gregg-donovan/28
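
For reference, the schema.xml declaration looks roughly like this (names
and values are illustrative); the actual values live in an
external_<fieldname> file in the index directory, one "id=value" line per
document:

<fieldType name="fileFloat" keyField="id" defVal="0"
           class="solr.ExternalFileField" valType="pfloat"/>
<field name="popularity" type="fileFloat" indexed="false" stored="false"/>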


Re: Strange behavior

2011-06-16 Thread Alexey Serba
Have you stopped Solr before manually copying the data? That way you
can be sure that the index is the same and that you didn't have any new
docs in flight.

2011/6/14 Denis Kuzmenok :
> What  should  i provide, OS is the same, environment is the same, solr
> is  completely  copied,  searches  work,  except that one, and that is
> strange..
>
>> I think you will need to provide more information than this, no-one on this 
>> list is omniscient AFAIK.
>
>> François
>
>> On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:
>
>>> Hi.
>>>
>>> I've  debugged search on test machine, after copying to production server
>>> the  entire  directory  (entire solr directory), i've noticed that one
>>> query  (SDR  S70EE  K)  does  match  on  test  server, and does not on
>>> production.
>>> How can that be?
>>>
>
>
>
>
>


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-16 Thread Alexey Serba
> So a search for a product once the user logs in and searches for only the
> products that he has access to Will translate to something like this . ,the
> product ids are obtained form the db  for a particular user and can run
> into  n  number.
>
>  &fq=product_id(100 10001  ..n number)
>
> but we are currently running into too many Boolean expansion error .We are
> not able to tie the user also into roles as each user is mainly any one who
> comes to site and purchases a product .

I'm wondering if the new Solr join functionality on trunk can help here.

* http://wiki.apache.org/solr/Join

In theory you can index your products (product_id, ...) and the
user-product many-to-many relation (user_product_id, user_id) into a
single core or into different cores, and then do a join, like
q=search terms&fq={!join from=product_id to=user_product_id}user_id:10101

But I haven't tried that, so I'm just speculating.


Re: Complex situation

2011-06-16 Thread Alexey Serba
Am I right that you are only interested in results / facets for the
current season? If so, then you can index the start/end dates as separate
numeric fields and build your search filters like this:
"fq=+start_date_month:[* TO 6] +start_date_day:[* TO 17]
+end_date_month:[* TO 6] +end_date_day:[16 TO *]" where 6/16 is the
current month/day.

On Thu, Jun 16, 2011 at 5:20 PM, roySolr  wrote:
> Hello,
>
> First i will try to explain the situation:
>
> I have some companies with openinghours. Some companies has multiple seasons
> with different openinghours. I wil show some example data :
>
> Companyid          Startdate(d-m)  Enddate(d-m)     Openinghours_end
> 1                        01-01                01-04                 17:00
> 1                        01-04                01-08                 18:00
> 1                        01-08                31-12                 17:30
>
> 2                        01-01                31-12                 20:00
>
> 3                        01-01                01-06                 17:00
> 3                        01-06                31-12                 18:00
>
> What i want is some facets on the left site of my page. They have to look
> like this:
>
> Closing today on:
> 17:00(23)
> 18:00(2)
> 20:00(1)
>
> So i need to get the NOW to know which openinghours(seasons) i need in my
> facet results. How should my index look like?
> Can anybody helps me how i can save this data in the solr index?
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Complex-situation-tp3071936p3071936.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: URGENT HELP: Improving Solr indexing time

2011-06-13 Thread Alexey Serba
16276
...
> so I am doing a delta import of around 500,000 rows at a
> time.

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
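
The trick described there is to fold the delta condition into the
full-import query and drive it with request parameters, roughly:

<entity name="item" pk="id"
        query="select * from item
               where '${dataimporter.request.clean}' != 'false'
               or last_modified > '${dataimporter.last_index_time}'">

and then run /dataimport?command=full-import&clean=false for the
incremental runs.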


Re: Need query help

2011-06-06 Thread Alexey Serba
See "Tagging and excluding Filters" section

* 
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
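
A concrete example of the tag/exclude syntax for your case (field names
taken from your message):

q=*:*
&fq={!tag=br}brand_id:(100 OR 150)
&fq={!tag=flt}filters:(p1s100 OR p4s20)
&facet=true
&facet.field={!ex=br}brand_id
&facet.field={!ex=flt}filters
&facet.field=price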

2011/6/6 Denis Kuzmenok :
> For now i have a collection with:
> id (int)
> price (double) multivalue
> brand_id (int)
> filters (string) multivalue
>
> I  need  to  get available brand_id, filters, price values and list of
> id's   for   current   query.  For  example now i'm doing queries with
> facet.field=brand_id/filters/price:
> 1) to get current id's list: (brand_id:100 OR brand_id:150) AND 
> (filters:p1s100 OR filters:p4s20)
> 2) to get available filters on selected properties (same properties but
> another  values):  (brand_id:100 OR brand_id:150) AND (filters:p1s* OR
> filters:p4s*)
> 3) to get available brand_id (if any are selected, if none - take from
> 1st query results): (filters:p1s100 OR filters:p4s20)
> 4) another request to get available prices if any are selected
>
> Is there any way to simplify this task?
> Data needed:
> 1) Id's for selected filters, price, brand_id
> 2) Available filters, price, brand_id from selected values
> 3) Another values for selected properties (is any chosen)
> 4) Another brand_id for selected brand_id
> 5) Another price for selected price
>
> Will appreciate any help or thoughts!
>
> Cheers,
> Denis Kuzmenok
>
>


Re: Solr memory consumption

2011-06-02 Thread Alexey Serba
> Commits  are  divided  into  2  groups:
> - often but small (last changed
> info)
1) Make sure that commits are not too frequent and that you don't have an
overlapping-commit problem:
http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F

2) You may also try to limit cache sizes and check if it helps.

3) If that doesn't help, then try to monitor your app using jconsole:
* trigger the garbage collector and see if it frees some memory
* browse the Solr JMX attributes and see if there are any hints regarding
Solr cache usage, etc.

4) Try to run jmap -heap / jmap -histo and see if there are any hints there

5) If none of the above helps, then you probably need to examine your
memory usage with some kind of Java profiler (like the YourKit
profiler).


> Size: 4 databases about 1G (sum), 1 database (with n-gram) for 21G..
> I  don't  know any other way to search for product names except n-gram
> =\
Isn't a standard text field with solr.WordDelimiterFilterFactory and
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" during
indexing good enough (see the sketch below)? If you stay with n-grams, you
might want to limit the min and max n-gram sizes, just to reduce your index size.
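
A minimal sketch of such a field type (same parameters as above; adjust to
taste):

<fieldType name="text_products" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>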


Re: Documents update

2011-06-01 Thread Alexey Serba
> Will it be slow if there are 3-5 million key/value rows?
AFAIK it shouldn't affect search time significantly, as Solr caches it
in memory after you reload the Solr core / issue a commit.

But obviously you need more memory and commit/reload will take more time.


Re: Better Spellcheck

2011-06-01 Thread Alexey Serba
> I've tried to use a spellcheck dictionary built from my own content, but my
> content ends up having a lot of misspelled words so the spellcheck ends up
> being less than effective.
You can try to use sp.dictionary.threshold parameter to solve this problem
* http://wiki.apache.org/solr/SpellCheckerRequestHandler#sp.dictionary.threshold

> It also misses phrases. When someone
> searches for "Untied States" I would hope the spellcheck would suggest
> "United States" but it just recognizes that "untied" is a valid word and
> doesn't suggest any thing.
So you are talking about an auto-suggest component and not spellcheck,
right? These are two different use cases.

If you want auto suggest and you have some search logs for your system
then you can probably use the following solution:
* 
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

If you don't have a significant search log history and want to populate
your auto-suggest dictionary from the index or some text file, you should
check
* http://wiki.apache.org/solr/Suggester


Re: DIH render html entities

2011-06-01 Thread Alexey Serba
Maybe HTMLStripTransformer is what you are looking for.

* http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer
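
Usage is roughly the following (column name is just an example):

<entity name="item" transformer="HTMLStripTransformer"
        query="select id, description from items">
  <field column="description" stripHTML="true"/>
</entity>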

On Tue, May 31, 2011 at 5:35 PM, Erick Erickson  wrote:
> Convert them to what? Individual fields in your docs? Text?
>
> If the former, you might get some joy from the XpathEntityProcessor.
> If you want to just strip the markup and index all the content you
> might get some joy from the various *html* analyzers listed here:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Best
> Erick
>
> On Fri, May 27, 2011 at 5:19 AM, anass talby  wrote:
>> Sorry my question was not clear.
>> when I get data from database, some field contains some html special chars,
>> and what i want to do is just convert them automatically.
>>
>> On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty  wrote:
>>
>>> On Fri, May 27, 2011 at 3:50 PM, anass talby 
>>> wrote:
>>> > Is there any way to render html entities in DIH for a specific field?
>>> [...]
>>>
>>> This does not make too much sense: What do you mean by
>>> "rendering HTML entities". DIH just indexes, so where would
>>> it render HTML to, even if it could?
>>>
>>> Please take a look at http://wiki.apache.org/solr/UsingMailingLists
>>>
>>> Regards,
>>> Gora
>>>
>>
>>
>>
>> --
>>       Anass
>>
>


Re: Solr memory consumption

2011-06-01 Thread Alexey Serba
Hey Denis,

* How big is your index in terms of number of documents and index size?
* Is it production system where you have many search requests?
* Is there any pattern for OOM errors? I.e. right after you start your
Solr app, after some search activity or specific Solr queries, etc?
* What are 1) cache settings 2) facets and sort-by fields 3) commit
frequency and warmup queries?
etc

Generally you might want to connect to your jvm using jconsole tool
and monitor your heap usage (and other JVM/Solr numbers)

* http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
* http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX

HTH,
Alexey

2011/6/1 Denis Kuzmenok :
> There  were  no  parameters  at  all,  and java hitted "out of memory"
> almost  every day, then i tried to add parameters but nothing changed.
> Xms/Xmx  -  did  not solve the problem too. Now i try the MaxPermSize,
> because it's the last thing i didn't try yet :(
>
>
> Wednesday, June 1, 2011, 9:00:56 PM, you wrote:
>
>> Could be related to your crazy high MaxPermSize like Marcus said.
>
>> I'm no JVM tuning expert either. Few people are, it's confusing. So if
>> you don't understand it either, why are you trying to throw in very
>> non-standard parameters you don't understand?  Just start with whatever
>> the Solr example jetty has, and only change things if you have a reason
>> to (that you understand).
>
>> On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:
>>> Overall  memory on server is 24G, and 24G of swap, mostly all the time
>>> swap  is  free and is not used at all, that's why "no free swap" sound
>>> strange to me..
>
>
>
>
>


Re: Indexing 20M documents from MySQL with DIH

2011-05-05 Thread Alexey Serba
{quote}
...
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was
unexpectedly lost.
   at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
   at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
   ... 22 more
Apr 21, 2011 3:53:28 AM
org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure

The last packet successfully received from the server was 128
milliseconds ago.  The last packet sent successfully to the server was
25,273,484 milliseconds ago.
...
{quote}

This could well be because of autocommit / segment merging. You
could try to disable autocommit and/or increase mergeFactor (a sketch below).
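
As a sketch, in solrconfig.xml that would mean something like the
following (exact sections depend on your Solr version):

<!-- in the index settings section: fewer, larger merges -->
<mergeFactor>25</mergeFactor>

<!-- in the updateHandler section: keep autoCommit disabled/commented out
     while the bulk import runs -->
<!--
<autoCommit>
  <maxDocs>10000</maxDocs>
</autoCommit>
-->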

{quote}
I've used sphinx in the past, which uses multiple queries to pull out
a subset of records ranged based on PrimaryKey, does Solr offer
functionality similar to this? It seems that once a Solr index gets to
a certain size, the indexing of a batch takes longer than MySQL's
net_write_timeout, so it kills the connection.
{quote}

I was thinking about some hackish solution to paginate results

  
  

Or something along those lines (you'd need to calculate the offset in the
pages query).

But unfortunately MySQL does not provide generate_series function
(it's postgres function and there'r similar solutions for oracle and
mssql).


On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow  wrote:
> Thank you everyone for your help. I ended up getting the index to work
> using the exact same config file on a (substantially) larger instance.
>
> On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson  
> wrote:
>> {{{A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?}}}
>>
>> Is it a common practice? Like all of this, "it depends". It's certainly
>> easier to let DIH do the work. Sometimes DIH doesn't have all the
>> capabilities necessary. Or as Chris said, in the case where you already
>> have a system built up and it's easier to just grab the output from
>> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
>> are just more comfortable with their own code...
>>
>> "Do you try not to fully rebuild". It depends on how painful a full rebuild
>> is. Some people just like the simplicity of starting over every 
>> day/week/month.
>> But you *have* to be able to rebuild your index in case of disaster, and
>> a periodic full rebuild certainly keeps that process up to date.
>>
>> "Is there a difference...delta inserts...updates...applied over months". Not
>> if you do an optimize. When a document is deleted (or updated), it's only
>> marked as deleted. The associated data is still in the index. Optimize will
>> reclaim that space and compact the segments, perhaps down to one.
>> But there's no real operational difference between a newly-rebuilt index
>> and one that's been optimized. If you don't delete/update, there's not
>> much reason to optimize either
>>
>> I'll leave the DIH to others..
>>
>> Best
>> Erick
>>
>> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow  wrote:
>>> Thanks for the e-mail. I probably should have provided more details,
>>> but I was more interested in making sure I was approaching the problem
>>> correctly (using DIH, with one big SELECT statement for millions of
>>> rows) instead of solving this specific problem. Here's a partial
>>> stacktrace from this specific problem:
>>>
>>> ...
>>> Caused by: java.io.EOFException: Can not read response from server.
>>> Expected to read 4 bytes, read 0 bytes before connection was
>>> unexpectedly lost.
>>>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>>        ... 22 more
>>> Apr 21, 2011 3:53:28 AM
>>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>>> SEVERE: getNext() failed for query 'REDACTED'
>>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>> Communications link failure
>>>
>>> The last packet successfully received from the server was 128
>>> milliseconds ago.  The last packet sent successfully to the server was
>>> 25,273,484 milliseconds ago.
>>> ...
>>>
>>>
>>> A custom indexer, so that's a fairly common practice? So when you are
>>> dealing with these large indexes, do you try not to fully rebuild them
>>> when you can? It's not a nightly thing, but something to do in case of
>>> a disaster? Is there a difference in the performance

Re: Solr performance issue

2011-03-22 Thread Alexey Serba
> Btw, I am monitoring output via jconsole with 8gb of ram and it still goes
> to 8gb every 20 seconds or so,
> gc runs, falls down to 1gb.

Hmm, the JVM eating 8GB every 20 seconds sounds like a lot.

Do you return all results (ids) for your queries? Any tricky
faceting/sorting/function queries?


Re: Dataimport performance

2010-12-19 Thread Alexey Serba
> With subquery and with left join:   320k in 6 Min 30
That's about 820 records per second, which is _really_ impressive considering
the fact that DIH performs a separate SQL query for every record in your
case.

>> So there's one track entity with an artist sub-entity. My (admittedly
>> rather limited) experience has been that sub-entities, where you have
>> to run a separate query for every row in the parent entity, really
>> slow down data import.
Sub-entities slow down data import indeed. You can try to avoid a
separate query for every row by using CachedSqlEntityProcessor (see the
sketch below). There are a couple of options: 1) you can load all sub-entity
data into memory, or 2) you can reduce the number of SQL queries by caching
sub-entity data per id. There's no silver bullet and each option has its own
pros and cons.

Also Ephraim proposed a really neat solution with GROUP_CONCAT, but
I'm not sure that all RDBMS-es support that.


2010/12/15 Robert Gründler :
> i've benchmarked the import already with 500k records, one time without the 
> artists subquery, and one time without the join in the main query:
>
>
> Without subquery: 500k in 3 min 30 sec
>
> Without join and without subquery: 500k in 2 min 30.
>
> With subquery and with left join:   320k in 6 Min 30
>
>
> so the joins / subqueries are definitely a bottleneck.
>
> How exactly did you implement the custom data import?
>
> In our case, we need to de-normalize the relations of the sql data for the 
> index,
> so i fear i can't really get rid of the join / subquery.
>
>
> -robert
>
>
>
>
>
> On Dec 15, 2010, at 15:43 , Tim Heckman wrote:
>
>> 2010/12/15 Robert Gründler :
>>> The data-config.xml looks like this (only 1 entity):
>>>
>>>      
>>>        
>>>        
>>>        
>>>        
>>>        
>>>        >> name="sf_unique_id"/>
>>>
>>>        
>>>          
>>>        
>>>
>>>      
>>
>> So there's one track entity with an artist sub-entity. My (admittedly
>> rather limited) experience has been that sub-entities, where you have
>> to run a separate query for every row in the parent entity, really
>> slow down data import. For my own purposes, I wrote a custom data
>> import using SolrJ to improve the performance (from 3 hours to 10
>> minutes).
>>
>> Just as a test, how long does it take if you comment out the artists entity?
>
>


Re: Custom scoring for searhing geographic objects

2010-12-19 Thread Alexey Serba
Hi Pavel,

I had a similar problem several years ago - I had to find
geographical locations in textual descriptions, geocode these objects
to lat/long during the indexing process, and allow users to filter/sort
search results to specific geographical areas. The important issue was
that there were several types of geographical objects - street < town
< region < country. The idea was to geocode to the most narrow
geographical area possible. The relevance logic in this case could be
specified as "find the most narrow result that is uniquely identified by
your text or search query". So I came up with a custom algorithm that
was quite good in terms of performance and precision/recall. Here's a
short description:
* Intersect all text/search-query terms with a locations dictionary to
keep only the geo terms.
* Search your locations Lucene index and filter only street objects
(the most narrow areas). Due to the tf*idf formula you'll get the most
relevant results first. Then you need to post-process the top N (3/5/10)
results and verify that they are matches indeed. I intersected the search
terms with each result's terms and ran another Lucene search to verify
whether those terms uniquely identify the match; if they do, return the
matching street. If there's no match, proceed with the same algorithm
over towns, regions, and countries.

HTH,
Alexey

On Wed, Dec 15, 2010 at 6:28 PM, Pavel Minchenkov  wrote:
> Hi,
> Please give me advise how to create custom scoring. I need to result that
> documents were in order, depending on how popular each term in the document
> (popular = how many times it appears in the index) and length of the
> document (less terms - higher in search results).
>
> For example, index contains following data:
>
> ID    | SEARCH_FIELD
> --
> 1     | Russia
> 2     | Russia, Moscow
> 3     | Russia, Volgograd
> 4     | Russia, Ivanovo
> 5     | Russia, Ivanovo, Altayskaya street 45
> 6     | Russia, Moscow, Kremlin
> 7     | Russia, Moscow, Altayskaya street
> 8     | Russia, Moscow, Altayskaya street 15
> 9     | Russia, Moscow, Altayskaya street 15/26
>
>
> And I should get next results:
>
>
> Query                     | Document result set
> --
> Russia                    | 1,2,4,3,6,7,8,9,5
> Moscow                  | 2,6,7,8,9
> Ivanovo                    | 4,5
> Altayskaya              | 7,8,9,5
>
> In fact --- it is a search for geographic objects (cities, streets, houses).
> At the same time can be given only part of the address, and the results
> should appear the most relevant results.
>
> Thanks.
> --
> Pavel Minchenkov
>


Re: Newbie: Indexing unrelated MySQL tables

2010-12-14 Thread Alexey Serba
> I figured I would create three entities and relevant
> schema.xml entries in this way:
>
> dataimport.xml:
> 
> 
> 
That's correct. You can list several entities under the document element.
You can index them separately using the entity parameter (i.e. add
entity=Users to your full-import HTTP request). Do not forget to add
clean=false so you won't delete previously indexed documents. Or you
can index all entities in one request (the default).
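
For example (host and core name are placeholders):

http://localhost:8983/solr/dataimport?command=full-import&entity=Users&clean=false&commit=true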

> schema.xml:
> 
> 
> 
> 
> 
> 
> 
> 
> 
Why do you use string type for textual fields (description, company,
name, firstname, lastname, etc)? Is it intentional to use these fields
in filtering/faceting?

You can also add a "default" searchable multivalued field (type=text)
and copyField instructions to copy all textual content into this
field ( http://wiki.apache.org/solr/SchemaXml#Copy_Fields ). That way you
will be able to search the "default" field for terms from all fields
(firstname, lastname, name, description, company, position, location,
etc).

You would probably also want to add a field like type=user/artwork/job. You
will be able to facet/filter on that field and provide a better user search
experience.

> This obviously does not work as I want. I only get results from the "users"
> table, and I cannot get results from neither "artwork" nor "jobs".
Are you sure that this is because the indexing isn't working? How do
you search for your data? What query parser do you use (standard/dismax), etc.?

> I have
> found out that the possible solution is in putting  tags in the
>  tag and somehow aliasing column names for Solr, but the logic
> behind this is completely alien to me and the blind tests I tried did not
> yield anything.
You don't need to list your fields explicitly in fields declaration.

BTW, what database do you use? Oracle has an issue with upper-casing
column names that could be a problem.

> My logic says that the "id" field is getting replaced by the
> "id" field of other entities and indexes are being overwritten.
Are your ids unique across different objects? I.e. is there any job
with the same id as a user? If so, then you would probably want to prefix
your ids, like:
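
A sketch of what that prefixing could look like with the
TemplateTransformer (illustrative, not an exact snippet):

<entity name="Users" transformer="TemplateTransformer"
        query="select * from users">
  <field column="solr_id" template="user-${Users.id}" name="id"/>
</entity>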





> But if I
> aliased all "id" fields in all entities into something else, such as
> "user_id" and "job_id", I couldn't figure what to put in the 
> configuration in schema.xml because I have three different id fields from
> three different tables that are all primary keyed in the database!
You can still create separate id fields if you need to search for
different objects by id and don't mess with prefixed ids. But it's not
required.

HTH,
Alexey


Re: my index has 500 million docs ,how to improve so lr search performance?

2010-12-14 Thread Alexey Serba
How much memory do you allocate per JVM? Considering you have 10 JVMs
per server (10*N), you might not have enough memory left for the OS file
system cache (you need to keep some memory free for that).

> all indexs size is about 100G
Is this per server or the total size?


On Mon, Nov 15, 2010 at 8:35 AM, lu.rongbin  wrote:
>
> In addition,my index has only two store fields, id and price, and other
> fields are index. I increase the document and query cache. the ec2
> m2.4xLarge instance is 8 cores, 68G memery. all indexs size is about 100G.
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/my-index-has-500-million-docs-how-to-improve-solr-search-performance-tp1902595p1902869.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Syncing 'delta-import' with 'select' query

2010-12-14 Thread Alexey Serba
What Solr version do you use?

It seems that the synchronous flag has been added to the 3.1 and 4.0 (trunk)
branches but not to 1.4:
https://issues.apache.org/jira/browse/SOLR-1721

On Wed, Dec 8, 2010 at 11:21 PM, Juan Manuel Alvarez  wrote:
> Hello everyone!
> I have been doing some tests, but it seems I can't make the
> synchronize flag work.
>
> I have made two tests:
> 1) DIH with commit=false
> 2) DIH with commit=false + commit via Solr XML update protocol
>
> And here are the log results:
> For (1) the command is
> "/solr/dataimport?command=delta-import&commit=false&synchronous=true"
> and the first part of the output is:
>
> Dec 8, 2010 4:42:51 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
> QTime=0
> Dec 8, 2010 4:42:51 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport
> params={schema=testproject&dbHost=127.0.0.1&dbPassword=fuz10n!&dbName=fzm&commit=false&dbUser=fzm&command=delta-import&projectId=1&synchronous=true&dbPort=5432}
> status=0 QTime=4
> Dec 8, 2010 4:42:51 PM org.apache.solr.handler.dataimport.DataImporter
> doDeltaImport
> INFO: Starting Delta Import
> Dec 8, 2010 4:42:51 PM org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> Dec 8, 2010 4:42:51 PM org.apache.solr.handler.dataimport.DocBuilder doDelta
> INFO: Starting delta collection.
> Dec 8, 2010 4:42:51 PM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
>
>
> For (2) the commands are
> "/solr/dataimport?command=delta-import&commit=false&synchronous=true"
> and "/solr/update?commit=true&waitFlush=true&waitSearcher=true" and
> the first part of the output is:
>
> Dec 8, 2010 4:22:50 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
> QTime=0
> Dec 8, 2010 4:22:50 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport
> params={schema=testproject&dbHost=127.0.0.1&dbPassword=fuz10n!&dbName=fzm&commit=false&dbUser=fzm&command=delta-import&projectId=1&synchronous=true&dbPort=5432}
> status=0 QTime=1
> Dec 8, 2010 4:22:50 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0 
> QTime=0
> Dec 8, 2010 4:22:50 PM org.apache.solr.handler.dataimport.DataImporter
> doDeltaImport
> INFO: Starting Delta Import
> Dec 8, 2010 4:22:50 PM org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> Dec 8, 2010 4:22:50 PM org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start 
> commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
>
> In (2) it seems like the commit is being fired before the delta-update 
> finishes.
>
> Am I using the "synchronous" flag right?
>
> Thanks in advance!
> Juan M.
>
> On Mon, Dec 6, 2010 at 6:46 PM, Juan Manuel Alvarez  
> wrote:
>> Thanks for all the help! It is really appreciated.
>>
>> For now, I can afford the parallel requests problem, but when I put
>> synchronous=true in the delta import, the call still returns with
>> outdated items.
>> Examining the log, it seems that the commit operation is being
>> executed after the operation returns, even when I am using
>> commit=true.
>> Is it possible to also execute the commit synchronously?
>>
>> Cheers!
>> Juan M.
>>
>> On Mon, Dec 6, 2010 at 4:29 PM, Alexey Serba  wrote:
>>>> When you say "two parallel requests from two users to single DIH
>>>> request handler", what do you mean by "request handler"?
>>> I mean DIH.
>>>
>>>> Are you
>>>> refering to the HTTP request? Would that mean that if I make the
>>>> request from different HTTP sessions it would work?
>>> No.
>>>
>>> It means that when you have two users that simultaneously changed two
>>> objects in the UI then you have two HTTP requests to DIH to pull
>>> changes from the db into Solr index. If the second request comes when
>>> the first is not fully processed then the second request will be
>>> rejected. As a result your index would be outdated (w/o the latest
>>> update) until the next update.
>>>
>>
>


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Alexey Serba
> When you say "two parallel requests from two users to single DIH
> request handler", what do you mean by "request handler"?
I mean DIH.

> Are you
> refering to the HTTP request? Would that mean that if I make the
> request from different HTTP sessions it would work?
No.

It means that when you have two users that simultaneously changed two
objects in the UI then you have two HTTP requests to DIH to pull
changes from the db into Solr index. If the second request comes when
the first is not fully processed then the second request will be
rejected. As a result your index would be outdated (w/o the latest
update) until the next update.


Re: DIH - rdbms to index confusion

2010-12-06 Thread Alexey Serba
> I have a table that contains the data values I'm wanting to return when
> someone makes a search.  This table has, in addition to the data values, 3
> id's (FKs) pointing to the data/info that I'm wanting the users to be able
> to search on (while also returning the data values).
>
> The general rdbms query would be something like:
> select f.value, g.gar_name, c.cat_name from foo f, gar g, cat c, dub d
> where g.id=f.gar_id
> and c.id=f.cat_id
> and d.id=f.dub_id
>
You can put this general rdbms query as is into a single DIH entity -
no need to split it.

You would only want to split it if your main table had a one-to-many
relation with other tables, so that you couldn't retrieve all the data
as a single result set row per Solr document.
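
For example, something like this in data-config.xml (the Solr field
names are just examples - adjust them to your schema):

<entity name="foo"
        query="select f.value, g.gar_name, c.cat_name
               from foo f, gar g, cat c, dub d
               where g.id=f.gar_id and c.id=f.cat_id and d.id=f.dub_id">
  <field column="value" name="value"/>
  <field column="gar_name" name="gar_name"/>
  <field column="cat_name" name="cat_name"/>
</entity>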


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Alexey Serba
Hey Juan,

It seems that DataImportHandler is not the right tool for your
scenario and you'd better use the Solr XML update protocol.
* http://wiki.apache.org/solr/UpdateXmlMessages

You can still work around your outdated GUI view problem by calling
DIH synchronously, i.e. adding synchronous=true to your request. But
it won't solve the problem of two parallel requests from two users to
a single DIH request handler, because DIH doesn't support that: if the
previous request is still running, it rejects the second request.

HTH,
Alex



On Fri, Dec 3, 2010 at 10:33 PM, Juan Manuel Alvarez  wrote:
> Hello everyone! I would like to ask you a question about DIH.
>
> I am using a database and DIH to sync against Solr, and a GUI to
> display and operate on the items retrieved from Solr.
> When I change the state of an item through the GUI, the following happens:
> a. The item is updated in the DB.
> b. A delta-import command is fired to sync the DB with Solr.
> c. The GUI is refreshed by making a query to Solr.
>
> My problem comes between (b) and (c). The delta-import operation is
> executed in a new thread, so my call returns immediately, refreshing
> the GUI before the Solr index is updated causing the item state in the
> GUI to be outdated.
>
> I had two ideas so far:
> 1. Querying the status of the DIH after the delta-import operation and
> do not return until it is "idle". The problem I see with this is that
> if other users execute delta-imports, the status will be "busy" until
> all operations are finished.
> 2. Use Zoie. The first problem is that configuring it is not as
> straightforward as it seems, so I don't want to spend more time trying
> it until I am sure that this will solve my issue. On the other hand, I
> think that I may suffer the same problem since the delta-import is
> still firing in another thread, so I can't be sure it will be called
> fast enough.
>
> Am I pointing on the right direction or is there another way to
> achieve my goal?
>
> Thanks in advance!
> Juan M.
>


Re: dataimports response returns before done?

2010-12-06 Thread Alexey Serba
> After issueing a dataimport, I've noticed solr returns a response prior to 
> finishing the import. Is this correct?   Is there anyway i can make solr not 
> return until it finishes?
Yes, you can add synchronous=true to your request. But be aware that
it could take a long time and you may hit an HTTP timeout exception.

> If not, how do I ping for the status whether it finished or not?
See command=status
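
For example, you can poll

/solr/dataimport?command=status

and check the status element in the response - it reports "busy" while
an import is running and "idle" once it has finished.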


On Fri, Dec 3, 2010 at 8:55 PM, Tri Nguyen  wrote:
> Hi,
>
> After issueing a dataimport, I've noticed solr returns a response prior to 
> finishing the import. Is this correct?   Is there anyway i can make solr not 
> return until it finishes?
>
> If not, how do I ping for the status whether it finished or not?
>
> thanks,
>
> tri


Re: Query performance very slow even after autowarming

2010-12-06 Thread Alexey Serba
* Do you use EdgeNGramFilter in the index analyzer only, or on the
query side as well?

* What if you create an additional field first_letter (string) and put
the first character/characters (multivalued?) there in your external
processing code. Then during search you can filter all documents that
start with the letter "a" with a filter query like fq=first_letter:a
(see the sketch below this list). Would that solve your performance
problems?

* It would also help to describe what you are actually trying to
achieve - then more people can probably help you with that.
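
A rough sketch of the first_letter idea (the field name and type are
just examples, not tested against your schema):

schema.xml:
<field name="first_letter" type="string" indexed="true" stored="false"/>

index time (your external code): for a document titled "main street"
you would set first_letter=m.

query time:
/select/?q=*:*&fq=first_letter:m&rows=20

Filter queries are cached, so repeated queries for the same letter
should be cheap.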

On Fri, Dec 3, 2010 at 10:47 AM, johnnyisrael  wrote:
>
> Hi,
>
> I am using edgeNgramFilterfactory on SOLR 1.4.1 [<filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1" />]
> for my indexing.
>
> Each document will have about 5 fields in it and only one field is indexed
> with EdgeNGramFilterFactory.
>
> I have about 1.4 million documents in my index now and my index size is
> approx 296MB.
>
> I made the field that is indexed with EdgeNGramFilterFactory as default
> search field. All my query responses are very slow, some of them taking more
> than 10seconds to respond.
>
> All my query responses are very slow, Queries with single letters are still
> very slow.
>
> /select/?q=m
>
> So I tried query warming as follows.
>
> 
>      
>        a
>        b
>        c
>        d
>        e
>        f
>        g
>        h
>        i
>        j
>        k
>        l
>        m
>        n
>        o
>        p
>        q
>        r
>        s
>        t
>        u
>        v
>        w
>        x
>        y
>        z
>      
> 
>
> The same above is done for firstSearcher as well.
>
> My cache settings are as follows.
>
>       class="solr.LRUCache"
>      size="16384"
>      initialSize="4096"
> autowarmCount="4096"/>
>
>       class="solr.LRUCache"
>      size="16384"
>      initialSize="4096"
> autowarmCount="1024"/>
>
>       class="solr.LRUCache"
>      size="16384"
>      initialSize="16384"
> />
>
> Still after query warming, few single character search is taking up to 3
> seconds to respond.
>
> Am i doing anything wrong in my cache setting or autowarm setting or am i
> missing anything here?
>
> Thanks,
>
> Johnny
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Query-performance-very-slow-even-after-autowarming-tp2010384p2010384.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: DIH delta, deltaQuery

2010-11-26 Thread Alexey Serba
Are you sure that it's the deltaQuery that's taking a minute? It only
retrieves the ids of updated records, and then deltaImportQuery is
executed N times, once per returned id. You might want to try the
following technique - http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
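
Roughly (column names taken from your snippet, not tested):

<entity name="sessions"
        query="SELECT id, ... FROM sessions
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR modified > '${dataimporter.last_index_time}'">

and then run it as a full-import with clean=false:

/solr/dataimport?command=full-import&clean=false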

On Wed, Nov 24, 2010 at 3:06 PM, stockii  wrote:
>
> Hello.
>
> i wonder why this deltaQuery takes over a minute:
>
> deltaQuery="SELECT id FROM sessions
>                WHERE created BETWEEN DATE_ADD( NOW(), INTERVAL - 1 HOUR ) AND 
> NOW()
>                OR modified BETWEEN '${dataimporter.sessions 
> .last_index_time}' AND
> DATE_ADD( NOW(), INTERVAL - 1 HOUR  ) "
>
> the database have only 700 Entries and the compare with modified takes so
> long !!? when i remove the modified compare its fast.
>
> when i put this query in my mysql database the query need 0.0014 seconds
> ... wha is it so slow?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DIH-delta-deltaQuery-tp1960246p1960246.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Basic Solr Configurations and best practice

2010-11-26 Thread Alexey Serba
> 1-      How to combine data from DIH and content extracted from file system
> document into one document in the index?
http://wiki.apache.org/solr/TikaEntityProcessor
You can have one SQL entity that retrieves the metadata from the
database and a nested entity that parses the binary file into
additional fields of the same document.
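
A rough sketch (the data source names, column names and the path field
are just assumptions):

<dataSource name="db" type="JdbcDataSource" ... />
<dataSource name="bin" type="BinFileDataSource"/>

<entity name="doc" dataSource="db"
        query="select id, title, file_path, permissions from documents">
  <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
          url="${doc.file_path}" format="text">
    <field column="text" name="content"/>
  </entity>
</entity>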

> 2-      Should I move the per-user permissions into a separate index? What
> technique to implement?
I would start with keeping permissions in the same index as the actual content.


On Tue, Nov 23, 2010 at 11:35 AM, Darx Oman  wrote:
> Hi guys
>
> I'm kind of new to solr and I'm wondering how to configure solr to best
> fulfills my requirements.
>
> Requirements are as follow:
>
> I have 2 data sources: database and file system documents. Every document in
> the file system has related information stored in the database.  Both the
> file content and the related database fields must be indexed.  Along with
> the DB data is per-user permissions for every document.  I'm using DIH for
> the DB and Tika for the file System.  The documents contents nearly never
> change, while the DB data especially the permissions changes very
> frequently. Total number of documents roughly around 2M and each document is
> about 500KB.
>
> 1-      How to combine data from DIH and content extracted from file system
> document into one document in the index?
>
> 2-      Should I move the per-user permissions into a separate index? What
> technique to implement?
>


Re: using DIH with mets/alto file sets

2010-11-26 Thread Alexey Serba
> The idea is to create a full text index of the alto content, accompanied by 
> the author/title info from the mets file for purposes of results display.

- Then you need to list only alto files in your landscapes entity
(fileName="^ID.{3}-ALTO\d{3}.xml$" or something like that), because
you don't want to index every mets file as a separate solr document,
right?

- Also it seems you might want to add a regex transformer that
extracts the ID from the alto file name
   

- And finally add nested entity to process mets file for every alto record

"
  
"
and extract mets elements/attributes and index them as a separate fields.

P.S. I haven't tried similar scenario, so just speculating
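
Something like the following, purely as a sketch (the regex, the mets
file path and the field names are guesses based on your directory
layout):

<entity name="landscapes" processor="FileListEntityProcessor"
        fileName="^ID.{3}-ALTO\d{3}\.xml$" recursive="true"
        baseDir="/home/utlol/htdocs/lib-landscapes-new/publications/"
        transformer="RegexTransformer">

  <field column="bookId" sourceColName="file"
         regex="^(ID.{3})-ALTO\d{3}\.xml$"/>

  <entity name="alto" processor="XPathEntityProcessor"
          url="${landscapes.fileAbsolutePath}" forEach="/alto" ...>
    ...
  </entity>

  <entity name="mets" processor="XPathEntityProcessor"
          url="${landscapes.fileDir}/../${landscapes.bookId}-mets.xml"
          forEach="/mets">
    <field column="title"
           xpath="/mets/dmdSec/mdWrap/xmlData/mods/titleInfo/title"/>
    <field column="author"
           xpath="/mets/dmdSec/mdWrap/xmlData/mods/name/namePart"/>
  </entity>
</entity>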

On Fri, Nov 19, 2010 at 12:09 AM, Fred Gilmore  wrote:
> mets/alto is an xml standard for describing physical objects.  In this case,
> we're describing books.  The mets file holds the metadata (author, title,
> etc.), the alto file is the physical description (words on the page,
> formatting of the page).  So it's a one (mets) to many (alto) relationship.
>
> the directory structure:
>
> /our/collection/IDxxx/:
>
> IDxxx-mets.xml
> ALTO/
>
> /our/collection/IDxxx/ALTO/:
>
> IDxxx-ALTO001.xml
> IDxxx-ALTO002.xml
>
> ie. an xml file per scanned book page.
>
> Beyond the ID number as part of the file names, the mets file contains no
> reference to the alto children.  The alto children do contain a reference to
> the jpg page scan, which is labelled with the ID number as part of the name.
>
> The idea is to create a full text index of the alto content, accompanied by
> the author/title info from the mets file for purposes of results display.
>  The first try with this is attempting a recursive FileDataSource approach.
>
> It was relatively easy to create a "content" field which holds the text of
> the page (each word is actually an attribute of a separate tag), but I'm
> having difficulty determining how I'm going to conditionally add the author
> and title data from the METS file to the rows created with the ALTO content
> field.  It'll involve regex'ing out the ID number associated with both the
> mets and alto filenames for starters, but even at that, I don't see how to
> keep it straight since it's not one mets=one alto and it's also not a static
> string for the entire index.
>
> thanks for any hints you can provide.
>
> Fred
> University of Texas at Austin
> ==
> data-config.xml thus far:
>
> 
> 
> 
>  processor="FileListEntityProcessor" fileName=".xml$" recursive="true"
> baseDir="/home/utlol/htdocs/lib-landscapes-new/publications/">
>  stream="true"
> pk="filename"
> url="${landscapes.fileAbsolutePath}"
> processor="XPathEntityProcessor"
> forEach="/mets | /alto"
> transformer="TemplateTransformer,RegexTransformer,LogTransformer"
> logTemplate=" processing ${landscapes.fileAbsolutePath}"
> logLevel="info"
>>
>
> 
> 
> 
>
>
>  xpath="/mets/dmdSec/mdWrap/xmlData/mods/titleInfo/title" />
> 
>  xpath="/alto/Description/sourceImageInformation/fileName" />
>  xpath="/alto/Layout/Page/PrintSpace/TextBlock/TextLine/String/@CONTENT" />
> 
> 
> 
> 
> ==
> METS example:
>
> 
> http://www.w3.org/2001/XMLSchema-instance";
> xmlns="http://www.loc.gov/METS/";
> xsi:schemaLocation="http://www.loc.gov/METS/
> http://schema.ccs-gmbh.com/docworks/version20/mets-docworks.xsd";
> xmlns:MODS="http://www.loc.gov/mods/v3"; xmlns:mix="http://www.loc.gov/mix/";
> xmlns:xlink="http://www.w3.org/1999/xlink"; TYPE="METAe_Monograph"
> LABEL="ENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE- Kingsville
> Area">
> 
> 
> CCS docWORKS/METAe Version 6.3-0
> docWORKS-ID: 1677
> 
> 
> 
> 
> 
> 
> 
> ENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE-
> Kingsville Area
> 
> 
> L F. Brown, Jr., J. H. McGowen, T. J. Evans, C.
> G.
> Groat
> 
> aut
> 
> 
> 
> W. L.
> Fisher
> 
> aut
> 
> 
>
> 
> ALTO example:
>
> 
> http://www.w3.org/2001/XMLSchema-instance";
> xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-1.xsd";
> xmlns:xlink="http://www.w3.org/TR/xlink";>
> 
> mm10
> 
> /Docworks/IN/GeologyBooks/txu-oclc-6917337/txu-oclc-6917337-009.jpg
> 
> 
> 
> 
> CCS Content Conversion Specialists GmbH,
> Germany
> CCS docWORKS
> 6.3-0.93
> 
> 
> 
> 
> ABBYY (BIT Software), Russia
> FineReader
> 7.0
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  HEIGHT="2345"/>
>  HEIGHT="314"/>
>  HEIGHT="2345">
>  STYLEREFS="TXT_0 PAR_CENTER">
> 
>  CONTENT="Preface" WC="0.98" CC="000"/>
> 
>
>
>
>


Re: Searching with wrong keyboard layout or using translit

2010-10-31 Thread Alexey Serba
Another approach to this problem is to use a separate Solr core for
storing user queries for the autocomplete functionality ( see
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
) and index not only the user_query field, but also transliterated and
wrong-layout versions, and use the dismax query parser to search
suggestions in all of these fields.

This solution is only viable if you have a huge log of user queries
(which I believe Google does).
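
For example (the field names here are just placeholders):

/solr/suggest/select?defType=dismax&q=vjcrdf
    &qf=user_query^10+user_query_translit+user_query_wrong_layout

where user_query_translit and user_query_wrong_layout are filled at
index time from the original query string.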

HTH,
Alex



2010/10/29 Alexander Kanarsky :
> Pavel,
>
> it depends on size of your documents corpus, complexity and types of
> the queries you plan to use etc. I would recommend you to search for
> the discussions on synonyms expansion in Lucene (index time vs. query
> time tradeoffs etc.) since your problem is quite similar to that
> (think Moskva vs. Moskwa). Unless you have a small corpus, I would go
> with the second approach and expand the terms during the query time.
> However, the first approach might be useful, too: say, you may want to
> boost the score for the documents that naturally contain the word
> 'Moskva', so such a documents will be at the top of the result list.
> Having both forms indexed will allow you to achieve this easily by
> utilizing Solr's dismax query (to boost the results from the field
> with the original terms):
> http://localhost:8983/solr/select/?q=Moskva&defType=dismax&qf=text^10.0+text_translit^0.1
> ('text' field has the original Cyrillic tokens, 'text_translit' is for
> transliterated ones)
>
> -Alexander
>
>
> 2010/10/28 Pavel Minchenkov :
>> Alexander,
>>
>> Thanks,
>> What variat has better performance?
>>
>>
>> 2010/10/28 Alexander Kanarsky 
>>
>>> Pavel,
>>>
>>> I think there is no single way to implement this. Some ideas that
>>> might be helpful:
>>>
>>> 1. Consider adding additional terms while indexing. This assumes
>>> conversion of Russian text to both "translit" and "wrong keyboard"
>>> forms and index converted terms along with original terms (i.e. your
>>> Analyzer/Filter should produce Moskva and Vjcrdf for term Москва). You
>>> may re-use the same field (if you plan for a simple term queries) or
>>> create a separate fields for the generated terms (better for phrase,
>>> proximity queries etc. since it keeps the original text positional
>>> info). Then the query could use any of these forms to fetch the
>>> document. If you use separate fields, you'll need to expand/create
>>> your query to search for them, of course.
>>> 2. If you have to index just an original Russian text, you might
>>> generate all term forms while analyzing the query, then you could
>>> treat the converted terms as a synonyms and use the combination of
>>> TermQuery for all term forms or the MultiPhraseQuery for the phrases.
>>> For Solr in this case you probably will need to add a custom filter
>>> similar to SynonymFilter.
>>>
>>> Hope this helps,
>>> -Alexander
>>>
>>> On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov 
>>> wrote:
>>> > Hi,
>>> >
>>> > When I'm trying to search Google with wrong keyboard layout -- it
>>> corrects
>>> > my query, example: http://www.google.ru/search?q=vjcrdf (I typed word
>>> > "Moscow" in Russian but in English keyboard layout).
>>> > Also, when I'm searching using
>>> > translit, It does the same: http://www.google.ru/search?q=moskva
>>> >
>>> > What is the right way to implement this feature in Solr?
>>> >
>>> > --
>>> > Pavel Minchenkov
>>> >
>>>
>>
>>
>>
>> --
>> Pavel Minchenkov
>>
>


Re: problem on running fullimport

2010-10-24 Thread Alexey Serba
" Caused by: java.sql.SQLException: Illegal value for setFetchSize(). "

Try to add batchSize="-1" to your data source declaration

http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
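
Based on your declaration it would be something like (the driver class
is assumed to be the MySQL one):

<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql:///xxx"
            user="xxx" password="xx" batchSize="-1"/>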

On Fri, Oct 15, 2010 at 3:42 PM, swapnil dubey  wrote:
> Hi,
>
> I am using the full import option with the data-config file as mentioned
> below
>
> 
>    url="jdbc:mysql:///xxx" user="xxx" password="xx"  />
>    
>            
>            
>            
>    
> 
>
>
> on running the full-import option I am getting the error mentioned below.I
> had already included the dataimport.properties file in my conf file.help me
> to get the issue resolved
>
> 
> -
> 
> 0
> 334
> 
> -
> 
> -
> 
> data-config.xml
> 
> 
> full-import
> debug
> 
> -
> 
> -
> 
> -
> 
> select studentName from test1
> -
> 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
> execute query: select studentName from test1 Processing Document # 1
>    at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>    at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:253)
>    at
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
>    at
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
>    at
> org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:184)
>    at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
>    at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
>    at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>    at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
>    at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
>    at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
>    at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
>    at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
>    at
> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:203)
>    at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>    at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>    at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>    at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>    at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>    at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>    at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>    at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>    at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>    at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>    at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>    at org.mortbay.jetty.Server.handle(Server.java:285)
>    at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>    at
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>    at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>    at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: java.sql.SQLException: Illegal value for setFetchSize().
>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075)
>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:989)
>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:984)
>    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:929)
>    at com.mysql.jdbc.StatementImpl.setFetchSize(StatementImpl.java:2496)
>    at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:242)
>    ... 33 more
> 
> 0:0:0.50
> 
> 
> 
> idle
> Configuration Re-loaded sucessfully
> -
> 
> 0:0:0.299
> 1
> 0
> 0
> 0
> 2010-10-15 16:42:21
> Indexing failed. Rolled back all changes.
> 2010-10-15 16:42:21
> 
> -
> 
> This response forma

Re: DataImportHandler dynamic fields clarification

2010-10-13 Thread Alexey Serba
Harry, could you please file a JIRA issue for this and I'll address it
in a patch. I fixed a related issue (SOLR-2102) and I think it's
pretty similar.

> Interesting, I was under the impression that case does not matter.
>
> From http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config :
> "It is possible to totally avoid the field entries in entities if the names
> of the fields are same (case does not matter) as those in Solr schema"
>
Yeah, case does not matter only for explicit mappings of SQL columns
to Solr fields. The reason is that DIH populates the hash map used for
case-insensitive matching only for explicit mappings.

You can also work around the upper-case column names in Oracle by
aliasing the columns with quoted lower-case names in the SQL query,
e.g. (using the column names from your example):
=
data-config.xml

<entity name="wide_table"
        query='select column_1 as "column_1", column_100 as "column_100" from wide_table'>

schema.xml

<dynamicField name="column_*" type="string" indexed="true" stored="true" multiValued="true"/>
=

HTH,
Alexey


On Thu, Sep 30, 2010 at 9:10 PM, harrysmith  wrote:
>
>>
>>Two things, one are your DB column uppercase as this would effect the out.
>>
>>
>
> Interesting, I was under the impression that case does not matter.
>
> From http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config :
> "It is possible to totally avoid the field entries in entities if the names
> of the fields are same (case does not matter) as those in Solr schema"
>
> I confirmed that matching the schema.xml field case to the database table is
> needed for dynamic fields, and the wiki statement above is incorrect, or at
> the very least confusing, possibly a bug.
>
> My database is Oracle 10g and the column names have been created in all
> uppercase in the database.
>
> In Oracle:
> Table name: wide_table
> Column names: COLUMN_1 ... COLUMN_100 (yes, uppercase)
>
> Please see following scenarios and results I found:
>
> data-config.xml
> 
> 
> 
>
> schema.xml
>  multiValued="true" />
>
> Result:
> Nothing Imported
>
> =
>
> data-config.xml
> 
> 
> 
>
> schema.xml
>  multiValued="true" />
>
> Result:
> Note query column names changed to uppercase.
> Nothing Imported
>
> =
>
>
> data-config.xml
> 
> 
> 
>
> schema.xml
>  multiValued="true" />
>
> Result:
> Note ONLY the field entry was changed to caps
>
> All records imported, with only COLUMN_100 id field.
>
> 
>
> data-config.xml
> 
> 
> 
>
> schema.xml
>  multiValued="true" />
>
> Result:
> Note BOTH the field entry was changed to caps in data-config.xml, and the
> dynamicField wildcard in schema.xml
>
> All records imported, with all fields specified. This is the behavior
> desired.
>
> =
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>>
>>Second what does your db-data-config.xml look like
>>
>>
>
> The relevant data-config.xml is as follows:
>
> 
> 
>  
> 
> 
>
> Ideally, I would rather have the query be 'select * from wide_table" with
> the fields being dynamically matched by the column name from the
> dynamicField wildcard from the schema.xml.
>
> 
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DataImportHandler-dynamic-fields-clarification-tp1606159p1609578.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Help need in setting up delta imports

2010-09-24 Thread Alexey Serba
Your example doesn't mention deleting Employee. Is this a valid use case?

If not then you can simplify things:

query="SELECT name, address from employee where endtimestamp is null"
deltaQuery= "SELECT DISTINCT name FROM  employee eventtimestamp  >
'${dataimporter.last_index_time}' "
deltaImportQuery="SELECT name, address from employee where
endtimestamp is null and name='${deltaimport.delta.name}'" >

And yes, you need to declare the Solr field "name" as the uniqueKey in schema.xml.
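
Putting it all together, the entity could look roughly like this (not
tested against your schema):

<entity name="employee" pk="name"
        query="SELECT name, address FROM employee WHERE endtimestamp IS NULL"
        deltaQuery="SELECT DISTINCT name FROM employee WHERE eventtimestamp > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT name, address FROM employee WHERE endtimestamp IS NULL AND name='${dataimporter.delta.name}'">
  <field column="name" name="name"/>
  <field column="address" name="address"/>
</entity>

with <uniqueKey>name</uniqueKey> in schema.xml, so an updated row
replaces the old document instead of adding a second one.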


On Tue, Sep 21, 2010 at 11:18 PM, Papiya Das. Misra  wrote:
> We have our data in a datawarehouse where any changes are made by adding 
> another row and marking the previous row as old my the means of a timestamp.
>
> So, for instance, the name of the table is Employee and has the following 
> structure.
>
> Name|Address|eventtimestamp|endtimestamp
> John|NYC| 2010-09-21 12:10:11.164638|null
>
> -          Primary key is Name
>
> -          When you update, say the address, the database entries become
> Name|Address|eventtimestamp|endtimestamp
> John|NYC| 2010-09-21 13:10:11.164638|2010-09-21 13:10:11.164638
> John|CHG| 2010-09-21 13:10:11.164638|null
>
> Here is my db config -
>
> 
>     holdability="HOLD_CURSORS_OVER_COMMIT" />
>    
>                        query="SELECT name, address from employee where endtimestamp is 
> null"
>    deletedPkQuery="SELECT DISTINCT name FROM  employee
>                         eventtimestamp  > '${dataimporter.last_index_time}' "
>    deltaQuery= "SELECT DISTINCT name FROM  employee
>                         eventtimestamp  > '${dataimporter.last_index_time}' "
>    deltaImportQuery="SELECT name, address from employee where endtimestamp is 
> null and name='${deltaimport}'" >
>   
>
>    
> 
>
>
> When I do delta import, I end up with two rows for the same employee. Any 
> ideas or experiences regarding implementation of delta import are welcome too.
>
>
> Thanks
> Papiya
>
>
>
> 
> Pink OTC Markets Inc. provides the leading inter-dealer quotation and trading 
> system in the over-the-counter (OTC) securities market. We create innovative 
> technology and data solutions to efficiently connect market participants, 
> improve price discovery, increase issuer disclosure, and better inform 
> investors. Our marketplace, comprised of the issuer-listed OTCQX and 
> broker-quoted Pink Sheets, is the third largest U.S. equity trading venue for 
> company shares.
>
> This document contains confidential information of Pink OTC Markets and is 
> only intended for the recipient. Do not copy, reproduce (electronically or 
> otherwise), or disclose without the prior written consent of Pink OTC 
> Markets. If you receive this message in error, please destroy all copies in 
> your possession (electronically or otherwise) and contact the sender above.
>


Re: Delta Import with something other than Date

2010-09-10 Thread Alexey Serba
> Can you provide a sample of passing the parameter via URL? And how using it 
> would look in the data-config.xml
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
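
In short, something like this (the parameter name is up to you):

/solr/dataimport?command=delta-import&lastId=12345

and then in data-config.xml:

<entity name="item"
        deltaQuery="SELECT id FROM item WHERE id > '${dataimporter.request.lastId}'"
        ...>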


Re: Solr is indexing jdbc properties

2010-09-06 Thread Alexey Serba
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

Try to add convertType attribute to dataSource declaration, i.e.
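Based on your declaration it would be something like:

<dataSource type="JdbcDataSource"
            name="mssqlDatasource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://{db.host}:1433/{db};instance=SQLEXPRESS"
            user="{username}" password="{password}"
            convertType="true"/>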
 

HTH,
Alex

On Mon, Sep 6, 2010 at 5:49 PM, savvas.andreas
 wrote:
>
> Hello,
>
> I am trying to index some data stored in an SQL Server database through DIH.
> My setup in data-config.xml is the following:
>
> 
>                            name="mssqlDatasource"
>              driver="net.sourceforge.jtds.jdbc.Driver"
>
> url="jdbc:jtds:sqlserver://{db.host}:1433/{db};instance=SQLEXPRESS"
>              user="{username}"
>              password="{password}"/>
>  
>                            dataSource="mssqlDatasource"
>            query="select id,
>                        title
>                        from WORK">
>                        
>                        
>    
>  
> 
>
> However, when I run the indexer (invoking
> http://127.0.0.1:8983/solr/admin/dataimport.jsp?handler=/dataimport) I get
> all the rows in my index but with incorrect data indexed.
>
> More specifically, by examining the top 10 terms for the title field I get:
>
> term    frequency
> impl    1241371
> jdbc    1241371
> net     1241371
> sourceforg      1241371
> jtds    1241371
> clob    1241371
> netsourceforgejtdsjdbcclobimpl  1186981
> c       185070
> a       179901
> e       160759
>
> which is clearly wrong..Does anybody know why Solr is indexing the jdbc
> properties instead of the actual data?
>
> Any pointers would be much appreciated.
>
> Thank you very much.
> -- Savvas
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-is-indexing-jdbc-properties-tp1426473p1426473.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Data Import Handler Query

2010-08-12 Thread Alexey Serba
Try to define the image Solr fields <-> DB columns mapping explicitly
in the "image" entity, i.e.







See 
http://www.lucidimagination.com/search/document/c8f2ed065ee75651/dih_and_multivariable_fields_problems

On Thu, Aug 12, 2010 at 2:30 AM, Manali Joshi  wrote:
> I tried making the schema fields that get the image data to
> multiValued="true". But it still gets only the first image data. It doesn't
> have information about all the images.
>
>
>
>
> On Wed, Aug 11, 2010 at 1:15 PM, kenf_nc  wrote:
>
>>
>> It may not be the data config. Do you have the fields in the schema.xml
>> that
>> the image data is going to set to be multiValued="true"?
>>
>> Although, I would think the last image would be stored, not the first, but
>> haven't really tested this.
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>


Re: DIH and multivariable fields problems

2010-08-10 Thread Alexey Serba
> Have others successfully imported dynamic multivalued fields in a
> child entity using the DataImportHandler via the child entity returning
> multiple records through a RDBMS?
Yes, it's working ok with static fields.

I didn't even know that it's possible to use variables in field names
( "dynamic" names ) in DIH configuration. This use case is quite
unusual.

> This is increasingly more looking like a bug. To recap, I am trying to use
> the DIH to import multivalued dynamic fields and using a variable to name
> that field.
I'm not an expert in the DIH source code, but it seems there's special
processing of "dynamic" fields that prevents handling the field type
(and the multivalued attribute). Specifically, there's a conditional
jump ("continue") over the field type detection code in case of a
"dynamic" field name ( see DataImporter:initEntity ). I guess the
reason for this behavior is that you can't determine the field type
based on a dynamic field name ("${variable}_s") at that time
(configuration parsing). I'm wondering if it's possible to determine
field types at runtime (when the actual field name "title_s" is
resolved).

I encountered a similar problem with implicit sql_column <-> solr_field
mapping using SqlEntityProcessor, i.e. when you select some columns
and do not explicitly list all of them as field entries in your
configuration. In this case field type detection doesn't work either.
I think that moving the type detection into runtime would solve that
problem as well. Am I missing something obvious that prevents us from
doing field type detection at runtime?

Alex

On Tue, Aug 10, 2010 at 4:20 AM, harrysmith  wrote:
>
> This is increasingly more looking like a bug. To recap, I am trying to use
> the DIH to import multivalued dynamic fields and using a variable to name
> that field.
>
> Upon further testing, the multivalued import works fine with a
> static/constant name, but only keeps the first record when naming the field
> dynamically. See below for relevant snips.
>
> From schema.xml :
>  multiValued="true" />
>
> From data-config.xml :
>
> 
> 
> 
> 
> 
>
> 
> Produces the following, note that there are 3 records that should be
> returned and are correctly done, with the field name being a constant.
>
> - 
> - 
>  9892962
> - 
>  record 1
>  record 2
>  record 3
>  Polygraph Newsletter Title
>  
> - 
>  Polygraph Newsletter Title
>  
>  
>  
>
> ===
>
> Now, changing the field name to a variable..., note only the first record is
> retained for the 'Relation_s' field -- there should be 3 records.
>
> 
> becomes
> 
>
> produces the following:
> - 
> - 
> - 
>  record 1
>  
> - 
>  Polygraph Newsletter Title
>  
>  9892962
> - 
>  Polygraph Newsletter Title
>  
>  
>  
>
> Only the first record is retained. There was also another post (which
> recieved no replies) in the archive that reported the same issue. The DIH
> debug logs do show 3 records correctly being returned, so somehow these are
> not getting added.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1065244.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Implementing lookups while importing data

2010-08-10 Thread Alexey Serba
> We are currently doing this via a JOIN on the numeric
> field, between the main data table and the lookup table, but this
> dramatically slows down indexing.
I believe a SQL JOIN is the fastest and easiest way in your case (in
comparison with a nested entity, even using CachedSqlEntityProcessor).
You probably don't have proper indexes in your database - check the
SQL query plan.
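
For example (table and column names here are hypothetical):

-- index the lookup key used in the JOIN
CREATE INDEX idx_lookup_code ON lookup_table (code);

-- and inspect the plan of the import query
EXPLAIN SELECT ... FROM main_table m JOIN lookup_table l ON l.code = m.code;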


Re: DIH: Rows fetch OK, Total Documents Failed??

2010-08-10 Thread Alexey Serba
Do you have any required fields or uniqueKey in your schema.xml? Do
you provide values for all these fields?

AFAIU you don't need the commonField attribute for the id and title
fields. I don't think that's your problem, but anyway...


On Sat, Jul 31, 2010 at 11:29 AM,   wrote:
>
>  Hi,
>
> I'm a bit lost with this, i'm trying to import a new XML via DIH, all row are 
> fetched but no ducument are indexed? I don't find any log or error?
>
> Any ideas?
>
> Here is the STATUS:
>
>
> status
> idle
> 
> 
> 1
> 7554
> 0
> 2010-07-31 10:14:33
> 0
> 7554
> 0:0:4.720
> 
>
>
> My xml file looks like this:
>
> 
> 
>    
>        Moniteur VG1930wm 19 LCD Viewsonic
>        
> http://x.com/abc?a(12073231)p(2822679)prod(89042332277)ttid(5)url(http%3A%2F%2Fwww.ffdsssd.com%2Fproductinformation%2F%7E66297%7E%2Fproduct.htm%26sender%3D2003)
>        Moniteur VG1930wm 19  LCD Viewsonic VG1930WM
>        247.57
>        Ecrans
>     etc...
>
> and my dataconfig:
>
> 
>        
>        
>                                        url="file:///home/john/Desktop/src.xml"
>                        processor="XPathEntityProcessor"
>                        forEach="/products/product"
>                        transformer="DateFormatTransformer">
>
>                          commonField="true" />
>                         xpath="/products/product/title" commonField="true" />
>                         xpath="/products/product/category" />
>                         xpath="/products/product/content" />
>                         xpath="/products/product/price" />
>
>                
>        
> 
>
>
>
>
>


Re: Performance issues when querying on large documents

2010-07-23 Thread Alexey Serba
Do you use highlighting? ( http://wiki.apache.org/solr/HighlightingParameters )

Try to disable it and compare performance.
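
For example, just add hl=false to your request (the fq here is only an
illustration of how you might filter the PDF extracts):

/select/?q=word&fq=type:pdf_extract&rows=100&hl=false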

On Fri, Jul 23, 2010 at 10:52 PM, ahammad  wrote:
>
> Hello,
>
> I have an index with lots of different types of documents. One of those
> types basically contains extracts of PDF docs. Some of those PDFs can have
> 1000+ pages, so there would be a lot of stuff to search through.
>
> I am experiencing really terrible performance when querying. My whole index
> has about 270k documents, but less than 1000 of those are the PDF extracts.
> The slow querying occurs when I search only on those PDF extracts (by
> specifying filters), and return 100 results. The 100 results definitely adds
> to the issue, but even cutting that down can be slow.
>
> Is there a way to improve querying with such large results? To give an idea,
> querying for a single word can take a little over a minute, which isn't
> really viable for an application that revolves around searching. For now, I
> have limited the results to 20, which makes the query execute in roughly
> 10-15 seconds. However, I would like to have the option of returning 100
> results.
>
> Thanks a lot.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-issues-when-querying-on-large-documents-tp990590p990590.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: 2 solr dataImport requests on a single core at the same time

2010-07-23 Thread Alexey Serba
> having multiple Request Handlers will not degrade the performance
IMO you shouldn't worry unless you have hundreds of them


Re: commit is taking very very long time

2010-07-23 Thread Alexey Serba
> I am not sure why some commits take very long time.
Hmm... Because it merges index segments... How large is your index?

> Also is there a way to reduce the time it takes?
You can disable the commit in the DIH call and use autoCommit instead.
It's kind of a hack because you postpone the commit operation and make
it asynchronous.

Another option is to set optimize=false in the DIH call ( it's true by
default ). You can also try to increase the mergeFactor parameter, but
that can affect search performance.
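
For example (the autoCommit thresholds are just an illustration):

/solr/dataimport?command=full-import&commit=false&optimize=false

and in solrconfig.xml, inside <updateHandler>:

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime>
</autoCommit>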


Re: 2 solr dataImport requests on a single core at the same time

2010-07-22 Thread Alexey Serba
DataImportHandler does not support parallel execution of several
requests. You should either send your requests sequentially or
register several DIH handlers in solrconfig and use them in parallel.
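
For example, in solrconfig.xml (the handler names and config file names
are up to you):

<requestHandler name="/dataimport-a" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults"><str name="config">data-config-a.xml</str></lst>
</requestHandler>

<requestHandler name="/dataimport-b" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults"><str name="config">data-config-b.xml</str></lst>
</requestHandler>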


On Thu, Jul 22, 2010 at 11:20 AM, kishan  wrote:
>
> please help me
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/2-solr-dataImport-requests-on-a-single-core-at-the-same-time-tp978649p986351.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Adding new elements to index

2010-07-07 Thread Alexey Serba
1) Shouldn't you put your "entity" elements under the "document" tag, i.e.

<dataConfig>
  <dataSource ... />
  <dataSource ... />
  <document>
    <entity name="hidrants" ...>
      ...
    </entity>
    <entity name="carrers" ...>
      ...
    </entity>
  </document>
</dataConfig>

2) What happens if you try to run full-import with explicitly
specified "entity" GET parameter?
command=full-import&entity=carrers
command=full-import&entity=hidrants


On Wed, Jul 7, 2010 at 11:15 AM, Xavier Rodriguez  wrote:
> Thanks for the quick reply!
>
> In fact it was a typo, the 200 rows I got were from postgres. I tried to say
> that the full-import was omitting the 100 oracle rows.
>
> When I run the full import, I run it as a single job, using the url
> command=full-import. I've tried to clear the index both using the clean
> command and manually deleting it, but when I run the full-import, the number
> of indexed documents are the documents coming from postgres.
>
> To be sure that the id field is unique, i get the id by assigning a letter
> before the id value. When indexed, the id looks like s_123, and that's the
> id 123 for an entity identified as "s". Other entities use different
> prefixes, but never "s".
>
> I used DIH to index the data. My configuration is the folllowing:
>
> File db-data-config.xml
>
>          type="JdbcDataSource"
>        name="ds_ora"
>        driver="oracle.jdbc.OracleDriver"
>        url="jdbc:oracle:thin:@xxx.xxx.xxx.xxx:1521:SID"
>        user="user"
>        password="password"
>    />
>
>          type="JdbcDataSource"
>        name="ds_pg"
>        driver="org.postgresql.Driver"
>        url="jdbc:postgresql://xxx.xxx.xxx.yyy:5432/sid"
>        user="user"
>        password="password"
>    />
>
> 
>            
>            
> 
>
>
> 
>            
>            
>  
>
> --
>
> In that configuration, all the fields coming from ds_pg are indexed, and the
> fields coming from ds_ora are not indexed. As I've said, the strange
> behaviour for me is that no error is logged in tomcat, the number of
> documents created is the number of rows returned by "hidrants", while the
> number of rows returned is the sum of the rows from "hidrants" and
> "carrers".
>
> Thanks in advance.
>
> Xavi.
>
>
>
>
>
>
>
> On 7 July 2010 02:46, Erick Erickson  wrote:
>
>> first do you have a unique key defined in your schema.xml? If you
>> do, some of those 300 rows could be replacing earlier rows.
>>
>> You say: " if I have 200
>> rows indexed from postgres and 100 rows from Oracle, the full-import
>> process
>> only indexes 200 documents from oracle, although it shows clearly that the
>> query retruned 300 rows."
>>
>> Which really looks like a typo, if you have 100 rows from Oracle how
>> did you get 200 rows from Oracle?
>>
>> Are you perhaps doing this in two different jobs and deleting the
>> first import before running the second?
>>
>> And if this is irrelevant, could you provide more details like how you're
>> indexing things (I'm assuming DIH, but you don't state that anywhere).
>> If it *is* DIH, providing that configuration would help.
>>
>> Best
>> Erick
>>
>> On Tue, Jul 6, 2010 at 11:19 AM, Xavier Rodriguez 
>> wrote:
>>
>> > Hi,
>> >
>> > I have a SOLR installed on a Tomcat application server. This solr
>> instance
>> > has some data indexed from a postgres database. Now I need to add some
>> > entities from an Oracle database. When I run the full-import command, the
>> > documents indexed are only documents from postgres. In fact, if I have
>> 200
>> > rows indexed from postgres and 100 rows from Oracle, the full-import
>> > process
>> > only indexes 200 documents from oracle, although it shows clearly that
>> the
>> > query retruned 300 rows.
>> >
>> > I'm not doing a delta-import, simply a full import. I've tried to clean
>> the
>> > index, reload the configuration, and manually remove
>> dataimport.properties
>> > because it's the only metadata i found.  Is there any other file to check
>> > or
>> > modify just to get all 300 rows indexed?
>> >
>> > Of course, I tried to find one of that oracle fields, with no results.
>> >
>> > Thanks a lot,
>> >
>> > Xavier Rodriguez.
>> >
>>
>


Re: solr data config questions

2010-06-29 Thread Alexey Serba
It's weird. I tried and it works for me.

1) Try to add convertType="true" to JdbcDataSource definition
See 
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

2) Try to apply the cast operation to the whole result, i.e.
cast(concat(replytable.comment_id, ',', replytable.SID) as char) as commentreply

HTH,
Alex


On Tue, Jun 29, 2010 at 10:08 PM, Peng, Wei  wrote:
> I tried query="select concat(cast(replytable.comment_id as char), ',', 
> cast(replytable.SID as char)) as commentreply from commenttable right join 
> replytable on replytable.comment_id=commenttable.comment_id where 
> commenttable.story_id='${story.story_id}'" too, but I still got strange 
> characters
> "commentreply":["[...@66e23a","[...@8e5225","[...@1b308c1","[...@103f345"],
>
> I use the same query on mysql database, it returns right results.
>
> Can someone answer me this ?
>
> Many Thanks
>
> Vivian
>
> -Original Message-
> From: Alexey Serba [mailto:ase...@gmail.com]
> Sent: Monday, June 28, 2010 4:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr data config questions
>
> Hi,
>
> You can add additional commentreplyjoin entity to story entity, i.e.
>
>  ...
>            ...
>        
>            ...
>        
>    
>
>    
>        
>    
> 
>
> Thus, you will have multivalued field commentreply that contains list
> of related "comment_id, reply_id" ("comment_id," if you don't have any
> related replies for this entry) pairs. You can retrieve all values of
> that field and process on a client and build complex data structure.
>
> HTH,
> Alex
>
> On Mon, Jun 28, 2010 at 8:19 PM, Peng, Wei  wrote:
>> Hi All,
>>
>>
>>
>> I am a new user of Solr.
>>
>> We are now trying to enable searching on Digg dataset.
>>
>> It has story_id as the primary key and comment_id are the comment id
>> which commented story_id, so story_id and comment_id is one-to-many
>> relationship.
>>
>> These comment_ids can be replied by some repliers, so comment_id and
>> repliers are one-to-many relationship.
>>
>>
>>
>> The problem is that within a single returned document the search results
>> shows an array of comment_ids and an array of repliers without knowing
>> which repliers replied which comment.
>>
>> For example: now we got comment_id:[c1,c,2...,cn],
>> repliers:[r1,r2,r3rm]. Can we get something like
>> comment_id:[c1,c,2...,cn], repliers:[{r1,r2},{},r3{rm-1,rm}] so that
>> {r1,r2} is corresponding to c1?
>>
>>
>>
>> Our current data-config is attached:
>>
>> 
>>
>>    > autoreconnect="true" netTimeoutForStreamingResults="1200"
>> url="jdbc:mysql://localhost/diggdataset" batchSize="-1" user="root"
>> password=" "/>
>>
>>    
>>
>>            >
>>                  deltaImportQuery="select * from story where
>> ID=='${dataimporter.delta.story_id}'"
>>
>>                  deltaQuery="select story_id from story where
>> last_modified > '${dataimporter.last_index_time}'">
>>
>>
>>
>>            
>>
>>            
>>
>>            
>>
>>            
>>
>>            
>>
>>            
>>
>>            
>>
>>            
>>
>>            
>>
>>
>>
>>            >
>>                    query="select * from dugg_list where
>> story_id='${story.story_id}'"
>>
>>                    deltaQuery="select SID from dugg_list where
>> last_modified > '${dataimporter.last_index_time}'"
>>
>>                    parentDeltaQuery="select story_id from story where
>> story_id=${dugg_list.story_id}">
>>
>>                  
>>
>>            
>>
>>
>>
>>            >
>>                    query="select * from commenttable where
>> story_id='${story.story_id}'"
>>
>>                    deltaQuery="select SID from commenttable where
>> last_modified > '${dataimporter.last_index_time}'"
>>
>>                    parentDeltaQuery="select story_id from story where
>> story_id=${commenttable.story_id}">
>>
>>                  
>>
>>                  
>>
>>                  
>>
>>                  
>>
>>                  
>>
>>                  > column="timestamp" />
>>
>>
>>
>>
>>
>>            >
>>                    query="select * from replytable where
>> comment_id='${commenttable.comment_id}'"
>>
>>                    deltaQuery="select SID from replytable where
>> last_modified > '${dataimporter.last_index_time}'"
>>
>>                    parentDeltaQuery="select comment_id from
>> commenttable where comment_id=${replytable.comment_id}">
>>
>>                  
>>
>>                  
>>
>>                  
>>
>>                  
>>
>>                  
>>
>>            
>>
>>
>>
>>            
>>
>>            
>>
>>    
>>
>> 
>>
>>
>>
>> Please help me on this.
>>
>> Many thanks
>>
>>
>>
>> Vivian
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: DIH and denormalizing

2010-06-28 Thread Alexey Serba
> It seems that ${ncdat.feature} is not being set.
Try ${dataTable.feature} instead.


On Tue, Jun 29, 2010 at 1:22 AM, Shawn Heisey  wrote:
> I am trying to do some denormalizing with DIH from a MySQL source.  Here's
> part of my data-config.xml:
>
>       query="SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE did
> > ${dataimporter.request.minDid} AND did <=
> ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards})
> IN (${dataimporter.request.modVal})">
>         query="SELECT webtable as wt FROM ncdat_wt WHERE
> featurecode='${ncdat.feature}'">
> 
> 
>
> The relationship between features in ncdat and webtable in ncdat_wt (via
> featurecode) will be many-many.  The "wt" field in schema.xml is set up as
> multivalued.
>
> It seems that ${ncdat.feature} is not being set.  I saw a query happening on
> the server and it was "SELECT webtable as wt FROM ncdat_wt WHERE
> featurecode=''" - that last part is an empty string with single quotes
> around it.  From what I can tell, there are no entries in ncdat where
> feature is blank.  I've tried this with both a 1.5-dev checked out months
> ago (which we are using in production) and a 3.1-dev checked out today.
>
> Am I doing something wrong?
>
> Thanks,
> Shawn
>
>


Re: solr data config questions

2010-06-28 Thread Alexey Serba
Hi,

You can add an additional commentreplyjoin entity to the story entity, i.e.


...








Thus, you will have a multivalued field commentreply that contains a
list of related "comment_id, reply_id" pairs ("comment_id," if there
are no related replies for this entry). You can retrieve all values of
that field, process them on the client and build the complex data
structure there.
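
A sketch of the commentreplyjoin entity (the query is adapted from your
follow-up and not tested):

<entity name="commentreplyjoin"
        query="select concat(c.comment_id, ',', ifnull(r.SID, '')) as commentreply
               from commenttable c left join replytable r on r.comment_id = c.comment_id
               where c.story_id = '${story.story_id}'">
  <field column="commentreply" name="commentreply"/>
</entity>

with commentreply declared as a multiValued string field in schema.xml.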

HTH,
Alex

On Mon, Jun 28, 2010 at 8:19 PM, Peng, Wei  wrote:
> Hi All,
>
>
>
> I am a new user of Solr.
>
> We are now trying to enable searching on Digg dataset.
>
> It has story_id as the primary key and comment_id are the comment id
> which commented story_id, so story_id and comment_id is one-to-many
> relationship.
>
> These comment_ids can be replied by some repliers, so comment_id and
> repliers are one-to-many relationship.
>
>
>
> The problem is that within a single returned document the search results
> shows an array of comment_ids and an array of repliers without knowing
> which repliers replied which comment.
>
> For example: now we got comment_id:[c1,c,2...,cn],
> repliers:[r1,r2,r3rm]. Can we get something like
> comment_id:[c1,c,2...,cn], repliers:[{r1,r2},{},r3{rm-1,rm}] so that
> {r1,r2} is corresponding to c1?
>
>
>
> Our current data-config is attached:
>
> 
>
>     autoreconnect="true" netTimeoutForStreamingResults="1200"
> url="jdbc:mysql://localhost/diggdataset" batchSize="-1" user="root"
> password=" "/>
>
>    
>
>            
>                  deltaImportQuery="select * from story where
> ID=='${dataimporter.delta.story_id}'"
>
>                  deltaQuery="select story_id from story where
> last_modified > '${dataimporter.last_index_time}'">
>
>
>
>            
>
>            
>
>            
>
>            
>
>            
>
>            
>
>            
>
>            
>
>            
>
>
>
>            
>                    query="select * from dugg_list where
> story_id='${story.story_id}'"
>
>                    deltaQuery="select SID from dugg_list where
> last_modified > '${dataimporter.last_index_time}'"
>
>                    parentDeltaQuery="select story_id from story where
> story_id=${dugg_list.story_id}">
>
>                  
>
>            
>
>
>
>            
>                    query="select * from commenttable where
> story_id='${story.story_id}'"
>
>                    deltaQuery="select SID from commenttable where
> last_modified > '${dataimporter.last_index_time}'"
>
>                    parentDeltaQuery="select story_id from story where
> story_id=${commenttable.story_id}">
>
>                  
>
>                  
>
>                  
>
>                  
>
>                  
>
>                   column="timestamp" />
>
>
>
>
>
>            
>                    query="select * from replytable where
> comment_id='${commenttable.comment_id}'"
>
>                    deltaQuery="select SID from replytable where
> last_modified > '${dataimporter.last_index_time}'"
>
>                    parentDeltaQuery="select comment_id from
> commenttable where comment_id=${replytable.comment_id}">
>
>                  
>
>                  
>
>                  
>
>                  
>
>                  
>
>            
>
>
>
>            
>
>            
>
>    
>
> 
>
>
>
> Please help me on this.
>
> Many thanks
>
>
>
> Vivian
>
>
>
>
>
>
>
>


Re: Data Import Handler Rich Format Documents

2010-06-28 Thread Alexey Serba
> Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm using
> Solr Version: 1.4.0 and getting the following error:
>
> java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
> org.apache.solr.handler.dataimport.BinURLDataSource
It seems that DIH-Tika integration is not part of the Solr 1.4.0/1.4.1
releases. You should use trunk / nightly builds.
https://issues.apache.org/jira/browse/SOLR-1583

> My data-config.xml looks like this:
>
> 
>      driver="oracle.jdbc.driver.OracleDriver"
>    url="jdbc:oracle:thin:@whatever:12345:whatever"
>    user="me"
>    name="ds-db"
>    password="secret"/>
>
>      name="ds-url"/>
>
>  
>         dataSource="ds-db"
>     query="select * from my_database where rownum <=2">
>      
>      
>      
>      
>      
>      
>      
>    
>
>         dataSource="ds-url"
>     query="select CONTENT_URL from my_database where
> content_id='${my_database.CONTENT_ID}'">
>           dataSource="ds-url"
>      format="text">
>      url="http://www.mysite.com/${my_database.content_url}";
>      
>     
>    
>
>  
> 
>
> I added the entity name="my_database_url" section to an existing (working)
> database entity to be able to have Tika index the content pointed to by the
> content_url.
>
> Is there anything obviously wrong with what I've tried so far?

I think you should move the Tika entity into the my_database entity and
simplify the whole configuration; see the sketch below.
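Roughly like this ( just a sketch, not a tested config; the Tika entity
attributes and the target field name "content" are assumptions ):

<entity name="my_database" dataSource="ds-db" query="select * from my_database">
    <!-- your existing field mappings ... -->
    <entity name="my_database_url" processor="TikaEntityProcessor"
            dataSource="ds-url" format="text"
            url="http://www.mysite.com/${my_database.content_url}">
        <!-- "text" is the column TikaEntityProcessor produces for format="text" -->
        <field column="text" name="content" />
    </entity>
</entity>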


Re: dataimport.properties is not updated on delta-import

2010-06-25 Thread Alexey Serba
Please note that Oracle ( or the Oracle JDBC driver ) converts column
names to upper case even though you state them in lower case. If this
is the case then try to rewrite your query in the following form:
select id as "id", name as "name" from table

On Thursday, June 24, 2010, warb  wrote:
>
> Hello again!
>
> Upon further investigation it seems that something is amiss with
> delta-import after all, the delta-import does not actually import anything
> (I thought it did when I ran it previously but I am not sure that was the
> case any longer.) It does complete successfully as seen from the front-end
> (dataimport?command=delta-import). Also in the logs it is stated the the
> import was successful (INFO: Delta Import completed successfully), but there
> are exception pertaining to some documents.
>
> The exception message is that the id field is missing
> (org.apache.solr.common.SolrException: Document [null] missing required
> field: id). Now, I have checked the column names in the table, the
> data-config.xml file and the schema.xml file and they all have the
> column/field names written in lowercase and are even named exactly the same.
>
> Do Solr rollback delta-imports if one or more of the documents failed?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/dataimport-properties-is-not-updated-on-delta-import-tp916753p919609.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Data Import Handler Rich Format Documents

2010-06-21 Thread Alexey Serba
You are right. It seems TikaEntityProcessor is exactly the tool you
need in this case.

Alex

On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter
 wrote:
> : I think you can use existing ExtractingRequestHandler to do the job,
> : i.e. add child entity to your DIH metadata
>
> why would you do this instead of using the TikaEntityProcessor as i
> already suggested in my earlier mail?
>
>
>
> -Hoss
>
>


Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Alexey Serba
I think you can use the existing ExtractingRequestHandler to do the
job, i.e. add a child entity to your DIH metadata config that calls
/update/extract for every URL; see the sketch below.
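Something along these lines ( a rough, untested sketch; the data source
type, processor, forEach and xpath values are assumptions ):

<dataSource name="solr" type="URLDataSource" />

<entity name="metadata" query="select url from metadata">
    <entity name="extract" processor="XPathEntityProcessor" dataSource="solr"
            forEach="/response"
            url="http://localhost:8983/solr/update/extract?extractOnly=true&amp;wt=xml&amp;indent=on&amp;stream.url=${metadata.url}">
        <!-- pick the xpath that holds the extracted text in the extractOnly response -->
        <field column="content" xpath="/response/str" />
    </entity>
</entity>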
That's not a working example, just the basic idea; you still need to
URI-escape the ${metadata.url} reference ( probably using some
transformer - regexp, javascript? ), extract the file content from the
ERH XML response using xpath, and probably do some HTML stripping.

HTH,
Alex

On Fri, Jun 18, 2010 at 4:51 PM, Tod  wrote:
> I have a database containing Metadata from a content management system.
>  Part of that data includes a URL pointing to the actual published document
> which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.
>
> I'm already indexing the Metadata and that provides a lot of value.  The
> customer however would like that the content pointed to by the URL also be
> indexed for more discrete searching.
>
> This article at Lucid:
>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
>
> describes the process of coding a custom transformer.  A separate article
> I've read implies Nutch could be used to provide this functionality too.
>
> What would be the best and most efficient way to accomplish what I'm trying
> to do?  I have a feeling the Lucid article might be dated and there might
> ways to do this now without any coding and maybe without even needing to use
> Nutch.  I'm using the current release version of Solr.
>
> Thanks in advance.
>
>
> - Tod
>


Re: Solr DataConfig / DIH Question

2010-06-16 Thread Alexey Serba
> There is a 1-[0,1] relationship between Person and Address with address_id 
> being the nullable foreign key.

I think you should be good with a single query/entity then ( no need
for nested entities ); see the sketch below.
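For example ( a sketch; the left outer join and column list assume the
schema from your mail ):

<entity name="person"
        query="select p.id, p.name, a.zipcode
               from user.person p
               left outer join user.address a on a.id = p.address_id">
</entity>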



On Sunday, June 13, 2010, Holmes, Charles V.  wrote:
> I'm putting together an entity.  A simplified version of the database schema 
> is below.  There is a 1-[0,1] relationship between Person and Address with 
> address_id being the nullable foreign key.  If it makes any difference, I'm 
> using SQL Server 2005 on the backend.
>
> Person [id (pk), name, address_id (fk)]
> Address [id (pk), zipcode]
>
> My data config looks like the one below.  This naturally fails when the 
> address_id is null since the query ends up being "select * from user.address 
> where id = ".
>
>          Query="select * from user.person">
>              Query="select * from user.address where id = ${person.address_id}"
>   
> 
>
> I've worked around it by using a config like this one.  However, this makes 
> the queries quite complex for some of my larger joins.
>
>          Query="select * from user.person">
>              Query="select * from user.address where id = (select address_id 
> from user.person where id = ${person.id})">
>   
> 
>
> Is there a cleaner / better way of handling these type of relationships?  
> I've also tried to specify a default in the Solr schema, but that seems to 
> only work after all the data is indexed which makes sense but surprised me 
> initially.  BTW, thanks for the great DIH tutorial on the wiki!
>
> Thanks!
> Charles
>


Re: multiValued using

2010-06-07 Thread Alexey Serba
Hi Alberto,

You can add a child entity which returns multiple records; see the
sketch below.
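A minimal sketch ( table and column names are made up; the child
entity's query just has to return one row per value ):

<entity name="item" query="select id, title from item">
    <entity name="feature" query="select description from feature where item_id='${item.id}'">
        <!-- each returned row becomes one value of the multivalued field -->
        <field column="description" name="features" />
    </entity>
</entity>

The target field ( features here ) has to be declared
multiValued="true" in schema.xml.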
HTH,
Alex

2010/6/7 Alberto García Sola :
> Hello, this is my first message to this list.
>
> I was wondering if it is possible to use multiValued when using MySQL (or
> any SQL-database engine) through DataImportHandler.
>
> I've tried using a query which return something like this:
> 1 - title1 - multivalue1-1
> 1 - title1 - multivalue1-2
> 1 - title1 - multivalue1-3
> 2 - title2 - multivalue2-1
> 2 - title2 - multivalue2-2
>
> And using the first row as ID. But that only returns me the first occurrence
> rather than transforming them into multiValued fields.
>
> Is there a way to deal with multiValued in databases?
>
> NOTE: The way of working with multivalues I use is using foreign keys and
> relate them into the query so that the query gives me the results the way I
> have shown.
>
> Regards,
> Alberto.
>


Re: Importing large datasets

2010-06-07 Thread Alexey Serba
What's the relation between the items and item_descriptions tables?
I.e. is there only one item_descriptions record for every item id?

If it is 1-1 then you can merge all your data into a single database
and use a single joined query; see the sketch below.
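Something like ( an untested sketch; the join column is an assumption ):

<entity name="item"
        query="select i.*, d.description
               from items i
               left join item_descriptions d on d.item_id = i.id">
</entity>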

HTH,
Alex

On Thu, Jun 3, 2010 at 6:34 AM, Blargy  wrote:
>
>
> Erik Hatcher-4 wrote:
>>
>> One thing that might help indexing speed - create a *single* SQL query
>> to grab all the data you need without using DIH's sub-entities, at
>> least the non-cached ones.
>>
>>       Erik
>>
>> On Jun 2, 2010, at 12:21 PM, Blargy wrote:
>>
>>>
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware
>>> in approx. 1 hour (give or take 30 minutes).
>>>
>>> Also wanted to add that our main entity (item) consists of 5 sub-
>>> entities
>>> (ie, joins). 2 of those 5 are fairly small so I am using
>>> CachedSqlEntityProcessor for them but the other 3 (which includes
>>> item_description) are normal.
>>>
>>> All the entites minus the item_description connect to datasource1.
>>> They
>>> currently point to one physical machine although we do have a pool
>>> of 3 DB's
>>> that could be used if it helps. The other entity, item_description
>>> uses a
>>> datasource2 which has a pool of 2 DB's that could potentially be
>>> used. Not
>>> sure if that would help or not.
>>>
>>> I might as well that the item description will have indexed, stored
>>> and term
>>> vectors set to true.
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>
> I can't find any example of creating a massive sql query. Any out there?
> Will batching still work with this massive query?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: indexer threading?

2010-04-27 Thread Alexey Serba
Hi Brian,

I was testing indexing performance on a high-CPU box recently and ran
into the same issue. I tried different indexing methods ( XML,
CSVRequestHandler, and SolrJ + BinaryRequestWriter with multiple
threads ). The last method is indeed the fastest. I believe the
multiple-threads approach gives you better performance if you have
complex text analysis. I had very simple analysis -
WhitespaceTokenizer only - and the performance boost from adding
threads was not very impressive ( but still there ). I guess that with
simple text analysis overall performance comes down to synchronization
issues.
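For reference, a minimal sketch of the SolrJ + BinaryRequestWriter
multi-threaded setup I mean ( Solr 1.4-era API; the URL, thread count
and field names are placeholders ):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MultiThreadedIndexer {
    public static void main(String[] args) throws Exception {
        // one shared server instance, javabin format instead of XML
        final CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setRequestWriter(new BinaryRequestWriter());

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int thread = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 100000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", thread + "-" + i);
                            doc.addField("body", "some text to analyze and index");
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
    }
}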

I tried to profile the application during the indexing phase for CPU
times and monitors, and it seems that most of the blocking is on the
following methods:
- DocumentsWriter.doBalanceRAM
- DocumentsWriter.getThreadState
- SolrIndexWriter.ensureOpen

I don't know the guts of Solr/Lucene in such detail, so I can't draw
any conclusions. Are there any configuration techniques to improve
indexing performance in a multi-threaded scenario?

Alex

On Mon, Apr 26, 2010 at 6:52 PM, Wawok, Brian  wrote:
> Hi,
>
> I was wondering about how the multi-threading of the indexer works?  I am 
> using SolrJ to stream documents to a server. As I add more threads on the 
> client side, I slowly see both speed and CPU usage go up on the indexer side. 
> Once I hit about 4 threads, my indexer is at 100% cpu usage (of 1 CPU on a 
> 4-way box), and will not do any more work. It is pretty fast, doing something 
> like 75k lines of text per second.. but I would really like to use all 4 CPUs 
> on the indexer. Is the just a limitation of Solr, or is this a limitation of 
> using SolrJ and document streaming?
>
>
> Thanks,
>
>
> Brian
>


Re: Short Question: Fills this entity multiValued Fields (DIH)?

2010-04-08 Thread Alexey Serba
> Have a look at these two lines:
> 
> 
>                
> 
>
> If there is more than one description per item_ID, does the features-field
> gets multiple values if it is defined as multiValued=true?
Correct.


Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-24 Thread Alexey Serba
You should add this component ( suggest or spellcheck, depending on how
you name it ) to a request handler in solrconfig.xml; see the sketch
below.
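A sketch of the request handler registration ( the handler name just
has to match the URL you query, and "suggest" must match the name of
the spellcheck component you registered ):

<requestHandler name="/suggest" class="solr.SearchHandler">
    <arr name="last-components">
        <str>suggest</str>
    </arr>
</requestHandler>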

And then you can hit the following url and get your suggestions

http://localhost:8983/solr/suggest/?spellcheck=true&spellcheck.dictionary=suggest&spellcheck.build=true&spellcheck.extendedResults=true&spellcheck.count=10&q=prefix

On Wed, Mar 24, 2010 at 8:09 PM, stocki  wrote:
>
> hey.
>
> i got it =)
>
> i checked out with lucene and the build from solr. with ant -verbose
> example.
>
> now, when i put this line into solrconfig:  name="classname">org.apache.solr.spelling.suggest.Suggester
> no exception occurs =) juhu
>
> but how wokrs this component ?? sorry for a new stupid question ^^
>
>
> stocki wrote:
>>
>> okay, thx
>>
>> so i checked out but i cannot build an build.
>>
>> i got 100 errors ...
>>
>> D:\cygwin\home\stock\trunk_\solr\common-build.xml:424: The following error
>> occur
>> red while executing this line:
>> D:\cygwin\home\stock\trunk_\solr\common-build.xml:281: The following error
>> occur
>> red while executing this line:
>> D:\cygwin\home\stock\trunk_\solr\contrib\clustering\build.xml:69: The
>> following
>> error occurred while executing this line:
>> D:\cygwin\home\stock\trunk_\solr\build.xml:155: The following error
>> occurred whi
>> le executing this line:
>> D:\cygwin\home\stock\trunk_\solr\common-build.xml:221: Compile failed; see
>> the c
>> ompiler error output for details.
>>
>>
>>
>> Lance Norskog-2 wrote:
>>>
>>> You need 'ant' to do builds.  At the top level, do:
>>> ant clean
>>> ant example
>>>
>>> These will build everything and set up the example/ directory. After
>>> that, run:
>>> ant test-core
>>>
>>> to run all of the unit tests and make sure that the build works. If
>>> the autosuggest patch has a test, this will check that the patch went
>>> in correctly.
>>>
>>> Lance
>>>
>>> On Tue, Mar 23, 2010 at 7:42 AM, stocki  wrote:

 okay,
 i do this..

 but one file are not right updatet 
 Index: trunk/src/java/org/apache/solr/util/HighFrequencyDictionary.java
 (from the suggest.patch)

 i checkout it from eclipse, apply patch, make an new solr.war ... its
 the
 right way ??
 i thought that is making a war i didnt need to make an build.

 how do i make an build ?




 Alexey-34 wrote:
>
>> Error loading class 'org.apache.solr.spelling.suggest.Suggester'
> Are you sure you applied the patch correctly?
> See http://wiki.apache.org/solr/HowToContribute#Working_With_Patches
>
> Checkout Solr trunk source code (
> http://svn.apache.org/repos/asf/lucene/solr/trunk ), apply patch,
> verify that everything went smoothly, build solr and use built version
> for your tests.
>
> On Mon, Mar 22, 2010 at 9:42 PM, stocki  wrote:
>>
>> i patch an nightly build from solr.
>> patch runs, classes are in the correct folder, but when i replace
>> spellcheck
>> with this spellchecl like in the comments, solr cannot find the
>> classes
>> =(
>>
>> 
>>    
>>      suggest
>>      > name="classname">org.apache.solr.spelling.suggest.Suggester
>>      > name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup
>>      text
>>      american-english
>>    
>>  
>>
>>
>> --> SCHWERWIEGEND: org.apache.solr.common.SolrException: Error loading
>> class
>> 'org.ap
>> ache.solr.spelling.suggest.Suggester'
>>
>>
>> why is it so ??  i think no one has so many trouble to run a patch
>> like
>> me =( :D
>>
>>
>> Andrzej Bialecki wrote:
>>>
>>> On 2010-03-19 13:03, stocki wrote:

 hello..

 i try to implement autosuggest component from these link:
 http://issues.apache.org/jira/browse/SOLR-1316

 but i have no idea how to do this !?? can anyone get me some tipps ?
>>>
>>> Please follow the instructions outlined in the JIRA issue, in the
>>> comment that shows fragments of XML config files.
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>   ___. ___ ___ ___ _ _   __
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://old.nabble.com/SOLR-1316-How-To-Implement-this-autosuggest-component-tp27950949p27990809.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>

 --
 View this message in context:
 http://old.nabble.com/SOLR-1316-How-To-Implement-this-patch-autoComplete-tp27950949p28001938.html
 Sent from the Solr - User mailing list archive

Re: SOLR-1316 How To Implement this autosuggest component ???

2010-03-23 Thread Alexey Serba
> Error loading class 'org.apache.solr.spelling.suggest.Suggester'
Are you sure you applied the patch correctly?
See http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

Check out the Solr trunk source code (
http://svn.apache.org/repos/asf/lucene/solr/trunk ), apply the patch,
verify that everything went smoothly, build Solr and use the built
version for your tests.

On Mon, Mar 22, 2010 at 9:42 PM, stocki  wrote:
>
> i patch an nightly build from solr.
> patch runs, classes are in the correct folder, but when i replace spellcheck
> with this spellchecl like in the comments, solr cannot find the classes =(
>
> 
>    
>      suggest
>      org.apache.solr.spelling.suggest.Suggester
>       name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup
>      text
>      american-english
>    
>  
>
>
> --> SCHWERWIEGEND: org.apache.solr.common.SolrException: Error loading class
> 'org.ap
> ache.solr.spelling.suggest.Suggester'
>
>
> why is it so ??  i think no one has so many trouble to run a patch like
> me =( :D
>
>
> Andrzej Bialecki wrote:
>>
>> On 2010-03-19 13:03, stocki wrote:
>>>
>>> hello..
>>>
>>> i try to implement autosuggest component from these link:
>>> http://issues.apache.org/jira/browse/SOLR-1316
>>>
>>> but i have no idea how to do this !?? can anyone get me some tipps ?
>>
>> Please follow the instructions outlined in the JIRA issue, in the
>> comment that shows fragments of XML config files.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>   ___. ___ ___ ___ _ _   __
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/SOLR-1316-How-To-Implement-this-autosuggest-component-tp27950949p27990809.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Term Highlighting without store text in index

2010-03-18 Thread Alexey Serba
Hey Dominique,

See 
http://www.lucidimagination.com/search/document/5ea8054ed8348e6f/highlight_arbitrary_text#3799814845ebf002

Although it might not be a good solution for huge texts or
wildcard/phrase queries.

On Mon, Mar 15, 2010 at 4:09 PM, dbejean  wrote:
>
> Hello,
>
> Just in order to be able to show term highlighting in my results list, I
> store all the indexed data in the Lucene index and so, it is very huge
> (108Gb). Is there any possibilities to do it in an other way ? Now or in the
> future, is it possible that Solr use a 3nd-party tool such as ehcache in
> order to store the content of the indexed documents outside of the Lucene
> index ?
>
> Thank you
>
> Dominique
>
>
> --
> View this message in context: 
> http://old.nabble.com/Term-Highlighting-without-store-text-in-index-tp27904022p27904022.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: implementing profanity detector

2010-02-11 Thread Alexey Serba
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
https://issues.apache.org/jira/browse/SOLR-1536

On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham  wrote:
> We'd like to implement a profanity detector for documents during indexing.
>  That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
>
> I'm trying to figure out how best to implement this with Solr 1.4:
>
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overheard AFAIK.
>
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
>
> Any suggestions on how to best implement this?
>
> Thanks in advance,
> mike
>


DataImportHandler - case sensitivity of column names

2010-02-08 Thread Alexey Serba
I encountered a problem with Oracle converting column names to upper
case. As a result the SolrInputDocument is created with field names in
upper case and a "Document [null] missing required field: id" exception
is thrown ( although the id field is defined ).

I do not specify "field" elements explicitly.

I know that I can rewrite all my queries in the "select id as "id",
body as "body" from document" form, but is there any other workaround
for this? A case-insensitive option or something?

Here's my data-config ( essentially a single entity over the document
table, with no explicit field elements ):

<dataConfig>
    <dataSource driver="oracle.jdbc.driver.OracleDriver" url="..." user="..." password="..." />
    <document>
        <entity name="document" query="select id, body from document" />
    </document>
</dataConfig>

Alexey


Re: Indexing an oracle warehouse table

2010-02-03 Thread Alexey Serba
> What would be the right way to point out which field contains the term 
> searched for.
I would use highlighting for all of these fields and then post-process
the Solr response in order to check the highlighting tags. But I don't
usually have that many fields, and I don't know if it's possible to
configure Solr to highlight fields using '*' as with dynamic fields.

On Wed, Feb 3, 2010 at 2:43 AM, caman  wrote:
>
> Thanks all. I am on track.
> Another question:
> What would be the right way to point out which field contains the term
> searched for.
> e.g. If I search for SOLR and if the term exist in field788 for a document,
> how do I pinpoint that which field has the term.
> I copied all the fields in field called 'body' which makes searching easier
> but would be nice to show the field which has that exact term.
>
> thanks
>
> caman wrote:
>>
>> Hello all,
>>
>> hope someone can point me to right direction. I am trying to index an
>> oracle warehouse table(TableA) with 850 columns. Out of the structure
>> about 800 fields are CLOBs and are good candidate to enable full-text
>> searching. Also have few columns which has relational link to other
>> tables. I am clean on how to create a root entity and then pull data from
>> other relational link as child entities.  Most columns in TableA are named
>> as field1,field2...field800.
>> Now my question is how to organize the schema efficiently:
>> First option:
>> if my query is 'select * from TableA', Do I  define > column="FIELD1" /> for each of those 800 columns?   Seems cumbersome. May
>> be can write a script to generate XML instead of handwriting both in
>> data-config.xml and schema.xml.
>> OR
>> Dont define any  so that column in
>> SOLR will be same as in the database table. But questions are 1)How do I
>> define unique field in this scenario? 2) How to copy all the text fields
>> to a common field for easy searching?
>>
>> Any helpful is appreciated. Please feel free to suggest any alternative
>> way.
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/Indexing-an-oracle-warehouse-table-tp27414263p27429352.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Indexing a oracle warehouse table

2010-02-02 Thread Alexey Serba
> Dont define any  so that column in
> SOLR will be same as in the database table.
Correct.
You can define a dynamic field ( see
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields ).

> 1)How do I define unique field in this scenario?
You can create a primary key in the database or generate it directly in
Solr ( see "UUID techniques" at http://wiki.apache.org/solr/UniqueKey ).

> 2) How to copy all the text fields to a common field for easy searching?
Use copyField ( see
http://wiki.apache.org/solr/SchemaXml#Copy_Fields ); a schema sketch
covering both the dynamic field and the copyField is below.
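A small schema.xml sketch covering both points ( the field* pattern
follows the field1...field800 naming from your table; the type and the
catchall field name are assumptions ):

<!-- pick up field1 ... field800 without declaring each one -->
<dynamicField name="field*" type="text" indexed="true" stored="true" />

<!-- catchall field for easy searching -->
<field name="body" type="text" indexed="true" stored="false" multiValued="true" />
<copyField source="field*" dest="body" />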


On Tue, Feb 2, 2010 at 4:22 AM, caman  wrote:
>
> Hello all,
>
> hope someone can point me to right direction. I am trying to index an oracle
> warehouse table(TableA) with 850 columns. Out of the structure about 800
> fields are CLOBs and are good candidate to enable full-text searching. Also
> have few columns which has relational link to other tables. I am clean on
> how to create a root entity and then pull data from other relational link as
> child entities.  Most columns in TableA are named as
> field1,field2...field800.
> Now my question is how to organize the schema efficiently:
> First option:
> if my query is 'select * from TableA', Do I  define  column="FIELD1" /> for each of those 800 columns?   Seems cumbersome. May be
> can write a script to generate XML instead of handwriting both in
> data-config.xml and schema.xml.
> OR
> Dont define any  so that column in
> SOLR will be same as in the database table. But questions are 1)How do I
> define unique field in this scenario? 2) How to copy all the text fields to
> a common field for easy searching?
>
> Any helpful is appreciated. Please feel free to suggest any alternative way.
>
> Thanks
>
>
>
>
>
> --
> View this message in context: 
> http://old.nabble.com/Indexing-a-oracle-warehouse-table-tp27414263p27414263.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


DataImportHandler - convertType attribute

2010-02-02 Thread Alexey Serba
Hello,

I encountered a blob indexing problem and found the convertType
solution in the FAQ.

I was wondering why it is not enabled by default and found the
following comment in the mailing list:

"We used to attempt type conversion from the SQL type to the field's given
type. We
found that it was error prone and switched to using the ResultSet#getObject
for all columns (making the old behavior a configurable option –
"convertType" in JdbcDataSource)."

Why is it error prone? Is it safe enough to enable convertType for all
JDBC data sources by default? What are the side effects?

Thanks in advance,
Alex


Re: DataImportHandler - synchronous execution

2010-01-13 Thread Alexey Serba
Hi,

I created Jira issue SOLR-1721 and attached a simple patch ( no
documentation ) for this.

HIH,
Alex

2010/1/13 Noble Paul നോബിള്‍  नोब्ळ् :
> it can be added
>
> On Tue, Jan 12, 2010 at 10:18 PM, Alexey Serba  wrote:
>> Hi,
>>
>> I found that there's no explicit option to run DataImportHandler in a
>> synchronous mode. I need that option to run DIH from SolrJ (
>> EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
>> to DIH as a workaround for this, but I think it makes sense to add
>> specific option for that. Any objections?
>>
>> Alex
>>
>
>
>
> --
> -
> Noble Paul | Systems Architect| AOL | http://aol.com
>


DataImportHandler - synchronous execution

2010-01-12 Thread Alexey Serba
Hi,

I found that there's no explicit option to run DataImportHandler in a
synchronous mode. I need that option to run DIH from SolrJ (
EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
to DIH as a workaround for this, but I think it makes sense to add a
specific option for that. Any objections?

Alex


Re: Adaptive search?

2009-12-18 Thread Alexey Serba
You can add click counts to your index as an additional field and boost
results based on that value.

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29
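For example, with the dismax handler you can boost by such a field
through a boost function ( the "clicks" field name is just an
assumption ):

http://localhost:8983/solr/select?defType=dismax&q=ipod&qf=name^2.0+description&bf=log(clicks)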

You can keep some kind of buffer for clicks and update the click count
field for documents in the index periodically.

If you don't want to update whole documents in the index then you
should probably look at ExternalFileField, or at Lucene ParallelReader
as a custom Solr IndexReader, but this is complex low-level Lucene
stuff and requires some hacking.

Alex

On Thu, Dec 17, 2009 at 6:46 PM, Siddhant Goel  wrote:
> Let say we have a search engine (a simple front end - web app kind of a
> thing - responsible for querying Solr and then displaying the results in a
> human readable form) based on Solr. If a user searches for something, gets
> quite a few search results, and then clicks on one such result - is there
> any mechanism by which we can notify Solr to boost the score/relevance of
> that particular result in future searches? If not, then any pointers on how
> to go about doing that would be very helpful.
>
> Thanks,
>
> On Thu, Dec 17, 2009 at 7:50 PM, Paul Libbrecht  wrote:
>
>> What can it mean to "adapt to user clicks" ? Quite many things in my head.
>> Do you have maybe a citation that inspires you here?
>>
>> paul
>>
>>
>> Le 17-déc.-09 à 13:52, Siddhant Goel a écrit :
>>
>>
>>  Does Solr provide adaptive searching? Can it adapt to user clicks within
>>> the
>>> search results it provides? Or that has to be done externally?
>>>
>>
>>
>
>
> --
> - Siddhant
>


Re: preserve relational strucutre in solr?

2009-12-14 Thread Alexey Serba
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example

See full import example, it has 1-n and n-n relationships

On Mon, Dec 14, 2009 at 4:34 PM, Faire Mii  wrote:
>
>  was able to import data through solr DIH.
>
> in my db i have 3 tables:
>
> threads: id tags: id thread_tag_map: thread_id, tag_id
>
> i want to import the many2many relationship (which thread has which tags) to 
> my solr index.
>
> how should the query look like.
>
> i have tried with following code without result:
>
>         query="select * from threads, tags, thread_tag_map where 
> thread_tag_map.thread_id = threads.id AND thread_tag_map.tag_id = tags.id">
> 
>
> s this the right way to go?
>
> i thought that with this query each document will consist of tread and all 
> the tags related to it. and i could do a query to get the specific thread by 
> tagname.
>
>
> thanks!


Re: sanizing/filtering query string for security

2009-11-09 Thread Alexey Serba
> BTW, I have not used DisMax handler yet, but does it handle *:* properly?
See q.alt DisMax parameter
http://wiki.apache.org/solr/DisMaxRequestHandler#q.alt

You can specify q.alt=*:* and q as empty string to get all results.

> do you care if users issue this query
I allow users to issue an empty search and get all results with all
facets / etc. It's a nice navigation UI btw.

> Basically given my UI, I'm trying to *hide* the total count from users 
> searching for *everything*
If you don't specify q.alt parameter then Solr returns zero results
for empty search. *:* won't work either.

> though this syntax has helped me debug/monitor the state of my search doc 
> pool size.
see q.alt

Alex

On Tue, Nov 10, 2009 at 12:59 AM, michael8  wrote:
>
> Sounds like a nice approach you have  done.  BTW, I have not used DisMax
> handler yet, but does it handle *:* properly?  IOW, do you care if users
> issue this query, or does DisMax treat this query string differently than
> standard request handler?  Basically given my UI, I'm trying to *hide* the
> total count from users searching for *everything*, though this syntax has
> helped me debug/monitor the state of my search doc pool size.
>
> Thanks,
> Michael
>
>
> Alexey-34 wrote:
>>
>> I added some kind of pre and post processing of Solr results for this,
>> i.e.
>>
>> If I find fieldname specified in query string in form of
>> "fieldname:term" then I pass this query string to standard request
>> handler, otherwise use DisMaxRequestHandler ( DisMaxRequestHandler
>> doesn't break the query, at least I haven't seen yet ). If standard
>> request handler throws error ( invalid field, too many clauses, etc )
>> then I pass original query to DisMax request handler.
>>
>> Alex
>>
>> On Mon, Nov 9, 2009 at 10:05 PM, michael8  wrote:
>>>
>>> Hi Julian,
>>>
>>> Saw you post on exactly the question I have.  I'm curious if you got any
>>> response directly, or figured out a way to do this by now that you could
>>> share?  I'm in the same situation trying to 'sanitize' the query string
>>> coming in before handing it to solr.  I do see that characters like ":"
>>> could break the query, but am curious if anyone has come up with a
>>> general
>>> solution as I think this must be a fairly common problem for any solr
>>> deployment to tackle.
>>>
>>> Thanks,
>>> Michael
>>>
>>>
>>> Julian Davchev wrote:

 Hi,
 Is there anything special that can be done for sanitizing user input
 before passed as query to solr.
 Not allowing * and ? as first char is only thing I can thing of right
 now. Anything else it should somehow handle.

 I am not able to find any relevant document.


>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/sanizing-filtering-query-string-for-security-tp21516844p26271891.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/sanizing-filtering-query-string-for-security-tp21516844p26274459.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: sanizing/filtering query string for security

2009-11-09 Thread Alexey Serba
I added some kind of pre- and post-processing of Solr queries and
results for this, i.e.

If I find a fieldname specified in the query string in the form
"fieldname:term" then I pass the query string to the standard request
handler, otherwise I use the DisMaxRequestHandler ( the
DisMaxRequestHandler doesn't break on such queries, at least I haven't
seen it yet ). If the standard request handler throws an error
( invalid field, too many clauses, etc ) then I pass the original query
to the DisMax request handler.
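A rough sketch of that fallback logic with SolrJ ( purely illustrative;
the "fieldname:term" detection regex and handler names are
assumptions ):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SafeSearch {
    private final SolrServer server;

    public SafeSearch(SolrServer server) {
        this.server = server;
    }

    public QueryResponse search(String userInput) throws SolrServerException {
        // crude heuristic for "fieldname:term" style input
        boolean looksFielded = userInput.matches(".*\\w+:\\S+.*");
        if (looksFielded) {
            try {
                return query(userInput, "standard");
            } catch (SolrServerException e) {
                // invalid field, too many clauses, unbalanced quotes, ... - fall back to dismax
            }
        }
        return query(userInput, "dismax");
    }

    private QueryResponse query(String q, String handler) throws SolrServerException {
        SolrQuery query = new SolrQuery(q);
        query.setQueryType(handler); // selects the request handler (qt parameter)
        return server.query(query);
    }
}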

Alex

On Mon, Nov 9, 2009 at 10:05 PM, michael8  wrote:
>
> Hi Julian,
>
> Saw you post on exactly the question I have.  I'm curious if you got any
> response directly, or figured out a way to do this by now that you could
> share?  I'm in the same situation trying to 'sanitize' the query string
> coming in before handing it to solr.  I do see that characters like ":"
> could break the query, but am curious if anyone has come up with a general
> solution as I think this must be a fairly common problem for any solr
> deployment to tackle.
>
> Thanks,
> Michael
>
>
> Julian Davchev wrote:
>>
>> Hi,
>> Is there anything special that can be done for sanitizing user input
>> before passed as query to solr.
>> Not allowing * and ? as first char is only thing I can thing of right
>> now. Anything else it should somehow handle.
>>
>> I am not able to find any relevant document.
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/sanizing-filtering-query-string-for-security-tp21516844p26271891.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Similar documents from multiple cores with different schemas

2009-11-09 Thread Alexey Serba
> Or maybe it's
> possible to tweak MoreLikeThis just to return the fields and terms that
> could be used for a search on the other core?
Exactly

See parameter mlt.interestingTerms in MoreLikeThisHandler
http://wiki.apache.org/solr/MoreLikeThisHandler

You can get the interesting terms and build a query ( with N optional
clauses + boosts ) against the second core yourself.
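For example ( a hypothetical request; core and field names are
placeholders ):

http://localhost:8983/solr/coreA/mlt?q=id:123&mlt.fl=content&mlt.interestingTerms=details&mlt.boost=true&rows=0

returns the interesting terms with their boosts, which you can then
turn into a boolean query like content:(term1^1.4 term2^1.1 ...)
against the second core.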

HIH,
Alex


On Mon, Nov 9, 2009 at 6:25 PM, Chantal Ackermann
 wrote:
> Hi all,
>
> my search for any postings answering the following question haven't produced
> any helpful hints so far. Maybe someone can point me into the right
> direction?
>
> Situation:
> I have two cores with slightly different schemas. Slightly means that some
> fields appear on both cores but there are some that are required in one core
> but optional in the other. Then there are fields that appear only in one
> core.
> (I don't want to put them in one index, right now, because of the fields
> that might be required for only one type but not the other. But it's
> certainly an option.)
>
> Question:
> Is there a way to get similar contents from core B when the input (seed) to
> the comparison is a document from core A?
>
> MoreLikeThis:
> I was searching for MoreLikeThis, multiple schemas etc. As these are cores
> with different schemas, the posts on distributed search/sharding in
> combination with MoreLikeThis are not helpful. But maybe there is some other
> functionality that I am not aware of? Some similarity search? Or maybe it's
> possible to tweak MoreLikeThis just to return the fields and terms that
> could be used for a search on the other core?
>
> Thanks for any input!
> Chantal
>


Re: MoreLikeThis and filtering/restricting on "target" fields

2009-11-06 Thread Alexey Serba
Hi Cody,

> I have tried using MLT as a search component so that it has access to
> filter queries (via fq) but I cannot seem to get it to give me any
> data other than more of the same, that is, I can get a ton of Articles
> back but not other "content types".
Filter query ( fq ) should work, for example add fq=type_s:BlogPost OR
type_s:Community

http://localhost:9007/solr/mlt?q=id:WikiArticle:948&mlt.fl=body_t&mlt.qf=body_t^1.0&fq=type_s:BlogPost
OR type_s:Community

Alex

On Fri, Nov 6, 2009 at 1:44 AM, Cody Caughlan  wrote:
> I am trying to use MoreLikeThis (both the component and handler,
> trying combinations) and I would like to give it an input document
> reference which has a "source" field to analyze and then get back
> other documents which have a given field that is used by MLT.
>
> My dataset is composed of documents like:
>
> # Doc 1
> id:Article:99
> type_s:Article
> body_t: the body of the article...
>
> # Doc 2
> id:Article:646
> types_s:Article
> body_t: another article...
>
> # Doc 3
> id:Community:44
> type_s:Community
> description_t: description of this community...
>
> # Doc 4
> id:Community:34874
> type_s:Community
> description_t: another description
>
> # Doc 5
> id:BlogPost:2384
> type_s:BlogPost
> body_t: contents of some blog post
>
> So I would like to say, "given an article (e.g. id:"Article:99" which
> has a field "body_t" that should be analyze), give more related
> Communities, and you will want to search on "description_t" for your
> analysis".'
>
> When I run a basic query like:
>
> (using raw URL values for clarity, but they are encoded in reality)
>
> http://localhost:9007/solr/mlt?q=id:WikiArticle:948&mlt.fl=body_t
>
> then I get back a ton of other articles. Which is fine if my target
> type was Article.
>
> So how I can I say "search on field A for your analysis of the input
> document, but for related terms use field B, filtered by type_s"
>
> It seems that I can really only specify one field via mlt.fl
>
> I have tried using MLT as a search component so that it has access to
> filter queries (via fq) but I cannot seem to get it to give me any
> data other than more of the same, that is, I can get a ton of Articles
> back but not other "content types".
>
> Am I just trying to do too much?
>
> Thanks
> /Cody
>


Re: Dismax and Standard Queries together

2009-11-03 Thread Alexey Serba
Hi Ram,

You can add another field, total ( a catchall field ), and copy all
other fields into it ( using the copyField directive )
http://wiki.apache.org/solr/SchemaXml#Copy_Fields

and use this field in DisMax qf parameter, for example
qf=business_name^2.0 category_name^1.0 sub_category_name^1.0 total^0.0
and
mm=100%

Thus, every search keyword is required to occur in at least one field
of your document, but you can control the relevance of the returned
results via the boosts in the qf parameter.
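A sketch of the catchall field in schema.xml ( field names taken from
the qf example above; the type is an assumption ):

<field name="total" type="text" indexed="true" stored="false" multiValued="true" />

<copyField source="business_name" dest="total" />
<copyField source="category_name" dest="total" />
<copyField source="sub_category_name" dest="total" />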

HIH,
Alex

On Tue, Nov 3, 2009 at 12:02 AM, ram_sj  wrote:
>
> Hi,
>
> I have three fields, business_name, category_name, sub_category_name in my
> solrconfig file.
>
> my query = "pet clinic"
>
> example sub_category_names: Veterinarians, Kennels, Veterinary Clinics
> Hospitals, Pet Grooming, Pet Stores, Clinics
>
> my ideal requirement is dismax searching on
>
> a. dismax over three or two fields
> b. followed by a Boolean match over any one of the field is acceptable.
>
> I played around with minimum match attributes, but doesn't seems to be
> helpful, I guess the dismax requires at-least two fields.
>
> The nest queries takes only one qf filed, so it doesn't help much either.
>
> Any suggestions will be helpful.
>
> Thanks
> Ram
> --
> View this message in context: 
> http://old.nabble.com/Dismax-and-Standard-Queries-together-tp26157830p26157830.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Solr Cell on web-based files?

2009-11-02 Thread Alexey Serba
> e.g (doesn't work)
> curl http://localhost:8983/solr/update/extract?extractOnly=true
> --data-binary @http://myweb.com/mylocalfile.htm -H "Content-type:text/html"

> You might try remote streaming with Solr (see
> http://wiki.apache.org/solr/SolrConfigXml).

Yes, curl example

curl 
'http://localhost:8080/solr/main_index/extract/?extractOnly=true&indent=on&resource.name=lecture12&stream.url=http%3A//myweb.com/lecture12.ppt'

It works great for me.

Alex


Re: adding and updating a lot of document to Solr, metadata extraction etc

2009-11-02 Thread Alexey Serba
Hi Eugene,

> - ability to iterate over all documents, returned in search, as Lucene does
>  provide within a HitCollector instance. We would need to extract and
>  aggregate various fields, stored in index, to group results and aggregate 
> them
>  in some way.
> 
> Also I did not find any way in the tutorial to access the search results with
> all fields to be processed by our application.
>
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Check out Faceted Search; you can probably achieve your goal by using
the facet component.

There's also Field Collapsing patch
http://wiki.apache.org/solr/FieldCollapsing


Alex


Re: Keepwords Schema

2009-10-05 Thread Alexey Serba
Probably you want to:
- use a multivalued field 'authors', e.g. ( field names are illustrative )

  <doc>
    <field name="file">login.php</field>
    <field name="authors">alex</field>
    <field name="authors">brian</field>
    ...
  </doc>

- return facets for this field
- filter unwanted authors either during the indexing process or by
  post-processing the returned search results

On Fri, Oct 2, 2009 at 4:35 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Thu, Oct 1, 2009 at 7:37 PM, matrix_psj  wrote:
>
> >
> >
> > An example:
> > My schema is about web files. Part of the syntax is a text field of
> authors
> > that have worked on each file, e.g.
> > 
> >login.php
> >   2009-01-01
> >   alex, brian, carl carlington, dave alpha, eddie, dave
> > beta
> > 
> >
> > When I perform a search and get 20 web files back, I would like a facet
> of
> > the individual authors, but only if there name appears in a
> > public_authors.txt file.
> >
> > So if the public_authors.txt file contained:
> > Anna,
> > Bob,
> > Carl Carlington,
> > Dave Alpha,
> > Elvis,
> > Eddie,
> >
> > The facet returned would be:
> > Carl Carlington
> > Dave Alpha
> > Eddie
> >
> >
> >
> > Not sure if that makes sense? If it does, could someone explain to me the
> > schema fieldtype declarations that would bring back this sort of results.
> >
> >
> If I'm understanding you correctly - You want to facet on a field (with
> facet=true&facet.field=authors) but you want to show only certain
> whitelisted facet values in the response.
>
> If that is correct then, you can remove the authors which are not in the
> whitelist during indexing time. You can do this by adding
> KeepWordFilterFactory to your field type:
>
> <filter class="solr.KeepWordFilterFactory" words="public_authors.txt" ignoreCase="true" />
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: yellow pages navigation kind menu. howto take every 100th row from resultset

2009-10-05 Thread Alexey Serba
It seems that you need Faceted Search (
http://wiki.apache.org/solr/SolrFacetingOverview ).

On Fri, Oct 2, 2009 at 3:35 PM, Julian Davchev  wrote:
> Hi,
>
> Long story short:   how can I take every 100th row from solr resultset.
> What would syntax for this be.
>
> Long story:
>
> Currently I have lots of say documents(articles) indexed. They all have
> field title with corresponding value.
>
> atitle
> btitle
> .
> *title
>
> How do I build menu   so I can search of those?
> I cannot just hardcode  ABC  Dmeaning all starting
> with A all starting with B etc...cause there are unicode characters
> and english alphabet will just not cut it...
>
> So my idea is to make ranges like
>
> [atitle - mtitle][mtitle - ltitle] ...etc etc   (based on
> actual title names I got)
>
>
> Questions is how do I figure out what those  atitle-mtitle is (like get
> from solr query every 100th record)
> Two solutions I found:
> 1. get all stuff and do it server side (huge load as it's thousands
> record we talk about)
> 2. use solr sort and &start and make N calls until   resulted rows <
> 100.But this will mean quite a load as well as there lots of records.
>
> Any pointers?
> Thanks
>
>
>


Re: do NOT want to stem plurals for a particular field, or words

2009-09-16 Thread Alexey Serba
>  You can enable/disable stemming per field type in the schema.xml, by
> removing the stemming filters from the type definition.
>
> Basically, copy your prefered type, rename it to something like
> 'text_nostem', remove the stemming filter from the type and use your
> 'text_nostem' type for your field 'type' .
+ you can search in both fields, text_stemmed and text_exact, using the
DisMax handler and boost matches on text_exact. Thus if you search for
'articles' you'll get all results with 'articles' and 'article', but
exact matches will be on top.
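For example, with hypothetical field names, the DisMax parameters could
look like:

  qf=text_exact^2.0 text_stemmed^1.0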


Re: Disabling tf (term frequency) during indexing and/or scoring

2009-09-16 Thread Alexey Serba
Hi Aaron,

You can overwrite default Lucene Similarity and disable tf and
lengthNorm factors in scoring formula ( see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html )

You need to

1) compile the following class and put it into Solr WEB-INF/classes
---
package my.package;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

public float lengthNorm(String fieldName, int numTerms) {
return numTerms > 0 ? 1.0f : 0.0f;
}

public float tf(float freq) {
return freq > 0 ? 1.0f : 0.0f;
}
}
---

2) Add <similarity class="my.package.NoLengthNormAndTfSimilarity"/>
into your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca

HIH,
Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee  wrote:
> Hello,
>
> Let me preface this by admitting that I'm still fairly new to Lucene and
> Solr, so I apologize if any of this sounds naive and I'm open to thinking
> about my problem differently.
>
> I'm currently responsible for a rather large dataset of business records
> that I'm trying to build a Lucene/Solr infrastructure around, to replace an
> in-house solution that we've been using for a few years. These records are
> sourced from multiple providers and there's often a fair bit of overlap in
> the business coverage. I have a set of fuzzy correlation libraries that I
> use to identify these documents and I ultimately create a super-record that
> includes metadata from each of the providers. Given the nature of things,
> these providers often have slight variations in wording or spelling in the
> overlapping fields (it's amazing how many ways people find to refer to the
> same business or address). I'd like to capture these variations, as they
> facilitate searching, but TF considerations are currently borking field
> scoring here.
>
> For example, taking business names into consideration, I have a Solr schema
> similar to:
>
>  multiValued="true">
> ...
>  multiValued="true">
>  multiValued="true" omitNorms="true">
>
> 
> ...
> 
>
> For any given business record, there may be 1..N business names present in
> the nameNorm field (some with naming variations, some identical). With TF
> enabled, however, I'm getting different match scores on this field simply
> based on how many providers contributed to the record, which is not
> meaningful to me. For example, a record containing foo
> barfoo bar is necessarily scoring higher
> than a record just containing foo bar.  Although I
> wouldn't mind TF data being considered within each discrete field value, I
> need to find a way to prevent score inflation based simply on the number of
> contributing providers.
>
> Looking at the mailing list archive and searching around, it sounds like the
> omitTf boolean in Lucene used to function somewhat in this manner, but has
> since taken on a broader interpretation (and name) that now also disables
> positional and payload data. Unfortunately, phrase support for fields like
> this is absolutely essential. So what's the best way to address a need like
> this? I guess I don't mind whether this is handled at index time or search
> time, but I'm not sure what I may need to override or if there's some
> existing provision I should take advantage of.
>
> Thank you for any help you may have.
>
> Best regards,
> Aaron
>


Re: query too long / has-many relation

2009-09-09 Thread Alexey Serba
> But apart from that everything works fine now (10,000 OR clauses takes 10
> seconds).
Not fast.
I would recommend denormalizing your data, putting everything into the
Solr index and using Solr faceting
( http://wiki.apache.org/solr/SolrFacetingOverview ) to get the
relevant persons ( see my previous message ).


Re: query too long / has-many relation

2009-09-09 Thread Alexey Serba
>> Is there a way to configure Solr to accept POST queries (instead of GET
>> only?).
>> Or: is there some other way to make Solr accept queries longer than 2,000
>> characters? (Up to 10,000 would be nice)
> Solr accepts POST queries by default. I switched to POST for exactly
> the same reason. I use Solr 1.4 ( trunk version ) though.
Don't forget to increase maxBooleanClauses in solrconfig.xml
http://wiki.apache.org/solr/SolrConfigXml#head-69ecb985108d73a2f659f2387d916064a2cf63d1


Re: query too long / has-many relation

2009-09-09 Thread Alexey Serba
> Is there a way to configure Solr to accept POST queries (instead of GET
> only?).
> Or: is there some other way to make Solr accept queries longer than 2,000
> characters? (Up to 10,000 would be nice)
Solr accepts POST queries by default. I switched to POST for exactly
the same reason. I use Solr 1.4 ( trunk version ) though.


> I have a Solr 1.3 index (served by Tomcat) of People, containing id, name,
> address, description etc. This works fine.
> Now I want to store and retrieve Events (time location, person), so each
> person has 0 or more events.
> As I understood it, there is no way to model a has-many relation in Solr (at
> least not between two structures with more than 1 properties), so I decided
> to store the Events in a separate mysql table.
> An example of a query I would like to do is: give me all people that will
> have an Event on location x coming month, that have  in their
> description.
> I do this in two steps now: first I query the mysql table, then I build a
> solr query, with a big OR of all the ids.
> The problem is that this can generate long (too long) querystrings.
Another option would be to put all your event objects ( time, location,
person_id, description ) into the Solr index ( denormalization ). Then
you can run the Solr query "give me all events on location x in the
coming month that have smth in their description" and ask Solr to
return facet values for the field person_id. Solr will return all
distinct values of the field "person_id" that match the query, together
with counts. Then you can take the list of related person_ids and load
all persons from the MySQL database using an SQL "id IN (...)" clause.
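For example ( a hypothetical, unencoded request; field names are
assumptions ):

http://localhost:8983/solr/select?q=description:smth AND location:x AND time:[NOW TO NOW+1MONTH]&rows=0&facet=true&facet.field=person_id&facet.mincount=1&facet.limit=-1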

