Re: appear garbled when I use DIH from oracle database

2012-06-12 Thread Lance Norskog
You need to unpack the GBK encoding into Unicode strings. This might
be an Oracle function in the SQL query.
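If the driver hands strings back already mis-decoded, one sketch of the unpacking Lance describes is to re-decode the raw bytes with the GBK charset on the Java side (class and method names here are illustrative, not an Oracle or Solr API; the real fix may instead be a driver or NLS setting on the Oracle side):

```java
import java.nio.charset.Charset;

public class GbkDecode {
    // Re-decode bytes that were fetched from the database as raw GBK.
    static String decodeGbk(byte[] raw) {
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // Round-trip a Chinese string through GBK bytes to show the decode.
        String original = "中文测试";
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));
        String decoded = decodeGbk(gbkBytes);
        if (!decoded.equals(original)) throw new AssertionError(decoded);
        System.out.println(decoded);
    }
}
```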

On Fri, Jun 8, 2012 at 3:16 AM, 涂小刚  wrote:
> Hello:
>  When I use DIH from an Oracle database, the text appears garbled. Why? PS: my
> Oracle database uses GBK encoding for Chinese text.
> how can I solve the problem?
> thanks!



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing Multiple Datasources

2012-06-12 Thread Lance Norskog
Are you trying to do a JOIN on these two tables inside the DIH?

On Tue, Jun 12, 2012 at 8:35 PM, Gora Mohanty  wrote:
> On 11 June 2012 21:29, Kay  wrote:
>> Hello,
>>
>> We have 2 MS SQL Server databases which we wanted to index. But most of the
>> columns in the databases have the same names. E.g., both DBs have
>> the columns FirstName, LastName, etc.
>
> It is not clear how you want to handle this: Should the records
> from both databases be indexed into the same fields, e.g.,
> FirstName always mapped to the Solr field FirstName for
> both databases, or would you want FirstName to be mapped
> to, say, FirstName1, for the second database?
>
>> How can you index multiple Databases using single db-data-config file and
>> one schema?
>>
>> Here is my data-config file
>> 
>>
>> > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>> url="jdbc:sqlserver://localhost;DatabaseName= " user="" password="" />
>>    
>>                
>>                        > name="BusinessEntityID" />
>>                        
>>                        
>>                        
>>                        
>>
>>                
>>                
>>
>>    
>>                
>>                        > name="BusinessEntityID" />
>>                        
>>                        
>>                        
>>                        
>>                        
>>                        
>>                
>>    
>
> Never had occasion to try this, but there seems to be a problem
> in the configuration here:
> * You have defined datasource "ds-2" with
>     
>   but there is no similar definition for "ds-1". You need to add
>   another entry at the top:
>     
> * I believe things should work after that, but if not you could try
>  remapping the field names in the second SELECT, and
>  correspondingly changing the field attributes for the second
>  entity, e.g.,
>     
>     ...
>
> Regards,
> Gora



-- 
Lance Norskog
goks...@gmail.com


Re: edismax and untokenized field

2012-06-12 Thread Afroz Ahmad
In the example above your schema applies the tokenizers and filters only
at index time. For your query terms to also pass through the same
pipeline, you need to modify the field type and add an <analyzer
type="query"> section. I believe this should fix your problem.
Thanks
Afroz

On Mon, Jun 11, 2012 at 10:25 AM, Vijay Ramachandran wrote:

> Thank you for your reply. Sending this as a phrase query does change the
> results as expected.
>
> On Mon, Jun 11, 2012 at 4:39 PM, Tanguy Moal 
> wrote:
>
> > I think you have to issue a phrase query in such a case because otherwise
> > each "token" is searched independently in the merchant field: the query
> > parser splits the query on spaces!
> >
> >
> So the parsing of the query depends in part on the query handler itself,
> independent of the field definition?
>
>
> > Check the difference between debug outputs when you search for "Jones New
> > York", you'd get what you expected.
> >
>
> Yes, that gives the expected result. So, I should make a separate query to
> the merchant field as a phrase?
>
> thanks!
> Vijay
>


Re: Unexpected DIH behavior for onError attribute

2012-06-12 Thread Gora Mohanty
On 13 June 2012 01:17, Pranav Prakash  wrote:
> It seems that upon setting onError=skip, the DIH does not proceed to the next
> records in the db; only those entries which were prior to an
> error-causing record are being updated/added.
[...]

Please show us your DIH configuration file,
remembering to sanitise usernames/passwords
used for database access.

Also, you might want to look into the Solr
log files to see if there are any errors
reported there.

Regards,
Gora


Re: Indexing Multiple Datasources

2012-06-12 Thread Gora Mohanty
On 11 June 2012 21:29, Kay  wrote:
> Hello,
>
> We have 2 MS SQL Server databases which we wanted to index. But most of the
> columns in the databases have the same names. E.g., both DBs have
> the columns FirstName, LastName, etc.

It is not clear how you want to handle this: Should the records
from both databases be indexed into the same fields, e.g.,
FirstName always mapped to the Solr field FirstName for
both databases, or would you want FirstName to be mapped
to, say, FirstName1, for the second database?

> How can you index multiple Databases using single db-data-config file and
> one schema?
>
> Here is my data-config file
> 
>
>  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url="jdbc:sqlserver://localhost;DatabaseName= " user="" password="" />
>    
>                
>                         name="BusinessEntityID" />
>                        
>                        
>                        
>                        
>
>                
>                
>
>    
>                
>                         name="BusinessEntityID" />
>                        
>                        
>                        
>                        
>                        
>                        
>                
>    

Never had occasion to try this, but there seems to be a problem
in the configuration here:
* You have defined datasource "ds-2" with
 
   but there is no similar definition for "ds-1". You need to add
   another entry at the top:
 
* I believe things should work after that, but if not you could try
  remapping the field names in the second SELECT, and
  correspondingly changing the field attributes for the second
  entity, e.g.,
 
 ...

Regards,
Gora


Re: Different sort for each facet

2012-06-12 Thread Jack Krupansky

f.people.facet.sort=count should work.

Make sure you don't have a conflicting setting for that same field and 
attribute.


Does the "people" facet sort by count correctly without the facet.sort=index parameter?

What are the attributes and field type for the "people" field?

-- Jack Krupansky

-Original Message- 
From: Christopher Gross

Sent: Tuesday, June 12, 2012 11:05 AM
To: solr-user
Subject: Different sort for each facet

In Solr 3.4, is there a way I can sort two facets differently in the same 
query?


If I have:

http://mysolrsrvr/solr/select?q=*:*&facet=true&facet.field=people&facet.field=category

is there a way that I can sort people by the count and category by the
name all in one query?  Or do I need to do that in separate queries?
I tried using "f.people.facet.sort=count" while also having
"facet.sort=index" but both came back in alphabetical order.

Doing more queries is OK, I'm just trying to avoid having to do too many.

-- Chris 



Interleaving Results from Sub-Queries

2012-06-12 Thread Andrew Morrison
I'm working on creating a Query that interleaves the results of a set of
sub-queries and was hoping I could get some input on the design.

The general idea is that, given Query q1 and Query q2, I'd like to add
them to a parent Query q0 so that when q0 is scored, the order of
results in score-descending order is

   1. 1st doc from q1
   2. 1st doc from q2
   3. 2nd doc from q1
   4. 2nd doc from q2
   5. 3rd doc from q1
   6. ...

In the attached code, we add all sub-query Scorers to an InterleaveScorer
that

   - scores all docs from each sub-scorer,
   - puts those {doc, score} pairs in a heap
   - iteratively pops from each heap until each is out of documents

Can anyone think of a cleaner way of doing this?
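Independent of the Lucene scorer plumbing, the round-robin ordering the test asserts can be sketched as a plain merge of per-query ranked lists (a simplified model for discussion, not the attached InterleaveScorer itself):

```java
import java.util.*;

public class Interleave {
    // Round-robin merge: the i-th doc of each sub-query's ranked list comes
    // before every (i+1)-th doc, preserving each list's internal order.
    static <T> List<T> interleave(List<List<T>> ranked) {
        List<T> out = new ArrayList<>();
        int longest = 0;
        for (List<T> l : ranked) longest = Math.max(longest, l.size());
        for (int i = 0; i < longest; i++)
            for (List<T> l : ranked)
                if (i < l.size()) out.add(l.get(i));
        return out;
    }

    public static void main(String[] args) {
        // Doc ids for q0=name:alpha and q1=name:beta, each in score order.
        List<Integer> alpha = Arrays.asList(0, 2, 1);
        List<Integer> beta = Arrays.asList(3, 4);
        System.out.println(interleave(Arrays.asList(alpha, beta)));
        // prints [0, 3, 2, 4, 1] -- the positions the test below expects
    }
}
```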

Below is a test that shows a general outline of its functionality.

  @Test
  public void testSimple() {
assertU(adoc("id", "0", "name", "alpha"));
assertU(adoc("id", "1", "name", "alpha"));
assertU(adoc("id", "2", "name", "alpha"));
assertU(adoc("id", "3", "name", "beta"));
assertU(adoc("id", "4", "name", "beta"));
assertU(commit());

SolrQueryRequest req =
  req("q", "{!interleave q0=name:alpha q1=name:beta}",
  "sort", "score desc",
  "fl", "id,score");

assertQ("id 0 should be in position 1", req,
"/response/result/doc[position()=1]/int[text()=0]");
assertQ("position 1 should have score 100.0", req,
"/response/result/doc[position()=1]/float[text()=100.0]");

assertQ("id 3 should be in position 2", req,
"/response/result/doc[position()=2]/int[text()=3]");
assertQ("position 2 should have score 99.9", req,
"/response/result/doc[position()=2]/float[text()=99.9]");

assertQ("id 2 should be in position 3", req,
"/response/result/doc[position()=3]/int[text()=2]");
assertQ("position 3 should have score 99.8", req,
"/response/result/doc[position()=3]/float[text()=99.8]");

assertQ("id 4 should be in position 4", req,
"/response/result/doc[position()=4]/int[text()=4]");
assertQ("position 4 should have score 99.7", req,
"/response/result/doc[position()=4]/float[text()=99.7]");

assertQ("id 1 should be in position 5", req,
"/response/result/doc[position()=5]/int[text()=1]");
assertQ("position 5 should have score 99.6", req,
"/response/result/doc[position()=5]/float[text()=99.6]");
  }


Andrew Morrison | Software Engineer | Etsy


Re: Indexing Multiple Datasources

2012-06-12 Thread Jack Krupansky
I believe that it will run them sequentially. The second would start only
after the first finishes.


Did you give both "entity" names in your Solr request (two "entity" options 
with the two top-level entity names)?


Although, if you specified no "entity" names in the request, DIH
should run them all (sequentially).


-- Jack Krupansky

-Original Message- 
From: Kay

Sent: Tuesday, June 12, 2012 2:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing Multiple Datasources

Thanks for the reply, Jack! We tried giving each data source a name (e.g.
ds-1, ds-2, etc.), but when we checked the log, it establishes a
connection with the first data source and indexes it, while the second DB
is getting ignored.

Yes! What we wanted to try: in our system we have many databases that use
lookup tables. Is Solr efficient at querying the databases even though our
system is using lookup tables?

I would appreciate your response,

Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Multiple-Datasources-tp3988957p3989255.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Promote Ranking based on Usage

2012-06-12 Thread Jack Krupansky
Just for reference you should start by reviewing Lucid's "click scoring 
framework":

http://lucidworks.lucidimagination.com/display/lweug/Click+Scoring+Relevance+Framework

To do all of that yourself is a major undertaking, but maybe you could 
identify a simpler subset that does just enough to satisfy your needs.


You could have a custom request handler (or even a separate application) 
that stores your click boost in an external file and then have a boost 
function query that references an ExternalFileField:

http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
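As a rough sketch of the external-file side: ExternalFileField reads lines of the form key=value. The writer below is a hypothetical helper for dumping click counts in that shape; the actual file name ("external_" plus the field name) and its location in the index data directory are conventions to verify against your Solr version's docs:

```java
import java.io.*;
import java.util.*;

public class ClickBoostWriter {
    // Write click-derived boosts in the key=value line format that
    // ExternalFileField parses. Keys are values of the schema's key field.
    static void writeBoosts(Map<String, Float> boosts, File out) throws IOException {
        try (PrintWriter w = new PrintWriter(new FileWriter(out))) {
            for (Map.Entry<String, Float> e : boosts.entrySet())
                w.println(e.getKey() + "=" + e.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Float> clicks = new LinkedHashMap<>();
        clicks.put("doc1", 14.0f);
        clicks.put("doc2", 3.0f);
        // In practice the target would live in Solr's data dir,
        // e.g. data/external_clickBoost; a temp file is used here.
        File f = File.createTempFile("external_clickBoost", null);
        writeBoosts(clicks, f);
        try (BufferedReader r = new BufferedReader(new FileReader(f))) {
            if (!"doc1=14.0".equals(r.readLine())) throw new AssertionError();
        }
        f.delete();
    }
}
```

A boost function such as boost=field(clickBoost) (or a bf referencing the field) would then fold these values into the score at query time.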

-- Jack Krupansky

-Original Message- 
From: jliz2803

Sent: Tuesday, June 12, 2012 2:41 PM
To: solr-user@lucene.apache.org
Subject: Promote Ranking based on Usage

Hi, we have just started using Solr at our company.  We have Solr set up and
are using C# to communicate with it.  The user will perform a search
then make a selection from the search results.  We want to promote documents
based on how often the user selects them.  I was wondering if someone could
point me in the right direction of how to properly configure Solr to handle
this, and what calls I need to send to Solr when a user selects an item so
its ranking gets promoted.  Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Promote-Ranking-based-on-Usage-tp3989258.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Unexpected DIH behavior for onError attribute

2012-06-12 Thread Jack Krupansky

Make sure you have the onError=skip on the proper entity.

-- Jack Krupansky

-Original Message- 
From: Pranav Prakash 
Sent: Tuesday, June 12, 2012 3:47 PM 
To: solr-user@lucene.apache.org 
Subject: Unexpected DIH behavior for onError attribute 


It seems that upon setting onError=skip, the DIH does not proceed to the next
records in the db; only those entries which were prior to an
error-causing record are being updated/added.

My db has 70K records, of which record #17188 is illegal. When I set
onError=abort, the entire DIH operation was rolled back and nothing got
added/updated. Upon setting onError=skip, only 17187 records were
added/updated. Upon setting onError=continue, only 17187 records were
added/updated.

Am I missing something or this is expected behavior?

*Pranav Prakash*

"temet nosce"


Re: Indexing Data option for subdirectories?

2012-06-12 Thread Erik Hatcher
If they aren't Solr XML format, but you can write an XSLT to transform it to 
Solr XML, you can use this: 


Erik


On Jun 12, 2012, at 15:20 , Jack Krupansky wrote:

> There isn't a recursion option for post.jar (I did check.)
> 
> Maybe your best bet is the "find" shell command. This may not be 100% 
> correct, but something like:
> 
>   find /data -name '*.xml' -exec java -jar post.jar {}
> 
> This is assuming that these are pre-formatted Solr XML update files with 
> "" and "".
> 
> If they are not in solr xml format and require translation, DIH with
> FileDataSource and FileListEntityProcessor, which supports recursion, may be
> the way to go:
> http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
> 
> -- Jack Krupansky
> 
> -Original Message- From: Bruno Mannina
> Sent: Tuesday, June 12, 2012 3:06 AM
> To: solr-user@lucene.apache.org
> Subject: Indexing Data option for subdirectories?
> 
> Dear All,
> 
> Is there a way to index data under sub-directories directly?
> 
> I have several files under sub-directories like:
> /data/A/001/*.xml
> /data/A/002/*.xml
> /data/A/003/*.xml
> /data/A/004/*.xml
> ...
> /data/B/001/*.xml
> ...
> 
> /data/Z/999/*.xml
> 
> I would like to index directly with
> 
> *i.e. java -jar post.jar -R /data*
> 
> Is it possible?
> 
> thanks a lot,
> Bruno 



Re: Indexing Data option for subdirectories?

2012-06-12 Thread Gora Mohanty
On 13 June 2012 00:50, Jack Krupansky  wrote:
> There isn't a recursion option for post.jar (I did check.)
>
> Maybe your best bet is the "find" shell command. This may not be 100%
> correct, but something like:
>
>   find /data -name '*.xml' -exec java -jar post.jar {}
[...]

The above should end with a "\;", i.e.,
  find /data -name '*.xml' -exec java -jar post.jar {} \;
You can handle multiple posts to Solr in parallel if
you couple this with xargs.

This also assumes a UNIX system, but on most other
systems you could write a script that handles the
recursion into sub-directories, and posts each file.
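The recursion such a script needs is mechanical; a minimal Java sketch (the directory layout is the one from the question, and handing each found file to post.jar is left as a ProcessBuilder call):

```java
import java.io.File;
import java.util.*;

public class RecursiveXmlFinder {
    // Collect every *.xml file under root, recursing into sub-directories.
    static List<File> findXml(File root) {
        List<File> out = new ArrayList<>();
        File[] children = root.listFiles();
        if (children == null) return out;
        for (File f : children) {
            if (f.isDirectory()) out.addAll(findXml(f));
            else if (f.getName().endsWith(".xml")) out.add(f);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny /data-like tree in a temp dir and check the recursion.
        File root = File.createTempFile("data", "");
        root.delete(); root.mkdir();
        File sub = new File(root, "A/001"); sub.mkdirs();
        new File(sub, "doc.xml").createNewFile();
        new File(sub, "notes.txt").createNewFile();
        List<File> found = findXml(root);
        if (found.size() != 1 || !found.get(0).getName().equals("doc.xml"))
            throw new AssertionError(found);
        // Each found file could then be posted, e.g. with
        // new ProcessBuilder("java", "-jar", "post.jar", f.getPath()).start()
    }
}
```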

Regards,
Gora


Deduplication in MLT

2012-06-12 Thread Pranav Prakash
I have an implementation of Deduplication as mentioned at
http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search
results. I would like to achieve the same functionality in my MLT queries,
where the result set should include grouped documents. What is a good way
to do the same?


*Pranav Prakash*

"temet nosce"


Promote Ranking based on Usage

2012-06-12 Thread jliz2803
Hi, we have just started using Solr at our company.  We have Solr set up and
are using C# to communicate with it.  The user will perform a search
then make a selection from the search results.  We want to promote documents
based on how often the user selects them.  I was wondering if someone could
point me in the right direction of how to properly configure Solr to handle
this, and what calls I need to send to Solr when a user selects an item so
its ranking gets promoted.  Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Promote-Ranking-based-on-Usage-tp3989258.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Multiple Datasources

2012-06-12 Thread Kay
Thanks for the reply, Jack! We tried giving each data source a name (e.g.
ds-1, ds-2, etc.), but when we checked the log, it establishes a
connection with the first data source and indexes it, while the second DB
is getting ignored.

Yes! What we wanted to try: in our system we have many databases that use
lookup tables. Is Solr efficient at querying the databases even though our
system is using lookup tables?

I would appreciate your response,

Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Multiple-Datasources-tp3988957p3989255.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Data option for subdirectories?

2012-06-12 Thread Jack Krupansky

There isn't a recursion option for post.jar (I did check.)

Maybe your best bet is the "find" shell command. This may not be 100% 
correct, but something like:


   find /data -name '*.xml' -exec java -jar post.jar {}

This is assuming that these are pre-formatted Solr XML update files with 
"" and "".


If they are not in solr xml format and require translation, DIH with 
FileDataSource and FileListEntityProcessor, which supports recursion, may be 
the way to go:

http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor

-- Jack Krupansky

-Original Message- 
From: Bruno Mannina

Sent: Tuesday, June 12, 2012 3:06 AM
To: solr-user@lucene.apache.org
Subject: Indexing Data option for subdirectories?

Dear All,

Is there a way to index data under sub-directories directly?

I have several files under sub-directories like:
/data/A/001/*.xml
/data/A/002/*.xml
/data/A/003/*.xml
/data/A/004/*.xml
...
/data/B/001/*.xml
...

/data/Z/999/*.xml

I would like to index directly with

*i.e. java -jar post.jar -R /data*

Is it possible?

thanks a lot,
Bruno 



Re: solr nested multivalued fields

2012-06-12 Thread jerome
Thanks. From all the material I have looked at and searched, I am inclined to
believe that those are indeed my options; any others are still welcome...

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-nested-multivalued-fields-tp3989114p3989260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: I need help on escaping the special char.

2012-06-12 Thread Jack Krupansky
See: 
http://lucene.472066.n3.nabble.com/index-special-characters-solr-td3987157.html


Basically, list the special characters in a text file with the "types" 
attribute and map them to type "ALPHA".
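A complementary approach is to escape these characters at query time rather than remap them in the analyzer; the helper below roughly mirrors what SolrJ's ClientUtils.escapeQueryChars does (this is a sketch, not that class itself):

```java
public class QueryEscape {
    // Prefix each Lucene query-parser special character with a backslash
    // so it is treated as a literal rather than as syntax.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("+-&|!(){}[]^\"~*?:\\".indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("(1+1):2");
        if (!escaped.equals("\\(1\\+1\\)\\:2")) throw new AssertionError(escaped);
        System.out.println(escaped); // prints \(1\+1\)\:2
    }
}
```

Escaping '&' and '|' individually also covers the two-character operators && and ||.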


-- Jack Krupansky

-Original Message- 
From: Prachi Phatak

Sent: Tuesday, June 12, 2012 2:16 PM
To: solr-user@lucene.apache.org
Subject: I need help on escaping the special char.

I tried WordDelimiterFilterFactory with the types option. It doesn't seem to be working.

How can I escape, e.g., + - && || ! ( ) { } [ ] ^ " ~ * ? : \ in the configuration?

Prachi. 



I need help on escaping the special char.

2012-06-12 Thread Prachi Phatak
I tried WordDelimiterFilterFactory with the types option. It doesn't seem to be working.

How can I escape, e.g., + - && || ! ( ) { } [ ] ^ " ~ * ? : \ in the configuration?

Prachi.


Re: Sharding in SolrCloud

2012-06-12 Thread Mark Miller

On Jun 12, 2012, at 3:39 AM, lenz...@gfi.ihk.de wrote:

> Hello,
> 
> we tested SolrCloud in a setup with one collection, two shards and one 
> replica per shard and it works quite fine with some example data. 
> Now, we plan to set up our own collection and determine in how many shards 
> we should devide it. 
> We can estimate quite exactly the size of the collection, but we don't 
> know, what the best approach for sharding is, 
> even if we know the size and the amount of queries and updates.
> Is there any documentation or a kind of design guidelines for sharding a 
> collection in SolrCloud?
> 
> 
> Thanks & regards,
> Norman Lenzner


It's hard to tell - I think you want to start with an idea of how many docs you 
can fit on a single node. This can vary wildly depending on many factors. 
Generally you have to do some testing with your particular config and data. You 
can search the mailing lists and perhaps dig up a little info, but there is 
really no replacement for running some tests with real data.

Then you have to plan in your growth rate - resharding is naturally a 
relatively expensive operation. Once you have an idea of how many docs per 
machine seems comfortable, figure out how many machines you need given 
your estimated doc growth rate and perhaps some padding. You might not get it 
right, but if you expect the possibility of a lot of growth, erring on the 
side of more shards is obviously better.
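The capacity arithmetic Mark describes can be made concrete; every number below is illustrative, not a recommendation - the per-node figure has to come from your own tests:

```java
public class ShardPlan {
    // Shards needed = projected docs, plus padding, divided by the per-node
    // capacity you established by testing with real data and config.
    static int shardsNeeded(long projectedDocs, long docsPerNode, double padding) {
        return (int) Math.ceil(projectedDocs * (1.0 + padding) / (double) docsPerNode);
    }

    public static void main(String[] args) {
        // e.g. 120M docs projected, 25M comfortable per node, 20% padding
        System.out.println(shardsNeeded(120_000_000L, 25_000_000L, 0.20));
        // prints 6
    }
}
```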

- Mark Miller
lucidimagination.com













Re: deploy a brand new index in solrcloud

2012-06-12 Thread Mark Miller

On Jun 10, 2012, at 2:56 AM, Anatoli Matuskova wrote:

> I've thought in setting replication in solrCloud:
> http://www.searchworkings.org/forum/-/message_boards/view_message/339527#_19_message_339527
> What I don't know is if while replication is being handled, the replica
> slaves (that are not the master in replication) can keep handling puts via
> transaction log
> 

Yes.

We have two types of replication, peer-sync, which is simply an exchange of 
updates, and full index replication.

If we do full index replication we basically do this:

1. start buffering updates with the transaction log
2. send a commit to the master
3. replicate
4. apply buffered updates


- Mark Miller
lucidimagination.com













Re: Solr PHP highload search

2012-06-12 Thread Jack Krupansky
Add "&debugQuery=true" to your query and look at the "timing" section that 
comes back with the response to see a breakdown of QTime. It should offer 
some insight into which search component(s) are taking the most time. That 
might point you in the right direction for improvements.


Also, see how much JVM memory is available when you are running queries. 
Maybe memory is low and garbage collections are occurring too frequently.


-- Jack Krupansky

-Original Message- 
From: Alexandr Bocharov

Sent: Tuesday, June 12, 2012 3:40 AM
To: solr-user@lucene.apache.org
Subject: Solr PHP highload search

Hi, all.

I need advice for configuring Solr search to use at highload production.

I've written a user search engine (a PHP class) that uses over 70 parameters
for searching users.
The user database has over 30 million records.
The total index size is 6.4G when I use 1 node and 3.2G when I use 2 nodes.
The previous search engine could handle 700,000 queries per day for searching
users - that is ~8 queries/sec (4 MySQL servers with manual sharding via
Gearman)

Example of queries are:

[responseHeader] => SolrObject Object
   (
   [status] => 0
   [QTime] => 517
   [params] => SolrObject Object
   (
   [bq] => Array
   (
   [0] => bool_field1:1^30
   [1] => str_field1:str_value1^15
   [2] => tint_field1:tint_field1^5
   [3] => bool_field2:1^6
   [4] => date_field1:[NOW-14DAYS TO NOW]^20
   [5] => date_field2:[NOW-14DAYS TO NOW]^5
   )

   [indent] => on
   [start] => 0
   [q.alt] => *:*
   [wt] => xml
   [fq] => Array
   (
   [0] => tint_field2:[tint_value2 TO tint_value22]
   [1] => str_field1:str_value1
   [2] => str_field2:str_value2
   [3] => tint_field3:(tint_value3 OR tint_value32
OR tint_value33 OR tint_value34 OR tint_value5)
   [4] => tint_field4:tint_value4
   [5] => -bool_field1:[* TO *]
   )

   [version] => 2.2
   [defType] => dismax
   [rows] => 10
   )

   )


I tested my PHP search API and found that concurrent random queries (for
example, 10 queries at one time) increase QTime from an average of 500 ms
to 3000 ms on 2 nodes.

1. How can I tweak my queries or parameters or Solr's config to decrease
QTime?
2. What if I put my index data into an emulated RAM directory; could that
greatly increase performance?
3. Sorting by boost queries has a great influence on QTime, how can I
optimize boost queries?
4. If I split my 2 nodes on 2 machines into 6 nodes on 2 machines, 3 nodes
per machine, will it increase performance?
5. What is "multi-core query", how can I configure it, and will it increase
performance?

Thank you! 



Re: SolrJ dependencies

2012-06-12 Thread Sami Siren
On Tue, Jun 12, 2012 at 4:22 PM, Thijs  wrote:
> Hi
> I just checked out and build solr&lucene from branches/lucene_4x
>
> I wanted to upgrade my custom client to this new version (using solrj).
> So I copied lucene/solr/dist/apache-solr-solrj-4.0-SNAPSHOT.jar &
>  lucene/solr/dist/apache-solr-core-4.0-SNAPSHOT.jar to my project and I
> updated the other libs from the libs in /solr/dist/solrj-lib
>
> However, when I wanted to run my client I got exceptions indicating that I
> was missing the HTTPClient jars (httpclient, httpcore, httpmime).
> Shouldn't those go into lucene/solr/dist/solrj-lib as well?

Yes they should.

> Do I need to create a ticket for this?

Please do so.

--
 Sami Siren


Re: Issues with whitespace tokenization in QueryParser

2012-06-12 Thread John Berryman
Robert Muir told me that there is somewhat of a workaround for this. For
defType=lucene. Just escape every whitespace with a slash. So instead of *new
dress shoes* search for *new\ dress\ shoes*. Of course you lose the ability
to use normal lucene syntax.

I was hoping that this workaround would also work for defType=dismax, but
with or without the escaped whitespace, queries get interpreted the same,
incorrect way. For instance, assume I have the following line in my
synonyms.txt: *dress shoes => dress_shoes*. Further assume that I have a
field *experiment* that gets analysed with synonyms. A search for *new
dress shoes*(with or without escaped spaces) will be interpreted as

*+((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01)
(experiment:"new dress_shoes"~3)~0.01*

The first clause is mandatory and contains independently analysed tokens,
so this will only match documents that contain "dress", "new", or "shoes",
but never "dress shoes" because analysis takes place as expected at index
time.
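For the defType=lucene workaround, the escaping itself is mechanical; a minimal sketch (as noted above, this does not help dismax):

```java
public class WhitespaceEscape {
    // The workaround: put a backslash before each space so the lucene query
    // parser passes the whole phrase to analysis instead of splitting it.
    static String escapeSpaces(String q) {
        return q.replace(" ", "\\ ");
    }

    public static void main(String[] args) {
        String q = escapeSpaces("new dress shoes");
        if (!q.equals("new\\ dress\\ shoes")) throw new AssertionError(q);
        System.out.println(q); // prints new\ dress\ shoes
    }
}
```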


Re: SolrJ dependencies

2012-06-12 Thread Jack Krupansky
Maybe the migration from the "EOL" Commons HTTP Client to HTTP Components 
has something to do with this. The wiki probably needs Solr release-specific 
instructions. And maybe the lib folder is not quite right.


You can read about the migration here:
https://issues.apache.org/jira/browse/SOLR-2020

So, maybe you have some migration to do as well, or maybe something isn't 
quite right.


-- Jack Krupansky

-Original Message- 
From: Thijs

Sent: Tuesday, June 12, 2012 9:22 AM
To: solr-user@lucene.apache.org
Subject: SolrJ dependencies

Hi
I just checked out and build solr&lucene from branches/lucene_4x

I wanted to upgrade my custom client to this new version (using solrj).
So I copied lucene/solr/dist/apache-solr-solrj-4.0-SNAPSHOT.jar &
lucene/solr/dist/apache-solr-core-4.0-SNAPSHOT.jar to my project and I
updated the other libs from the libs in /solr/dist/solrj-lib

However, when I wanted to run my client I got exceptions indicating that
I was missing the HTTPClient jars (httpclient, httpcore, httpmime).
Shouldn't those go into lucene/solr/dist/solrj-lib as well?
Do I need to create a ticket for this?

Thijs




Different sort for each facet

2012-06-12 Thread Christopher Gross
In Solr 3.4, is there a way I can sort two facets differently in the same query?

If I have:

http://mysolrsrvr/solr/select?q=*:*&facet=true&facet.field=people&facet.field=category

is there a way that I can sort people by the count and category by the
name all in one query?  Or do I need to do that in separate queries?
I tried using "f.people.facet.sort=count" while also having
"facet.sort=index" but both came back in alphabetical order.

Doing more queries is OK, I'm just trying to avoid having to do too many.

-- Chris


Re: solr nested multivalued fields

2012-06-12 Thread Jack Krupansky

Maybe "Result Grouping/Field Collapsing" might work for you:
http://wiki.apache.org/solr/FieldCollapsing

Otherwise, multivalued string fields, with first and last name combined into 
one string, might be the best you can do.


-- Jack Krupansky

-Original Message- 
From: jerome

Sent: Tuesday, June 12, 2012 4:10 AM
To: solr-user@lucene.apache.org
Subject: solr nested multivalued fields



I would like to produce the following result in a Solr search result but not
sure it is possible to do? (Using Solr 3.6)


  
 
John
Darby
 
 
Sue
Berger
 
  


However, I can't seem to manage getting this tree-like structure in my
results. At best I can get something like the following, which is not
even close:


  
 John
 Darby
 Sue
 Berger
  


There are two problems here. Firstly, I cannot seem to "group" these people
into a meaningful tag structure as per the top example. Secondly, I can't for
the life of me get the tags to display an attribute name like "lastName" or
"firstName" when inside an array.

In my project I am pulling this data using a DIH, and from the example above
one can see that this is a one-to-many relationship between groups and users.

I really would appreciate it if someone has some suggestions or alternative
thoughts.

Any assistance would be greatly appreciated


--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-nested-multivalued-fields-tp3989114.html
Sent from the Solr - User mailing list archive at Nabble.com. 



RE: PageRanking with DIH

2012-06-12 Thread Dyer, James
To boost a document with DIH, see this section about "$docBoost" in the wiki 
here:  http://wiki.apache.org/solr/DataImportHandler#Special_Commands.

If you're using a RDBMS for source data, your query would have something like 
this in it: "select PAGE_RANK as '$docBoost', ... from ... etc"

If you don't want to boost entire documents but have it be very flexible at 
query time, see the page on Extended Dismax, especially the boost function 
section:  
http://wiki.apache.org/solr/ExtendedDisMax?highlight=%28edismax%29#bf_.28Boost_Function.2C_additive.29
 .  Also, the Packt Solr book (Smiley&Pugh) has a nice section about boosting 
scores based on page-rank or popularity type fields.  In the old first edition 
its chapter 5, "enhanced searching".

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com] 
Sent: Tuesday, June 12, 2012 4:54 AM
To: solr-user@lucene.apache.org
Subject: Re: PageRanking with DIH

On 12 June 2012 13:51, vineet yadav  wrote:
> Hi Gora,
> Thanks for reply.
> I have computed pagerank offline for the document set dump.  I ideally
> want to use pagerank and the solr relevancy score together in a formula to
> sort solr search results.  I have already looked at
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
> and found that indextimeboost is useful. I want to know how can I use
> indextimeboost ?
[...]

That depends on how you are indexing data into Solr.
That page explains how to do index-time boosting at
record level, and field level for XML documents uploaded
to Solr with post.sh.

If you are using the Solr DataImportHandler, you can boost
records, but not individual fields, as far as I am aware. Please
take a look at this thread for an example:
http://lucene.472066.n3.nabble.com/Index-time-boosting-with-DIH-td3206271.html

It would help if you did some basic groundwork, tried out things
for yourself, and asked more specific questions. You might
wish to read http://wiki.apache.org/solr/UsingMailingLists

Regards,
Gora


SolrJ dependencies

2012-06-12 Thread Thijs

Hi
I just checked out and build solr&lucene from branches/lucene_4x

I wanted to upgrade my custom client to this new version (using solrj).
So I copied lucene/solr/dist/apache-solr-solrj-4.0-SNAPSHOT.jar &  
lucene/solr/dist/apache-solr-core-4.0-SNAPSHOT.jar to my project and I 
updated the other libs from the libs in /solr/dist/solrj-lib


However, when I wanted to run my client I got exceptions indicating that
I was missing the HTTPClient jars (httpclient, httpcore, httpmime).

Shouldn't those go into lucene/solr/dist/solrj-lib as well?
Do I need to create a ticket for this?

Thijs
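If the client build uses Maven rather than hand-copied jars, the missing
pieces correspond to something like the following (the version numbers are an
assumption — match them to whatever ships in solr/dist/solrj-lib; httpcore
comes in transitively via httpclient):

```xml
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.3</version>
</dependency>
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpmime</artifactId>
  <version>4.1.3</version>
</dependency>
```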





solr nested multivalued fields

2012-06-12 Thread jerome


I would like to produce the following result in a Solr search response, but I am
not sure whether it is possible (using Solr 3.6):

<doc>
   <people>
      <person>
         <firstName>John</firstName>
         <lastName>Darby</lastName>
      </person>
      <person>
         <firstName>Sue</firstName>
         <lastName>Berger</lastName>
      </person>
   </people>
</doc>


However, I can't seem to manage getting this tree-like structure in my
results. At best I can get something like the following, which is not
even close:

<doc>
   <arr name="people">
      <str>John</str>
      <str>Darby</str>
      <str>Sue</str>
      <str>Berger</str>
   </arr>
</doc>


There are two problems here. Firstly, I cannot seem to "group" these people
into a meaningful tag structure as per the top example. Secondly, I can't for
the life of me get the tags to display an attribute name like "lastName" or
"firstName" when inside an array.

In my project I am pulling this data using a DIH, and from the example above
one can see that there is a one-to-many relationship between groups and users.

I really would appreciate it if someone has some suggestions or alternative
thoughts.

Any assistance would be greatly appreciated


--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-nested-multivalued-fields-tp3989114.html
Sent from the Solr - User mailing list archive at Nabble.com.
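Solr 3.6 has no nested-document support, so one workaround is to regroup the
flat multivalued arrays on the client side. A minimal sketch, assuming the
firstName and lastName arrays come back index-aligned (which you must verify
for your own DIH setup):

```python
def group_people(first_names, last_names):
    """Rebuild person records from Solr's flat multivalued field arrays."""
    if len(first_names) != len(last_names):
        # Misaligned arrays would make the pairing below silently wrong.
        raise ValueError("firstName/lastName arrays are not aligned")
    return [
        {"firstName": first, "lastName": last}
        for first, last in zip(first_names, last_names)
    ]

people = group_people(["John", "Sue"], ["Darby", "Berger"])
# people -> [{"firstName": "John", "lastName": "Darby"},
#            {"firstName": "Sue", "lastName": "Berger"}]
```

Alignment only holds if both fields are populated from the same DIH child
entity row order, so treat this as a workaround, not a guarantee.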


Re: Changing Index directory?

2012-06-12 Thread jamel essoussi
you can use the following configuration in solrconfig.xml:

<dataDir>${solr.data.dir:/opt/data/solr}/core_name</dataDir>

--> in this case you should specify the following JVM option:
-Dsolr.data.dir=(your path here)
--> /opt/data/solr/core_name is the default value
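Applied to Bruno's external-disk question from this thread, the two pieces
would fit together roughly like this (the /media/myExthdd path is taken from
his mail; start.jar assumes the stock Jetty example setup):

```
<!-- solrconfig.xml: default under /home/solr, overridable at startup -->
<dataDir>${solr.data.dir:/home/solr/data}</dataDir>
<!-- then start with:  java -Dsolr.data.dir=/media/myExthdd/data -jar start.jar -->
```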

2012/6/12 Jack Krupansky 

> It is "dataDir" in solrconfig.xml:
>
> http://wiki.apache.org/solr/SolrConfigXml#dataDir_parameter
>
> -- Jack Krupansky
>
> -Original Message- From: Bruno Mannina
> Sent: Tuesday, June 12, 2012 2:54 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Changing Index directory?
>
>
> On 12/06/2012 08:49, Bruno Mannina wrote:
>
>> Dear All,
>>
>> For tests, I would like to install Solr in the standard directory
>> (/home/solr) but with the index on an external hard disk (/media/myExthdd).
>> I suppose it will decrease performance, but that's not a problem.
>>
>> Where can I find the Index Directory Path variable?
>>
>> Thanks a lot,
>> Bruno
>>
>>
>>  sorry Solrconfig.xml ...
>



-- 

Best Regards
-- Jamel  ESSOUSSI


Re: Changing Index directory?

2012-06-12 Thread Jack Krupansky

It is "dataDir" in solrconfig.xml:

http://wiki.apache.org/solr/SolrConfigXml#dataDir_parameter

-- Jack Krupansky

-Original Message- 
From: Bruno Mannina

Sent: Tuesday, June 12, 2012 2:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Changing Index directory?

On 12/06/2012 08:49, Bruno Mannina wrote:

Dear All,

For tests, I would like to install Solr in the standard directory (/home/solr)
but with the index on an external hard disk (/media/myExthdd).

I suppose it will decrease performance, but that's not a problem.

Where can I find the Index Directory Path variable?

Thanks a lot,
Bruno


sorry Solrconfig.xml ... 



Re: Exception when optimizing index

2012-06-12 Thread Jack Krupansky

It's good to know that the situation is reproducible.

Maybe you could do a couple of smaller tests, such as running CheckIndex 
after loading only 10%, 25%, and 50% of the data to see if the problem 
occurs with less data or is dependent on a much higher document count.


And also check for any exceptions or even warnings in the logs before 
running CheckIndex.


What was the number of documents you believe were added to the index before
you ran CheckIndex this latest time? Can you do a query of *:* and see if its
count agrees?


-- Jack Krupansky
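For reference, CheckIndex is invoked roughly like this (the jar name is an
assumption — use the lucene-core jar matching the index; run against a copy
first, since -fix permanently drops unreadable segments):

```
java -ea:org.apache.lucene... -cp lucene-core-4.0-SNAPSHOT.jar \
    org.apache.lucene.index.CheckIndex /path/to/index

# only after backing up, to salvage what is still readable:
java -cp lucene-core-4.0-SNAPSHOT.jar \
    org.apache.lucene.index.CheckIndex /path/to/index -fix
```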

-Original Message- 
From: Rok Rejc

Sent: Tuesday, June 12, 2012 1:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Exception when optimizing index

Just as an addon:

I have deleted the whole index directory and loaded the data from scratch.
After the data was loaded (and I committed it) I ran CheckIndex again.
Again, there were a bunch of broken segments.

I will try with the latest trunk to see if the problem still exists.

Regards,
Rok


On Mon, Jun 11, 2012 at 8:32 AM, Rok Rejc  wrote:


Hi all,

I have run CheckIndex. It seems that the index is corrupted. I've got
plenty of exceptions like:

  test: terms, freq, prox...ERROR: 
java.lang.ArrayIndexOutOfBoundsException

java.lang.ArrayIndexOutOfBoundsException
at
org.apache.lucene.store.ByteArrayDataInput.readBytes(ByteArrayDataInput.java:181)
at
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextLeaf(BlockTreeTermsReader.java:2414)
at
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next(BlockTreeTermsReader.java:2400)
at
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next(BlockTreeTermsReader.java:2074)
at
org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:771)
at
org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1164)
at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:602)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1748)


and

  test: terms, freq, prox...ERROR: java.lang.RuntimeException: term [6f 70
65 72 61 63 69 6a 61]: doc 105407 <= lastDoc 105407
java.lang.RuntimeException: term [6f 70 65 72 61 63 69 6a 61]: doc 105407
<= lastDoc 105407
at
org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:858)
at
org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1164)
at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:602)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1748)
test: stored fields...OK [723321 total field count; avg 3 fields
per doc]



final warning was:


WARNING: 154 broken segments (containing 48127608 documents) detected
WARNING: would write new segments file, and 48127608 documents would be
lost, if -fix were specified


As I mentioned, I ran the optimization after the initial import (no further
adds or deletions were made).
For the import I'm creating CSV files and loading them through CSV upload
with multiple threads.

The index is otherwise queryable.

Any ideas what should I do next? Is this a bug in lucene?

Many thanks...

Rok









On Thu, Jun 7, 2012 at 5:05 PM, Jack Krupansky 
wrote:



Is the index otherwise usable for queries? And is it only the optimize
that is failing?

I suppose it is possible that the index could be corrupted, but it is
also possible that there is a bug in Lucene.

I would suggest running Lucene "CheckIndex" next. See what it has to say.

See:
https://builds.apache.org/job/Lucene-trunk/javadoc/core/org/apache/lucene/index/CheckIndex.html#main(java.lang.String[])


-- Jack Krupansky

-Original Message- From: Rok Rejc
Sent: Thursday, June 07, 2012 5:50 AM
To: solr-user@lucene.apache.org
Subject: Re: Exception when optimizing index


Hi Jack,

It's a virtual machine running on VMware vSphere 5 Enterprise Plus.
The machine has 30 GB vRAM, an 8-core vCPU at 3.0 GHz, and 2 TB SATA RAID-10
over iSCSI.

Operation system is CentOS 6.2 64bit.

Here are java infos:


 - catalina.base = /usr/share/tomcat6
 - catalina.home = /usr/share/tomcat6
 - catalina.useNaming = true
 - common.loader = ${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar
 - file.encoding = UTF-8
 - file.encoding.pkg = sun.io
 - file.separator = /
 - java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
 - java.awt.printerjob = sun.print.PSPrinterJob
 - java.class.path = /usr/share/tomcat6/bin/bootstrap.jar
   /usr/share/tomcat6/bin/tomcat-juli.jar
   /usr/share/java/commons-daemon.jar
 - java.class.version = 50.0
 - java.endorsed.dirs =
 - java.ext.dirs = /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/ext
   /usr/java/packages/lib/ext
 - java.home = /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre
 - java.io.tm

Re: PageRanking with DIH

2012-06-12 Thread Gora Mohanty
On 12 June 2012 13:51, vineet yadav  wrote:
> Hi Gora,
> Thanks for the reply.
> I have computed pagerank offline for a document-set dump. I ideally
> want to use the pagerank and Solr relevancy scores together in a formula
> to sort Solr search results. I have already looked at
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
> and found that index-time boost is useful. I want to know how I can use
> index-time boost?
[...]

That depends on how you are indexing data into Solr.
That page explains how to do index-time boosting at
record level, and field level for XML documents uploaded
to Solr with post.sh.

If you are using the Solr DataImportHandler, you can boost
records, but not individual fields, as far as I am aware. Please
take a look at this thread for an example:
http://lucene.472066.n3.nabble.com/Index-time-boosting-with-DIH-td3206271.html

It would help if you did some basic groundwork, tried out things
for yourselves, and asked more specific questions. You might
wish to read http://wiki.apache.org/solr/UsingMailingLists

Regards,
Gora
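For the archives, the record-level DIH boost from that thread looks roughly
like this — a sketch only, with a hypothetical table and DIH's special
$docBoost column (check the linked thread for the quoting details your
database needs):

```xml
<!-- data-config.xml sketch: 'pages' and 'pagerank' are hypothetical names -->
<entity name="page"
        query="SELECT id, title, pagerank AS '$docBoost' FROM pages">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
</entity>
```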


Re: PageRanking with DIH

2012-06-12 Thread vineet yadav
Hi Gora,
Thanks for the reply.
I have computed pagerank offline for a document-set dump. I ideally
want to use the pagerank and Solr relevancy scores together in a formula
to sort Solr search results. I have already looked at
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
and found that index-time boost is useful. I want to know how I can use
index-time boost?
Thanks
Vineet Yadav

On Tue, Jun 12, 2012 at 1:32 PM, Gora Mohanty  wrote:
> On 12 June 2012 13:04, vineet yadav  wrote:
>> Hi,
>> I have indexed documents and computed pagerank for them. I want
>> to update the pagerank for indexed documents and sort Solr search results
>> by pagerank.
>
> Your question is not entirely clear: What is "pagerank" in this case?
> A custom score that you can compute at indexing time, and by
> which you want to order retrieved results? If so, just add a pagerank
> field to your Solr records, ignore Solr's order, and instead sort results
> by that field.
>
>>  I did some research and found that an index-time boost can be used, but
>> I don't know how to use it. Can I boost documents at index time with
>> DIH? Can anybody help me in this regard? Can I use the Solr relevancy
>> score together with the PageRank score to sort search results? Any
>> suggestions are welcome
>
> This is confused: Do you want your "pagerank" as the sole basis for
> the ranking of returned results, or do you want it to be one of multiple
> (weighted) criteria? Maybe you should read
> http://wiki.apache.org/solr/SolrRelevancyFAQ
>
> Regards,
> Gora
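If index-time boosting turns out too rigid (it is baked in until the next
reindex), the blend vineet describes can also be sketched on the client side;
the log damping and the 0.5 weight below are illustrative assumptions, not
anything Solr prescribes:

```python
import math

def combined_score(relevance, pagerank, weight=0.5):
    """Blend Solr's relevance score with an offline-computed pagerank."""
    # log1p damps large pagerank values so text relevance still matters
    return relevance + weight * math.log1p(pagerank)

# (doc id, solr relevance score, offline pagerank)
docs = [("a", 1.2, 0.0), ("b", 1.0, 50.0)]
ranked = sorted(docs, key=lambda d: combined_score(d[1], d[2]), reverse=True)
# "b" overtakes "a" once its pagerank is blended in
```

Re-ranking client-side only works on the page of results you fetched, so it
suits "reorder the top N" better than a global sort.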


Re: what's better for in memory searching?

2012-06-12 Thread Mikhail Khludnev
If I get it right, it's a kind of per-process swappiness setting.

On Tue, Jun 12, 2012 at 3:57 AM, Li Li  wrote:

> is this method equivalent to setting vm.swappiness, which is global?
> Or can it set the swappiness for the JVM process only?
>
> On Tue, Jun 12, 2012 at 5:11 AM, Mikhail Khludnev
>  wrote:
> > Point about premature optimization makes sense for me. However some time
> > ago I've bookmarked potentially useful approach
> >
> http://lucene.472066.n3.nabble.com/High-response-time-after-being-idle-tp3616599p3617604.html
> .
> >
> > On Mon, Jun 11, 2012 at 3:02 PM, Toke Eskildsen  >wrote:
> >
> >> On Mon, 2012-06-11 at 11:38 +0200, Li Li wrote:
> >> > yes, I need average query time less than 10 ms. The faster the better.
> >> > I have enough memory for lucene because I know there are not too much
> >> > data. there are not many modifications. every day there are about
> >> > hundreds of document update. if indexes are not in physical memory,
> >> > then IO operations will cost a few ms.
> >>
> >> I'm with Michael on this one: It seems that you're doing a premature
> >> optimization. Guessing that your final index will be < 5GB in size with
> >> 1 million documents (give or take 900.000:-), relatively simple queries
> >> and so on, an average response time of 10 ms should be attainable even
> >> on spinning drives. One hundred document updates per day are not many,
> >> so again I would not expect problems.
> >>
> >> As is often the case on this mailing list, the advice is "try it". Using
> >> a normal on-disk index and doing some warm up is the easy solution to
> >> implement and nearly all of your work on this will be usable for a
> >> RAM-based solution, if you are not satisfied with the speed. Or you
> >> could buy a small & cheap SSD and have no more worries...
> >>
> >> Regards,
> >> Toke Eskildsen
> >>
> >>
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Tech Lead
> > Grid Dynamics
> >
> > 
> >  
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics


 


Re: PageRanking with DIH

2012-06-12 Thread Gora Mohanty
On 12 June 2012 13:04, vineet yadav  wrote:
> Hi,
> I have indexed documents and computed pagerank for them. I want
> to update the pagerank for indexed documents and sort Solr search results
> by pagerank.

Your question is not entirely clear: What is "pagerank" in this case?
A custom score that you can compute at indexing time, and by
which you want to order retrieved results? If so, just add a pagerank
field to your Solr records, ignore Solr's order, and instead sort results
by that field.

>  I did some research and found that an index-time boost can be used, but
> I don't know how to use it. Can I boost documents at index time with
> DIH? Can anybody help me in this regard? Can I use the Solr relevancy
> score together with the PageRank score to sort search results? Any
> suggestions are welcome

This is confused: Do you want your "pagerank" as the sole basis for
the ranking of returned results, or do you want it to be one of multiple
(weighted) criteria? Maybe you should read
http://wiki.apache.org/solr/SolrRelevancyFAQ

Regards,
Gora


Sharding in SolrCloud

2012-06-12 Thread Lenzner
Hello,

we tested SolrCloud in a setup with one collection, two shards and one 
replica per shard and it works quite fine with some example data. 
Now, we plan to set up our own collection and determine how many shards
we should divide it into.
We can estimate the size of the collection quite exactly, but we don't
know what the best approach to sharding is,
even if we know the size and the amount of queries and updates.
Is there any documentation, or some kind of design guideline, for sharding a
collection in SolrCloud?


Thanks & regards,
Norman Lenzner

PageRanking with DIH

2012-06-12 Thread vineet yadav
Hi,
I have indexed documents and computed pagerank for them. I want
to update the pagerank for indexed documents and sort Solr search results
by pagerank.
I did some research and found that an index-time boost can be used, but
I don't know how to use it. Can I boost documents at index time with
DIH? Can anybody help me in this regard? Can I use the Solr relevancy
score together with the PageRank score to sort search results? Any
suggestions are welcome!
Thanks


Indexing Data option for subdirectories?

2012-06-12 Thread Bruno Mannina

Dear All,

Is there a way to index data in sub-directories directly?

I have several files under sub-directories like:
/data/A/001/*.xml
/data/A/002/*.xml
/data/A/003/*.xml
/data/A/004/*.xml
...
/data/B/001/*.xml
...

/data/Z/999/*.xml

I would like to index directly with

i.e. java -jar post.jar -R /data

Is it possible?

thanks a lot,
Bruno
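post.jar in the 3.x example has no recursive flag, so a thin wrapper has to
collect the files first. A minimal sketch — the actual posting step (java -jar
post.jar <file>, or an HTTP POST to /solr/update) is left as a comment, since
it depends on your setup:

```python
import os

def find_xml_files(root):
    """Recursively collect *.xml files under root, e.g. /data/A/001/... ."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".xml"):
                paths.append(os.path.join(dirpath, name))
    return paths

# for path in find_xml_files("/data"):
#     subprocess.check_call(["java", "-jar", "post.jar", path])
```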