Re: Schema Change: Int -> String (i am the original poster, new email address)

2013-06-07 Thread z z
Maybe if I were to say that the column user_id will become user_ids
that would clarify things?

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

becomes

user_ids:2002+AND+created:[${from}+TO+${until}]+data:more

where I want 2002 to be an exact positive match on one of the user_ids
embedded in the TEXT ... not string :)  If I am totally off or making no
sense, feedback is very welcome.  I am just seeing lots of similar data
going into my db and it feels like Solr should be able to handle this.

I just want to know if transforming the data like that will still allow
exact searches against a user_id.  My language from a solr guru's point of
view is probably *very* poorly phrased ... exact and TEXT might not go
hand in hand.

Is the TEXT "20 1442 35" parsed as "20" "1442" "35" so that a search
against it for "1442" will yield exact results?  A search against "442"
won't match, right?

1. 20 1442 35
2. 20 442 35
3. 20 1442

user_ids:1442 - yields #1 & #3 always?
user_ids:442 - yields only #2 always?

My lack of understanding about what solr does when it indexes is shining
through :)
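
For what it's worth, here is a tiny Lucene 4.x sketch (just my own
experiment, the field name and values are the ones from this thread) that
shows what the whitespace tokenizer produces:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenDemo {
  public static void main(String[] args) throws Exception {
    // same analysis the user_id_string fieldType quoted below would apply
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_43);
    TokenStream ts = analyzer.tokenStream("user_ids",
        new StringReader("20 1442 35"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // prints: 20, 1442, 35 -- three whole terms
    }
    ts.end();
    ts.close();
    analyzer.close();
  }
}

Since each id comes out as its own term, a query like user_ids:1442 can
only match documents that contained the whole token 1442; 442 is a
different term and will not match unless you use wildcards or fuzzy syntax.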


On Fri, Jun 7, 2013 at 1:43 PM, z z zenlok.testi...@gmail.com wrote:

 My language might be a bit off (I am saying string when I probably mean
 text in the context of solr), but I'm pretty sure that my story is
 unwavering ;)

 `id` int(11) NOT NULL AUTO_INCREMENT
 `created` int(10)
 `data` varbinary(255)
 `user_id` int(11)

 So, imagine that we have 1000 entries come in where data above is
 exactly the same for all 1000 entries, but user_id is different (id and
 created being different is irrelevant).  I am thinking that prior to
 inserting into mysql, I should be able to concatenate the user_ids together
 with whitespace and then insert them into something like:

 `id` int(11) NOT NULL AUTO_INCREMENT
 `created` int(10)
 `data` varbinary(255)
 `user_id` blob

 Then on solr's end it will treat the user_id as Text and parse it (I want
 to say tokenize, but maybe my language is incorrect here?).

 Then when I search

 user_id:2002+AND+created:[${from}+TO+${until}]+data:more

 I want to be sure that if I look for user_id 2002, I will get data that
 only has a value 2002 in the user_id column and that a separate user with
 id 20 cannot accidentally pull data for user_id 2002 as a result of a
 fuzzy (my language ok?) match of 20 against (20)02.

 Current schema definition:

  <field name="user_id" type="int" indexed="true" stored="true"/>

 New schema definition:

 <field name="user_id" type="user_id_string" indexed="true" stored="true"/>
 ...
 <fieldType name="user_id_string" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLength="120"/>
   </analyzer>
 </fieldType>




Re: Solr 4.2.1 higher memory footprint vs Solr 3.5

2013-06-07 Thread Bernd Fehling
Hi Shawn,

I also had CMS with tons of tuning options but still had a bigger GC pause
once in a while. After switching to JDK7 I tried G1GC with no other options
and it runs perfectly.
With CMS I saw that the old and young generations were growing until they
had to do a GC. This produces the sawtooth and also takes longer GC pause
time.
With G1GC the GC runs more frequently and is better timed; it is softer,
more flexible.
I just removed all the old tuning and old GC options and have only the
G1GC option.

ulimit -c unlimited
ulimit -l 256
ulimit -m unlimited
ulimit -n 8192
ulimit -s unlimited
ulimit -v unlimited

JAVA_OPTS="-server -d64 -Xmx20g -Xms20g -XX:+UseG1GC
  -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc.log"

java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

Maybe I have just been lucky with it, but for big heaps it works fine.

Regards
Bernd


Am 06.06.2013 16:23, schrieb Shawn Heisey:
 On 6/6/2013 3:50 AM, Bernd Fehling wrote:
 What helped me a lot was switching to G1GC.
 Faster, smoother, very little ripple, nearly no sawtooth.
 
 When I tried G1, it did indeed produce a better looking memory graph,
 but it didn't do anything about my GC pauses.  They were several seconds
 with just CMS and NewRatio, and they actually seemed to get slightly
 worse when I tried G1 instead.
 
 To solve the GC pause problem, I've had to switch back to CMS and tack
 on several more tuning options, most of which are CMS-specific.  I'm not
 sure how to tune G1.  Have you done any additional tuning?
 
 Thanks,
 Shawn
 


Re: [blogpost] Memory is overrated, use SSDs

2013-06-07 Thread Toke Eskildsen
On Fri, 2013-06-07 at 07:15 +0200, Andy wrote:
 One question I have is did you precondition the SSD ( 
 http://www.sandforce.com/userfiles/file/downloads/FMS2009_F2A_Smith.pdf )? 
 SSD performance tends to take a very deep dive once all blocks are written at 
 least once and the garbage collector kicks in. 

Not explicitly so. The machine is our test server with the SSDs in RAID
0 with - to my knowledge - no TRIM support. They are 2½ years old, have
had a fair amount of data written, and have been 3/4 full most of the time.
At one point in time we experimented with 10M+ relatively small files
and a couple of 40GB databases, so the drives are definitely not in
pristine condition.

Anyway, as Solr searching is heavy on tiny random reads, I suspect that
search performance will be largely unaffected by SSD fragmentation. It
would be interesting to examine, but for now I cannot prioritize another
large performance test.


Thank you for your input. I will update the blog post accordingly,
Toke Eskildsen, State and University Library, Denmark



Re: nutch 1.4, solr 3.4 configuration error

2013-06-07 Thread Tuğcem Oral
I had a similar error. I couldn't find any documentation on which nutch and
solr versions are compatible. For instance, we're using nutch 1.6 on
hadoop 1.0.4 with solrj 3.4.0 and index crawled segments to solr 4.2.0. But
I remember that I could find a compatible version of solrj for nutch 1.4
(because of using hadoop). You can upgrade your nutch from 1.4 to 1.6
easily. And I also suggest you check your solrindex-mapping.xml in
your /conf directory.

Best,

Tugcem.


On Fri, Jun 7, 2013 at 12:58 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : ./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2
 -topN
 ...
 : Caused by: org.apache.solr.common.SolrException: Not Found
 :
 : Not Found
 :
 : request: http://localhost:8080/select?q=id:[* TO
 : *]&fl=id&rows=1&wt=javabin&version=2
 ...
 : Other possibly helpful information:
 : 1) The solr admin screen comes up fine in the browser.

 At which URL does the Solr admin screen come up fine in your browser?

 Best guess...

 1) you have solr installed such that it uses the webcontext /solr but
 you gave the wrong url to nutch (ie: try -solr
 "http://localhost:8080/solr")

 2) you are using multiple collections, and you may need to configure nutch
 to know about which collection you are using (ie: try -solr
 "http://localhost:8080/solr/collection1")

 ...if neither of those helps, i would suggest you follow up with the
 nutch-user list, as the nutch community is probably in the best position
 to help you configure nutch to work with Solr and vice versa.


 -Hoss




-- 
TO


Clear cache used by Solr

2013-06-07 Thread Varsha Rani
Hi

I'm trying to compare the performance of different Solr queries. In order
to get a fair test, I want to clear the cache between queries.

How is this done? Of course, one can restart the server, but I want to
know if there is a quicker way.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Clear-cache-used-by-Solr-tp4068817.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr.NoOpDistributingUpdateProcessorFactory in SOLR CLOUD

2013-06-07 Thread sathish_ix
Hi ,

I need more information on how NoOpDistributingUpdateProcessorFactory
works. Below is the cloud setup:

collection1 --- shard1 --- node1:8983 (leader)
     |             |______ node2:8984
     |
     |____ shard2 --- node3:7585 (leader)
                  |______ node4:7586


node1, node2, node3 and node4 are 4 separate solr instances running on 4
tomcat containers.

We have included the following tag in solrconfig.xml, so as not to
distribute the index across shards:

<updateRequestProcessorChain>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
  <processor class="solr.NoOpDistributingUpdateProcessorFactory" />
</updateRequestProcessorChain>


We are able to accomplish the task of loading an index onto only a single
shard by using NoOpDistributingUpdateProcessorFactory.


We loaded data into node 8984 of shard 1. After indexing, the size of the
index on node 8984 was 94MB, whereas the index size on the leader node for
shard 1 was 4kB. It seems that for shard 1 the leader is not performing
the index building, and replication is not working. But on a good note,
the index was not distributed to shard 2 (node 3, node 4).

When I removed the updateRequestProcessorChain tag above, the index is
distributed across shards and replication works fine.

My requirement is to store a specific region's index in a single shard, so
the region data is not distributed across shards.

Can you offer some help on this?

Thanks,
Sathish








--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-NoOpDistributingUpdateProcessorFactory-in-SOLR-CLOUD-tp4068818.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Configuring seperate db-data-config.xml per shard

2013-06-07 Thread sathish_ix
Hi,

we were able to accomplish this with a single collection.

Zookeeper :

create a separate node for each shard, and upload the dbconfig file under
each shard's node.

eg : /config/config1/shard1
  /config/config1/shard2
  /config/config1/shard3

In the solrconfig.xml,

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">${dbconfig}</str>
  </lst>
</requestHandler>


In solr.xml,

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" zkHost="localhost:2181">
  <cores defaultCoreName="core1" adminPath="/admin/cores"
      zkClientTimeout="${zkClientTimeout:15000}" host="${host:}" hostPort="9985"
      hostContext="${hostContext:}">

    <core loadOnStartup="true" instanceDir="core1" transient="false"
        name="core1">
      <property name="dbconfig" value="shard1/db-data-config.xml" />
    </core>

  </cores>
</solr>

This way you can configure dbconfig file per shard.

Thanks,
Sathish




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuring-seperate-db-data-config-xml-per-shard-tp4068383p4068819.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is there a way to load multiple schema when using zookeeper?

2013-06-07 Thread sathish_ix
Hi,

we were able to accomplish this with a single collection.

Zookeeper :

create a separate node for each shard, and upload the dbconfig file under
each shard's node.

eg : /config/config1/shard1
  /config/config1/shard2
  /config/config1/shard3

In the solrconfig.xml,

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">${dbconfig}</str>
  </lst>
</requestHandler>


In solr.xml,

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" zkHost="localhost:2181">
  <cores defaultCoreName="core1" adminPath="/admin/cores"
      zkClientTimeout="${zkClientTimeout:15000}" host="${host:}" hostPort="9985"
      hostContext="${hostContext:}">

    <core loadOnStartup="true" instanceDir="core1" transient="false"
        name="core1">
      <property name="dbconfig" value="shard1/db-data-config.xml" />
    </core>

  </cores>
</solr>

This way you can configure dbconfig file per shard.

Thanks,
Sathish 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-load-multiple-schema-when-using-zookeeper-tp4058358p4068821.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Clear cache used by Solr

2013-06-07 Thread Toke Eskildsen
On Fri, 2013-06-07 at 09:24 +0200, Varsha Rani wrote:
 I'm trying to compare the performance of different Solr queries. In order
 to get a fair test, I want to clear the cache between queries.

 How is this done? Of course, one can restart the server, but I want to
 know if there is a quicker way.

That depends on your system. If you are using Linux or OSX, this should
work:
sudo echo 1 > /proc/sys/vm/drop_caches
For Windows, CacheSet seems to provide the functionality:
http://technet.microsoft.com/en-us/sysinternals/bb897561.aspx


To avoid any leftover from memory mapping vs. cache trickery, I stop
Solr, issue the drop_caches call and start Solr again.

- Toke Eskildsen



Re: LotsOfCores feature

2013-06-07 Thread Aleksey
 A use case would be a web site or service that had millions of users, each of
 whom would have an active Solr core when they are active, but inactive
 otherwise. Of course those cores would not all reside on one node and
 ZooKeeper is out of the question for managing anything that is in the
 millions. This would be a true cloud or data center and even multi-data
 center app, not a cluster app.

I am getting a little bit confused again. It seems now the answer to
my question is a clear no?
Also, instead of managing cores, is it not possible to manage servers,
which will be in the tens and hundreds? As far as which core goes to which
server, that could be based on some hashing scheme.


Using Solr Scripts

2013-06-07 Thread Furkan KAMACI
I have a SolrCloud and I want to maintain some important things on it, e.g.
I will back up indexes, start and stop Solr nodes individually, send an
optimize request to the cloud, etc. However, I see that there is a scripts
folder that comes with Solr. Can I use some of those scripts for my
purposes, or should I implement something that connects to the Zookeeper
quorum via Solrj and does what I want?
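
Something like this minimal Solrj sketch is what I have in mind for the
second option (the zk addresses and collection name are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class CloudAdmin {
  public static void main(String[] args) throws Exception {
    // talk to the Zookeeper quorum instead of a single Solr node
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    server.connect();
    server.optimize(); // optimize request against the whole collection
    server.shutdown();
  }
}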


How to stop index distribution among shards in solr cloud

2013-06-07 Thread sathish_ix
Hi,

I have two shards; logically each shard corresponds to a region. Currently
the index is distributed across shards in solr cloud. How do I load an
index into a specific shard in solr cloud?

Any thoughts ?

Thanks,
Sathish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-stop-index-distribution-among-shards-in-solr-cloud-tp4068831.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr4.3 Internationalization.

2013-06-07 Thread bsargurunathan
Guys,

Please clarify the following questions regarding Solr Internationalization.

1) Initially my requirement is to support 2 languages (English & French)
for a Web application.
And we are using a Mysql DB.

2) So please share a good and easy approach to achieve it, with some
sample configs.

3) And my question is whether I need to index the data in both
languages (English & French) in different cores?

4) Or is indexing in English alone enough? Does solr have any mechanism
to handle multiple languages while retrieving? If so, please share
some sample configs.

Thanks
Guru



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr4-3-Internationalization-tp4068834.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LotsOfCores feature

2013-06-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
The Wiki page was not built for Cloud Solr.

We have done such a deployment where less than a tenth of the cores were
active at any given point in time. Though there were tens of millions of
indices, they were split among a large number of hosts.


If you don't insist on a Cloud deployment it is possible. I'm not sure if
it is possible with cloud.


On Fri, Jun 7, 2013 at 12:38 AM, Aleksey bitterc...@gmail.com wrote:

 I was looking at this wiki and linked issues:
 http://wiki.apache.org/solr/LotsOfCores

 they talk about a limit being 100K cores. Is that per server or per
 entire fleet, because zookeeper needs to manage that?

 I was considering a use case where I have tens of millions of indices
 but less than a million needs to be active at any time, so they need
 to be loaded on demand and evicted when not used for a while.
 Also, since the number one requirement is efficient loading, of course I
 assume I will store a prebuilt index somewhere so Solr will just
 download it and strap it in, right?

 The root issue is marked as "won't fix" but some other important
 subissues are marked as resolved. What's the overall status of the
 effort?

 Thank you in advance,

 Aleksey




-- 
-
Noble Paul


Re: SOLR CSV output in custom order

2013-06-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
Have you tried explicitly giving the field names (fl) as a parameter?
 http://wiki.apache.org/solr/CommonQueryParameters#fl
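
e.g. something like (the field names here are made up):

 http://localhost:8983/solr/select?q=*:*&wt=csv&fl=id,name,price

As far as I know, the csv response writer emits the columns in the order
the fl parameter lists them.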


On Thu, Jun 6, 2013 at 12:41 PM, anurag.jain anurag.k...@gmail.com wrote:

 I want the output of the csv file in the proper order.  When I use wt=csv
 it gives output in random order. Is there any way to get output in the
 proper format?

 Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-CSV-output-in-custom-order-tp4068527.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
-
Noble Paul


Re: [blogpost] Memory is overrated, use SSDs

2013-06-07 Thread Erick Erickson
Thanks for this, hard data is always welcome!

Another blog post for my reference list!

Erick

On Fri, Jun 7, 2013 at 2:59 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
 On Fri, 2013-06-07 at 07:15 +0200, Andy wrote:
 One question I have is did you precondition the SSD ( 
 http://www.sandforce.com/userfiles/file/downloads/FMS2009_F2A_Smith.pdf )? 
 SSD performance tends to take a very deep dive once all blocks are written 
 at least once and the garbage collector kicks in.

 Not explicitly so. The machine is our test server with the SSDs in RAID
 0 with - to my knowledge - no TRIM support. They are 2½ years old, have
 had a fair amount of data written, and have been 3/4 full most of the time.
 At one point in time we experimented with 10M+ relatively small files
 and a couple of 40GB databases, so the drives are definitely not in
 pristine condition.

 Anyway, as Solr searching is heavy on tiny random reads, I suspect that
 search performance will be largely unaffected by SSD fragmentation. It
 would be interesting to examine, but for now I cannot prioritize another
 large performance test.


 Thank you for your input. I will update the blog post accordingly,
 Toke Eskildsen, State and University Library, Denmark



Re: solr.NoOpDistributingUpdateProcessorFactory in SOLR CLOUD

2013-06-07 Thread Erick Erickson
I don't think you want the noop bits, I'd go back to the
standard definitions here.


What you _do_ want, I think, is the custom hashing option, see:
https://issues.apache.org/jira/browse/SOLR-2592
which has been in place since Solr 4.1. It allows you to
send documents to the shard of your choice, which I believe is
what you're really after here.
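
For example, with the default compositeId router you can prefix the
document id with a shard key; a quick Solrj sketch (the zk address and
field values are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RegionRouting {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("localhost:2181");
    server.setDefaultCollection("collection1");
    SolrInputDocument doc = new SolrInputDocument();
    // "regionA!" is the shard key: every id with this prefix
    // hashes to the same shard
    doc.addField("id", "regionA!12345");
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}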

Best
Erick

On Fri, Jun 7, 2013 at 3:31 AM, sathish_ix skandhasw...@inautix.co.in wrote:
 Hi ,

 I need more information on how NoOpDistributingUpdateProcessorFactory
 works. Below is the cloud setup:

 collection1 --- shard1 --- node1:8983 (leader)
      |             |______ node2:8984
      |
      |____ shard2 --- node3:7585 (leader)
                   |______ node4:7586


 node1, node2, node3 and node4 are 4 separate solr instances running on 4
 tomcat containers.

 We have included the following tag in solrconfig.xml, so as not to
 distribute the index across shards:

 <updateRequestProcessorChain>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
   <processor class="solr.NoOpDistributingUpdateProcessorFactory" />
 </updateRequestProcessorChain>


 We are able to accomplish the task of loading an index onto only a single
 shard by using NoOpDistributingUpdateProcessorFactory.


 We loaded data into node 8984 of shard 1. After indexing, the size of the
 index on node 8984 was 94MB, whereas the index size on the leader node for
 shard 1 was 4kB. It seems that for shard 1 the leader is not performing
 the index building, and replication is not working. But on a good note,
 the index was not distributed to shard 2 (node 3, node 4).

 When I removed the updateRequestProcessorChain tag above, the index is
 distributed across shards and replication works fine.

 My requirement is to store a specific region's index in a single shard, so
 the region data is not distributed across shards.

 Can you offer some help on this?

 Thanks,
 Sathish








 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/solr-NoOpDistributingUpdateProcessorFactory-in-SOLR-CLOUD-tp4068818.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Clear cache used by Solr

2013-06-07 Thread Erick Erickson
I really question whether this is valuable. Much of Solr performance
is there explicitly because of caches, so what you're measuring
is disk I/O to fill caches and any other latency. I'm just not sure
what operational information you'll get here.

But assuming that you're really getting actionable data, you can
comment out all of the caches in the solrconfig.xml file to at least
remove those. The underlying lucene caches will not be emptied,
but they'll always be filled anyway for all the queries after the first
few, you can't avoid them.

Best
Erick

On Fri, Jun 7, 2013 at 3:24 AM, Varsha Rani varsha.ya...@orkash.com wrote:
 Hi

 I'm trying to compare the performance of different Solr queries. In order
 to get a fair test, I want to clear the cache between queries.

 How is this done? Of course, one can restart the server, but I want to
 know if there is a quicker way.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Clear-cache-used-by-Solr-tp4068817.html
 Sent from the Solr - User mailing list archive at Nabble.com.


solr facet query on multiple search term

2013-06-07 Thread vrparekh
Hello All,

I require facet counts for multiple SearchTerms.
Currently I am doing two separate facet queries, one per search term, with
facet.range=dateField.

e.g.

 http://solrserver/select?q=1stsearchTerm&facet=on&facet-parameters

 http://solrserver/select?q=2ndsearchTerm&facet=on&facet-parameters

Note :: SearchTerm field will be text_en_splitting

Now I have found another way to do a facet query on multiple search terms,
by tagging and excluding:

e.g.

http://solrurl/select?start=0&rows=10&hl=off
&facet=on
&facet.range.start=2013-06-06T16%3a00%3a00Z
&facet.range.end=2013-06-07T16%3a00%3a01Z
&facet.range.gap=%2B1HOUR
&wt=xml
&sort=dateField+desc
&facet.range={!key=music+ex=movie}dateField
&fq={!tag=music}content:music
&facet.range={!key=movie+ex=music}dateField
&fq={!tag=movie}content:movie
&q=(col2:1+)
&fq=+dateField:[2013-06-05T16:00:00Z+TO+2013-06-07T16:00:00Z]+AND+(+Col1:test+)
&fl=col1,col2,col3


I have tested a few search terms, and it provides the same results as the
separate queries for each search term.
Is this the proper way (with respect to results and performance)?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-facet-query-on-multiple-search-term-tp4068856.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LotsOfCores feature

2013-06-07 Thread Erick Erickson
I should have been clearer, and others have mentioned... the lots of cores
stuff is really outside Zookeeper/SolrCloud at present. I don't think it's
incompatible, but it wasn't part of the design so it'll need some effort to
make it play nice with SolrCloud. I'm not sure there's actually a compelling
use-case for combining the two.

bq: Also, instead of managing cores is it not possible to manage servers
which will be in tens and hundreds?

Well, tens to hundreds of servers will work with SolrCloud. You could
theoretically take over routing documents (i.e. custom hashing) and
simply use SolrCloud without the lots of cores stuff. So the scenario
is that you have, say, 250 machines that will hold all your data and use
custom routing to get the right docs to the right core. Some of the upcoming
SolrJ being capable of sending requests only to the proper shard would
certainly help here. But this too is rather unexplored territory. I don't think
Zookeeper would really have a problem here because it's not moving much
data back and forth, the 1M limitation for data in ZK is on a per-core basis
and really applies only to the conf data, NOT the index.

But the current approach does lend itself to Jack's scenario. Essentially your
ClusterKeeper could send the index to one of the machines and create the
core there.

The current approach addresses the case where you are essentially doing
what Jack outlined semi-manually. That is, you're distributing your cores
around your cluster based on historical access patterns. It's pretty easy to
move the cores around by copying the dirs and using the auto-discovery
stuff to keep things in balance, but it's in no way automatic and probably
requires a restart (or at least core unload/load). Jack's idea
of doing this dynamically should work in that kind of scenario.

I can imagine, for instance, some relatively small number of physical
machines and all the user's indexes actually being kept on a networked
filesystem. The startup process is simply finding a machine with spare
capacity and telling it to create the core and pointing it at the pre-existing
index. On the assumption that the indexes fit into memory, you'd pay a
small penalty for start-up but wouldn't need to copy indexes around. You
could elaborate this as necessary, tuning the transient caches such that
you fit the number/size of users to particular hardware. If the store were
an HDFS file system, redundancy/backup/error recovery would come along
for free.

But under any scenario, one of the hurdles will be figuring out how many
simultaneous users of whatever size can actually be comfortably handled
by a particular piece of hardware. And usually there's some kind of long
tail just to make it worse. Most of your users will be under X documents,
and some users will be 100X. And updating would be interesting.

But I should emphasize that anything elaborate like this dynamic shuffling
is kind of theoretical at this point, meaning we haven't actually tested it. It
_should_ work, but I'm sure there will be some issues to flush out.

Best
Erick

On Fri, Jun 7, 2013 at 6:38 AM, Noble Paul നോബിള്‍  नोब्ळ्
noble.p...@gmail.com wrote:
 The Wiki page was not built for Cloud Solr.

 We have done such a deployment where less than a tenth of the cores were
 active at any given point in time. Though there were tens of millions of
 indices, they were split among a large number of hosts.


 If you don't insist on a Cloud deployment it is possible. I'm not sure if
 it is possible with cloud.


 On Fri, Jun 7, 2013 at 12:38 AM, Aleksey bitterc...@gmail.com wrote:

 I was looking at this wiki and linked issues:
 http://wiki.apache.org/solr/LotsOfCores

 they talk about a limit being 100K cores. Is that per server or per
 entire fleet, because zookeeper needs to manage that?

 I was considering a use case where I have tens of millions of indices
 but less than a million needs to be active at any time, so they need
 to be loaded on demand and evicted when not used for a while.
 Also, since the number one requirement is efficient loading, of course I
 assume I will store a prebuilt index somewhere so Solr will just
 download it and strap it in, right?

 The root issue is marked as "won't fix" but some other important
 subissues are marked as resolved. What's the overall status of the
 effort?

 Thank you in advance,

 Aleksey




 --
 -
 Noble Paul


Documents

2013-06-07 Thread acasaus
Good morning,

I would like to know how I can modify an xml file so that it accesses my
information and not the example information. I have one file from which I
obtain the information that I show to the user with Blacklight.

Sorry about my english,

Alex


Re: Documents

2013-06-07 Thread Dmitry Kan
hi,

you need to parse your custom xml file and transform it into an xml file
in the format solr understands. If you are familiar with xslt, you could
do that in a few lines, depending on the complexity of the input xml
file.
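
For reference, the xml format solr's /update handler expects looks roughly
like this (the field names are just placeholders):

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">My first document</field>
  </doc>
</add>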

Dmitry


On Fri, Jun 7, 2013 at 3:34 PM, acas...@greendata.com wrote:

 Good morning,

 I would like to know how I can modify an xml file so that it accesses my
 information and not the example information. I have one file from which I
 obtain the information that I show to the user with Blacklight.

 Sorry about my english,

 Alex



Re: Doubt Regarding Shards Index

2013-06-07 Thread sathish_ix
Hi ,

How did you distribute the index by year to different shards? Do we need
to write any code?

Thanks,
Sathish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Doubt-Regarding-Shards-Index-tp3629964p4068869.html
Sent from the Solr - User mailing list archive at Nabble.com.


[CROSS-POSTING] SOLR-4903 and SOLR-4904

2013-06-07 Thread Dmitry Kan
CROSS-POSTING from dev list.

Hi guys,

As discussed with Grant and Andrzej I have created two jiras related to
inefficiency in distributed faceting. This affects 3.4, but my gut feeling
is telling me 4.x is affected as well.

Regards,

Dmitry Kan

P.S. Asking this question won yours truly second prize on Stump the chump
this year. :)


Re: HdfsDirectoryFactory

2013-06-07 Thread Mark Miller
Eagle eye man.

Yeah, we plan on contributing hdfs support for Solr. I'm flying home today and 
will create a JIRA issue for it shortly after I get there.

- Mark

On Jun 6, 2013, at 6:16 PM, Jamie Johnson jej2...@gmail.com wrote:

 I've seen reference to an HdfsDirectoryFactory in the new Cloudera Search
 along with a commit in the Solr SVN (
 http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-tlog.xml?view=markup),
 is this something that is being made part of the core?  I've seen
 discussions in the past where folks have recommended not using an HDFS
 based DirectoryFactory for reasons like speed, any details/information that
 can be provided would be really appreciated.



Re: Doubt Regarding Shards Index

2013-06-07 Thread Dmitry Kan
Hi,

Sharding by time by itself does not need any custom code on the solr side:
just start indexing your data to a shard depending on the timestamp of
your document.

The querying part is trickier if you want to have one front-end solr: it
should know which shards to query. If querying all shards for each query
is fine for you, then you are good and no changes are needed.
Alternatively, you can shoot a query at a particular year's shard, knowing
the year of your user query.

Dmitry


On Fri, Jun 7, 2013 at 3:54 PM, sathish_ix skandhasw...@inautix.co.in wrote:

 Hi ,

 How did you distribute the index by year to different shards? Do we need
 to write any code?

 Thanks,
 Sathish



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Doubt-Regarding-Shards-Index-tp3629964p4068869.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.2.1 higher memory footprint vs Solr 3.5

2013-06-07 Thread Otis Gospodnetic
This is exactly what we did for a client (alas, using Elasticsearch). We
then observed better performance through SPM. We used the latest Oracle JVM.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jun 7, 2013 2:55 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de
wrote:

 Hi Shawn,

 I also had CMS with tons of tuning options but still had a bigger GC
 pause once in a while. After switching to JDK7 I tried G1GC with no other
 options and it runs perfectly.
 With CMS I saw that the old and young generations were growing until they
 had to do a GC. This produces the sawtooth and also takes longer GC
 pause time.
 With G1GC the GC runs more frequently and is better timed; it is softer,
 more flexible.
 I just removed all the old tuning and old GC options and have only the
 G1GC option.

 ulimit -c unlimited
 ulimit -l 256
 ulimit -m unlimited
 ulimit -n 8192
 ulimit -s unlimited
 ulimit -v unlimited

 JAVA_OPTS="-server -d64 -Xmx20g -Xms20g -XX:+UseG1GC
   -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc.log"

 java version "1.7.0_07"
 Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
 Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

 Maybe I have just been lucky with it, but for big heaps it works fine.

 Regards
 Bernd


 Am 06.06.2013 16:23, schrieb Shawn Heisey:
  On 6/6/2013 3:50 AM, Bernd Fehling wrote:
  What helped me a lot was switching to G1GC.
  Faster, smoother, very little ripple, nearly no sawtooth.
 
  When I tried G1, it did indeed produce a better looking memory graph,
  but it didn't do anything about my GC pauses.  They were several seconds
  with just CMS and NewRatio, and they actually seemed to get slightly
  worse when I tried G1 instead.
 
  To solve the GC pause problem, I've had to switch back to CMS and tack
  on several more tuning options, most of which are CMS-specific.  I'm not
  sure how to tune G1.  Have you done any additional tuning?
 
  Thanks,
  Shawn
 



Re: Clear cache used by Solr

2013-06-07 Thread Yonik Seeley
On Fri, Jun 7, 2013 at 7:32 AM, Erick Erickson erickerick...@gmail.com wrote:
 I really question whether this is valuable. Much of Solr performance
 is there explicitly because of caches

Right, and it's also the case that certain solr features are coded
with the cache in mind (i.e. they will be utilized for a single
request for things like highlighting, multi-select faceting, etc.)

On Fri, Jun 7, 2013 at 3:24 AM, Varsha Rani varsha.ya...@orkash.com wrote:
 I'm trying to compare the performance of different Solr queries. In order
 to get a fair test, I want to clear the cache between queries.

If you are using/testing lucene query syntax, you can just add an
additional term that doesn't match anything and then keep changing
it... that will prevent the query/filter cache from recognizing it as
the same.

q=(my big query I'm testing) ab

And then next time change the b to a c, etc.

Or you could explicitly tell solr not to cache it:

http://yonik.com/posts/advanced-filter-caching-in-solr/

q={!cache=false}(my big query I'm testing)

-Yonik
http://lucidworks.com


Re: LotsOfCores feature

2013-06-07 Thread Jack Krupansky
AFAICT, SolrCloud addresses the use case of distributed update for a 
relatively smaller number of collections (dozens?) that have a relatively 
larger number of rows - billions over a modest to moderate number of nodes 
(a handful to a dozen or dozens). So, maybe dozens of collections (some 
people still call these cores) that distribute hundreds of millions if not 
billions of rows over dozens (or potentially low hundreds) of nodes. 
Technically, ZK was designed for thousands of nodes, but I don't think that 
was for the use case of distributed query that constantly fans out to all 
shards.


Aleksey: What would you say is the average core size for your use case - 
thousands or millions of rows? And how sharded would each of your 
collections be, if at all?


-- Jack Krupansky

-Original Message- 
From: Noble Paul നോബിള്‍ नोब्ळ्

Sent: Friday, June 07, 2013 6:38 AM
To: solr-user@lucene.apache.org
Subject: Re: LotsOfCores feature

The Wiki page was not built for Cloud Solr.

We have done such a deployment where less than a tenth of the cores were
active at any given point in time. Though there were tens of millions of
indices, they were split among a large number of hosts.


If you don't insist on a Cloud deployment it is possible. I'm not sure if
it is possible with cloud.


On Fri, Jun 7, 2013 at 12:38 AM, Aleksey bitterc...@gmail.com wrote:


I was looking at this wiki and linked issues:
http://wiki.apache.org/solr/LotsOfCores

they talk about a limit being 100K cores. Is that per server or per
entire fleet, because zookeeper needs to manage that?

I was considering a use case where I have tens of millions of indices
but less than a million needs to be active at any time, so they need
to be loaded on demand and evicted when not used for a while.
Also, since the number one requirement is efficient loading, of course I
assume I will store a prebuilt index somewhere so Solr will just
download it and strap it in, right?

The root issue is marked as "won't fix" but some other important
subissues are marked as resolved. What's the overall status of the
effort?

Thank you in advance,

Aleksey





--
-
Noble Paul 



Re: Schema Change: Int -> String (i am the original poster, new email address)

2013-06-07 Thread Jack Krupansky

Right, a search for "442" would not match "1442".

-- Jack Krupansky

-Original Message- 
From: z z

Sent: Friday, June 07, 2013 2:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int -> String (i am the original poster, new
email address)


Maybe if I were to say that the column user_id will become user_ids
that would clarify things?

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

becomes

user_ids:2002+AND+created:[${from}+TO+${until}]+data:more

where I want 2002 to be an exact positive match on one of the user_ids
embedded in the TEXT ... not string :)  If I am totally off or making no
sense, feedback is very welcome.  I am just seeing lots of similar data
going into my db and it feels like Solr should be able to handle this.

I just want to know if transforming the data like that will still allow
exact searches against a user_id.  My language from a solr guru's point of
view is probably *very* poorly phrased ... exact and TEXT might not go
hand in hand.

Is the TEXT "20 1442 35" parsed as "20" "1442" "35" so that a search
against it for "1442" will yield exact results?  A search against "442"
won't match, right?

1. 20 1442 35
2. 20 442 35
3. 20 1442

user_ids:1442 - yields #1 & #3 always?
user_ids:442 - yields only #2 always?

My lack of understanding about what solr does when it indexes is shining
through :)


On Fri, Jun 7, 2013 at 1:43 PM, z z zenlok.testi...@gmail.com wrote:


My language might be a bit off (I am saying string when I probably mean
text in the context of solr), but I'm pretty sure that my story is
unwavering ;)

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` int(11)

So, imagine that we have 1000 entries come in where data above is
exactly the same for all 1000 entries, but user_id is different (id and
created being different is irrelevant).  I am thinking that prior to
inserting into mysql, I should be able to concatenate the user_ids 
together

with whitespace and then insert them into something like:

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` blob

Then on solr's end it will treat the user_id as Text and parse it (I want
to say tokenize, but maybe my language is incorrect here?).

Then when I search

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

I want to be sure that if I look for user_id 2002, I will get data that
only has a value 2002 in the user_id column and that a separate user 
with

id 20 cannot accidentally pull data for user_id 2002 as a result of a
fuzzy (my language ok?) match of 20 against (20)02.

Current schema definition:

 <field name="user_id" type="int" indexed="true" stored="true"/>

New schema definition:

<field name="user_id" type="user_id_string" indexed="true" stored="true"/>
...
<fieldType name="user_id_string" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLength="120"/>
  </analyzer>
</fieldType>






Re: OR query with null value and non-null value(s)

2013-06-07 Thread Jack Krupansky
Yes, it SHOULD! And in the LucidWorks Search query parser it does. Why 
doesn't it in Solr? Ask Yonik to explain that!


-- Jack Krupansky

-Original Message- 
From: Rahul R

Sent: Friday, June 07, 2013 1:21 AM
To: solr-user@lucene.apache.org
Subject: Re: OR query with null value and non-null value(s)

Thank you Shawn. This does work. To help me understand better, why do
we need the *:*? Shouldn't it be implicit?
Shouldn't
fq=(price:4+OR+(-price:[* TO *]))  //does not work
mean the same as
fq=(price:4+OR+(*:* -price:[* TO *]))   //works

Why does Solr need the *:* there?




On Fri, Jun 7, 2013 at 12:07 AM, Shawn Heisey s...@elyograg.org wrote:


On 6/6/2013 12:28 PM, Rahul R wrote:


I have recently enabled facet.missing=true in solrconfig.xml which gives
null facet values also. As I understand it, the syntax to do a faceted
search on a null value is something like this:
fq=-price:[* TO *]
So when I want to search on a particular value (for example : 4)  OR null
value, I would expect the syntax to be something like this:
fq=(price:4+OR+(-price:[* TO *]))
But this does not work. After searching around for more, read somewhere
that the right way to achieve this would be:
fq=-(-price:4+AND+price:[*+TO+*])

Now this does work but seems like a very roundabout way. Is there a
better way to achieve this?



Pure negative queries don't work -- you have to have results in the query
before you can subtract.  For some top-level queries, Solr is able to
detect this situation and fix it internally, but on inner queries you must
explicitly state your intentions.  It is best if you always use '*:*
-query' syntax, just to be safe.

fq=(price:4+OR+(*:* -price:[* TO *]))

Thanks,
Shawn






Re: Solr 4.2.1 higher memory footprint vs Solr 3.5

2013-06-07 Thread adityab
Hi All,
I work with Sandeep M, so continuing on his comments. We did observe memory
growth.
We use jdk1.6.0_45 with CMS. We see this issue because of large document
size. By large I mean our single documents have large multivalued fields.
We found that JIRA LUCENE-4995
(https://issues.apache.org/jira/browse/LUCENE-4995) is what we
experienced, and the patch seems to resolve our issue. We are performing
more tests around it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-2-1-higher-memory-footprint-vs-Solr-3-5-tp4067879p4068886.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Documents

2013-06-07 Thread Alexandre Rafalovitch
If you are trying to import an external XML file into your system, you
may want to look at the DataImportHandler. It is a good way to start; look
at the Wikipedia examples.
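
A minimal data-config.xml sketch for a flat xml file (the paths and xpaths
are placeholders you would adapt to your file):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="record" processor="XPathEntityProcessor"
            url="/path/to/yourfile.xml" forEach="/records/record">
      <field column="id"    xpath="/records/record/id" />
      <field column="title" xpath="/records/record/title" />
    </entity>
  </document>
</dataConfig>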

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, Jun 7, 2013 at 8:34 AM,  acas...@greendata.com wrote:
 Good morning,

 I would like to know how I can modify an xml file so that it accesses my
 information and not the example information. I have one file from which I
 obtain the information that I show to the user with Blacklight.

 Sorry about my english,

 Alex


Re: Solr4.3 Internationalization.

2013-06-07 Thread Alexandre Rafalovitch
It may be helpful to approach this from the other side. Specifically search.

Are you:
1) Expecting to search across both French and English content (e.g.
French, but falling back to English if a translation is missing)? If yes,
you want a single collection.
2) Is the French content completely separate from the English content, or
are they just a couple of translated fields in an otherwise shared entity?
If the latter, you want a single collection.
3) Are you accessing all languages at once when you retrieve a record
or just one language at a time? If all languages at once, you want a
single collection.

And so on. If your content is completely separate, you could do
different collections. Otherwise, you probably want the same
collection.

If you do want a single collection, there are a couple of things you
can do to make it transparent for the frontend code to switch between
languages and make search transparent. While not a production use, this
is explored in detail in my just-released book:
http://www.packtpub.com/apache-solr-for-indexing-data/book . The
corresponding example is at:
https://github.com/arafalov/solr-indexing-book/tree/master/published/languages
but I am not sure how easy it is to understand without the walkthrough
in the book.
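
As a rough illustration of the single-collection route, the schema can
simply carry one field per language (text_en and text_fr are the analyzer
chains from the stock example schema):

<field name="title_en" type="text_en" indexed="true" stored="true" />
<field name="title_fr" type="text_fr" indexed="true" stored="true" />

The frontend then decides which field to query, or you list both in an
edismax qf.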

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, Jun 7, 2013 at 6:08 AM, bsargurunathan bsargurunat...@gmail.com wrote:
 Guys,

 Please clarify the following questions regarding Solr Internationalization.

 1) Initially my requirement is to support 2 languages (English & French)
 for a Web application.
 And we are using a Mysql DB.

 2) So please share a good and easy approach to achieve it, with some
 sample configs.

 3) And my question is whether I need to index the data in both
 languages (English & French) in different cores?

 4) Or is indexing in English alone enough? Does solr have any mechanism
 to handle multiple languages while retrieving? If so, please share
 some sample configs.

 Thanks
 Guru



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr4-3-Internationalization-tp4068834.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-07 Thread Mark Wilson
Hi

I am having an issue with adding pdf documents to a SolrCloud index I have
set up.

I can index pdf documents fine using 4.3.0 on my local box, but I have a
SolrCloud instance set up on the Amazon Cloud (using 2 servers) and I get
an error.

It seems that it is not loading org.apache.pdfbox.pdmodel.PDPage. However,
the jar is in the directory, and referenced in the solrconfig.xml file:

  <lib dir="/www/solr/lib/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="/www/solr/lib/" regex="solr-cell-\d.*\.jar" />

  <lib dir="/www/solr/lib/contrib/clustering/lib/" regex=".*\.jar" />
  <lib dir="/www/solr/lib/" regex="solr-clustering-\d.*\.jar" />

  <lib dir="/www/solr/lib/contrib/langid/lib/" regex=".*\.jar" />
  <lib dir="/www/solr/lib/" regex="solr-langid-\d.*\.jar" />

  <lib dir="/www/solr/lib/contrib/velocity/lib" regex=".*\.jar" />
  <lib dir="/www/solr/lib/" regex="solr-velocity-\d.*\.jar" />

When I start Tomcat, I can see that the file has loaded.

2705 [coreLoadExecutor-4-thread-3] INFO org.apache.solr.core.SolrResourceLoader - Adding
'file:/www/solr/lib/contrib/extraction/lib/pdfbox-1.7.1.jar' to classloader

But when I try to add a document.

java -Durl=http://ec2-blah-blaheu-west-1.compute.amazonaws.com:8080/solr/quosa2-collection/update/extract
  -Dparams=literal.id=pdf1 -Dtype=text/pdf -jar post.jar 2008.Genomics.pdf


I get this error. I'm running on an Ubuntu machine.

Linux ip-10-229-125-163 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59
UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Error log.

88168 [http-bio-8080-exec-1] INFO org.apache.solr.update.processor.LogUpdateProcessor -
[quosa2-collection_shard1_replica1] webapp=/solr path=/update/extract
params={literal.id=pdf1} {} 0 1534
88180 [http-bio-8080-exec-1] ERROR org.apache.solr.servlet.SolrDispatchFilter -
null:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError:
/usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
cannot open shared object file: No such file or directory
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1009)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.UnsatisfiedLinkError:
/usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1825)
at java.lang.Runtime.load0(Runtime.java:792)
at java.lang.System.load(System.java:1059)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1846)
at java.lang.Runtime.loadLibrary0(Runtime.java:845)
at java.lang.System.loadLibrary(System.java:1084)
at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:67)
at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:47)
at java.security.AccessController.doPrivileged(Native Method)
at java.awt.Toolkit.loadLibraries(Toolkit.java:1648)
at java.awt.Toolkit.<clinit>(Toolkit.java:1670)
at java.awt.Color.<clinit>(Color.java:275)
at org.apache.pdfbox.pdmodel.PDPage.<clinit>(PDPage.java:72)
at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:212)
at

Custom Data Clustering

2013-06-07 Thread Raheel Hasan
Hi,

Can someone please tell me if there is a way to have custom clustering of
the data from solr query results? I am facing 2 issues currently:

 1. The Carrot clustering only applies clustering to the paged
results (i.e. the current pagination page's results).

 2. I need to have custom clustering and classify results into certain
classes only (i.e. only a few very specific words in the search results),
like for example "Red", "Green", "Blue" etc., and not "hello World",
"Known World", "green world" etc. (if you know what I mean here), where
all of these words, in both the Do and DoNot sets, exist in the search
results.

Please tell me how to achieve this. Perhaps Carrot/clustering is not needed
here and some other classifier is needed. So what to do here?

Basically, I cannot receive 1 million results and then process them via a
PHP array to classify them as per my needs. The classification must be
done in solr only.

Thanks

-- 
Regards,
Raheel Hasan


RE: How to stop index distribution among shards in solr cloud

2013-06-07 Thread James Thomas
This may help:

http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+SolrCloud
--- See Document Routing section.


-Original Message-
From: sathish_ix [mailto:skandhasw...@inautix.co.in] 
Sent: Friday, June 07, 2013 5:27 AM
To: solr-user@lucene.apache.org
Subject: How to stop index distribution among shards in solr cloud

Hi,

I have two shards; logically each shard corresponds to a region. Currently
the index is distributed across shards in solr cloud. How do I load an
index into a specific shard in solr cloud?

Any thoughts ?

Thanks,
Sathish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-stop-index-distribution-among-shards-in-solr-cloud-tp4068831.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.NoOpDistributingUpdateProcessorFactory in SOLR CLOUD

2013-06-07 Thread Chris Hostetter

: I don't think you want the noop bits, I'd go back to the
: standard definitions here.

Correct.

the NoOpDistributingUpdateProcessorFactory is for telling the update
processor chain that you do not want it to do any distribution of updates
at all -- whatever SolrCore you send the doc to is the only one that gets
it, and RunUpdateProcessor will write it to its local index.



-Hoss


Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-07 Thread Michael Della Bitta
Hi Mark,

This is a total shot in the dark, but does passing -Djava.awt.headless=true
when you run the server help at all?

More on awt headless mode:
http://www.oracle.com/technetwork/articles/javase/headless-136834.html
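
If you are on Tomcat, one place to set it (assuming a standard Tomcat
layout) would be bin/setenv.sh:

CATALINA_OPTS="$CATALINA_OPTS -Djava.awt.headless=true"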

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Fri, Jun 7, 2013 at 11:31 AM, Mark Wilson m...@sanger.ac.uk wrote:

 Hi

 I am having an issue with adding pdf documents to a SolrCloud index I have
 set up.

 I can index pdf documents fine using 4.3.0 on my local box, but I have a
 SolrCloud instance set up on the Amazon Cloud (using 2 servers) and I get
 an error.

 It seems that it is not loading org.apache.pdfbox.pdmodel.PDPage. However,
 the jar is in the directory, and referenced in the solrconfig.xml file:

   <lib dir="/www/solr/lib/contrib/extraction/lib" regex=".*\.jar" />
   <lib dir="/www/solr/lib/" regex="solr-cell-\d.*\.jar" />

   <lib dir="/www/solr/lib/contrib/clustering/lib/" regex=".*\.jar" />
   <lib dir="/www/solr/lib/" regex="solr-clustering-\d.*\.jar" />

   <lib dir="/www/solr/lib/contrib/langid/lib/" regex=".*\.jar" />
   <lib dir="/www/solr/lib/" regex="solr-langid-\d.*\.jar" />

   <lib dir="/www/solr/lib/contrib/velocity/lib" regex=".*\.jar" />
   <lib dir="/www/solr/lib/" regex="solr-velocity-\d.*\.jar" />

 When I start Tomcat, I can see that the file has loaded.

 2705 [coreLoadExecutor-4-thread-3] INFO org.apache.solr.core.SolrResourceLoader - Adding
 'file:/www/solr/lib/contrib/extraction/lib/pdfbox-1.7.1.jar' to classloader

 But when I try to add a document.

 java -Durl=http://ec2-blah-blaheu-west-1.compute.amazonaws.com:8080/solr/quosa2-collection/update/extract
   -Dparams=literal.id=pdf1 -Dtype=text/pdf -jar post.jar 2008.Genomics.pdf


 I get this error. I'm running on an Ubuntu machine.

 Linux ip-10-229-125-163 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59
 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

 Error log.

 88168 [http-bio-8080-exec-1] INFO org.apache.solr.update.processor.LogUpdateProcessor -
 [quosa2-collection_shard1_replica1] webapp=/solr path=/update/extract
 params={literal.id=pdf1} {} 0 1534
 88180 [http-bio-8080-exec-1] ERROR org.apache.solr.servlet.SolrDispatchFilter -
 null:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError:
 /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
 cannot open shared object file: No such file or directory
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
 at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
 at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1009)
 at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
 at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.lang.UnsatisfiedLinkError:
 /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
 cannot open shared object file: No such file or directory
 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
 at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
 at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
 at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1825)
 at java.lang.Runtime.load0(Runtime.java:792)
 at java.lang.System.load(System.java:1059)
 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
 at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
 at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
 at

Re: OR query with null value and non-null value(s)

2013-06-07 Thread Rahul R
Thank you for the Clarification Shawn.


On Fri, Jun 7, 2013 at 7:34 PM, Jack Krupansky j...@basetechnology.comwrote:

 Yes, it SHOULD! And in the LucidWorks Search query parser it does. Why
 doesn't it in Solr? Ask Yonik to explain that!

 -- Jack Krupansky

 -Original Message- From: Rahul R
 Sent: Friday, June 07, 2013 1:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: OR query with null value and non-null value(s)


 Thank you Shawn. This does work. To help me understand better, why do
 we need the *:* ? Shouldn't it be implicit ?
 Shouldn't
 fq=(price:4+OR+(-price:[* TO *]))  //does not work
 mean the same as
 fq=(price:4+OR+(*:* -price:[* TO *]))   //works

 Why does Solr need the *:* there ?




 On Fri, Jun 7, 2013 at 12:07 AM, Shawn Heisey s...@elyograg.org wrote:

  On 6/6/2013 12:28 PM, Rahul R wrote:

  I have recently enabled facet.missing=true in solrconfig.xml which gives
 null facet values also. As I understand it, the syntax to do a faceted
 search on a null value is something like this:
 fq=-price:[* TO *]
 So when I want to search on a particular value (for example : 4)  OR null
 value, I would expect the syntax to be something like this:
 fq=(price:4+OR+(-price:[* TO *]))
 But this does not work. After searching around for more, read somewhere
 that the right way to achieve this would be:
 fq=-(-price:4+AND+price:[*+TO+*])

 Now this does work but seems like a very roundabout way. Is there a
 better
 way to achieve this ?


 Pure negative queries don't work -- you have to have results in the query
 before you can subtract.  For some top-level queries, Solr is able to
 detect this situation and fix it internally, but on inner queries you must
 explicitly state your intentions.  It is best if you always use '*:*
 -query' syntax, just to be safe.

 fq=(price:4+OR+(*:* -price:[* TO *]))

 Thanks,
 Shawn






Re: LotsOfCores feature

2013-06-07 Thread Aleksey
 Aleksey: What would you say is the average core size for your use case -
 thousands or millions of rows? And how sharded would each of your
 collections be, if at all?

Average core/collection size wouldn't even be thousands, hundreds more
like. And the largest would be half a million or so but that's a
pathological case. I don't need sharding and queries that fan out to
different machines. In fact I'd like to avoid that so I don't have to
collate the results.


 The Wiki page was built not for Cloud Solr.

 We have done such a deployment where less than a tenth of cores were active
 at any given point in time. Though there were tens of millions of indices,
 they were split among a large no. of hosts.

 If you don't insist on a Cloud deployment it is possible. I'm not sure if it
 is possible with cloud

By Cloud you mean specifically SolrCloud? I don't have to have it if I
can do without it. Bottom line is I want a bunch of small cores to be
distributed over a fleet, each core completely fitting on one server.
Would you be willing to provide a little more details on your setup?
In particular, how are you managing the cores?
How do you route requests to proper server?
If you scale the fleet up and down, does reshuffling of the cores
happen automatically or is it an involved manual process?

Thanks,

Aleksey


Re: LotsOfCores feature

2013-06-07 Thread Jack Krupansky

Thanks. That's what I suspected. Yes, MegaMiniCores.

My scenario is purely hypothetical. But it is also relevant for 
multi-tenant use cases, where the users and schemas are not known in 
advance and are only online intermittently.


Users could fit three rough size categories: very small, medium, and very 
large. Over time a user might move from very small to medium to very large. 
Very large users could require their own dedicated clusters. Medium size 
could occasionally require a dedicated node, but not always. And very small 
is mostly offline but occasionally a fair number are online for short 
periods of time.


-- Jack Krupansky

-Original Message- 
From: Aleksey

Sent: Friday, June 07, 2013 3:44 PM
To: solr-user
Subject: Re: LotsOfCores feature


Aleksey: What would you say is the average core size for your use case -
thousands or millions of rows? And how sharded would each of your
collections be, if at all?


Average core/collection size wouldn't even be thousands, hundreds more
like. And the largest would be half a million or so but that's a
pathological case. I don't need sharding and queries that fan out to
different machines. In fact I'd like to avoid that so I don't have to
collate the results.



The Wiki page was built not for Cloud Solr.

We have done such a deployment where less than a tenth of cores were 
active

at any given point in time. Though there were tens of millions of indices,
they were split among a large no. of hosts.

If you don't insist on a Cloud deployment it is possible. I'm not sure if it
is possible with cloud


By Cloud you mean specifically SolrCloud? I don't have to have it if I
can do without it. Bottom line is I want a bunch of small cores to be
distributed over a fleet, each core completely fitting on one server.
Would you be willing to provide a little more details on your setup?
In particular, how are you managing the cores?
How do you route requests to proper server?
If you scale the fleet up and down, does reshuffling of the cores
happen automatically or is it an involved manual process?

Thanks,

Aleksey 



RE: SolrCloud Load Balancer weight

2013-06-07 Thread Vaillancourt, Tim
Cool!

Having those values influenced by stats is a neat idea too. I'll get on that 
soon.

Tim

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Monday, June 03, 2013 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud Load Balancer weight


On Jun 3, 2013, at 3:33 PM, Tim Vaillancourt t...@elementspace.com wrote:

 Should I JIRA this? Thoughts?

Yeah - it's always been in the back of my mind - it's come up a few times - 
eventually we would like nodes to report some stats to zk to influence load 
balancing. 

- mark


translating a character code to an ordinal?

2013-06-07 Thread geeky2
hello all,

environment: solr 3.5, centos

problem statement:  i have several character codes that i want to translate
to ordinal (integer) values (for sorting), while retaining the original code
field in the document.

i was thinking that i could use a copyField from my code field to my ord
field - then employ a pattern replace filter factory during indexing.

but won't the copyfield fail because the two field types are different?

ps: i also read the wiki about
http://wiki.apache.org/solr/DataImportHandler#Transformer the script
transformer and regex transformer - but was hoping to avoid this - if i
could.




thx
mark






Re: solr facet query on multiple search term

2013-06-07 Thread Erick Erickson
I'm a little confused here. Faceting is about counting docs that meet
your query restrictions. I.e. the q= and fq= clauses. So your original
problem statement simply cannot be combined into a single query
since your q= clauses are different. You could do something like
q=(firstterm OR secondterm)&facet.query=firstterm&facet.query=secondTerm
That would give you accurate facet counts for the terms, but it
certainly doesn't preserve the original intent of
q=firstterm&facet.query=blahblah.

But facet.query is only counted over the docs that match
the q= clause (well, the q= clause and any fq clauses). So perhaps
you can supply a few example input docs and desired counts on the other side.

Best
Erick

On Fri, Jun 7, 2013 at 8:01 AM, vrparekh vrpar...@gmail.com wrote:
 Hello All,

 I required facet counts for multiple SearchTerms.
 Currently I am doing two separate facet queries, one per search term, with
 facet.range=dateField

 e.g.

  http://solrserver/select?q=1stsearchTerm&fq=...&facet-parameters

  http://solrserver/select?q=2ndsearchTerm&fq=...&facet-parameters

 Note :: SearchTerm field will be text_en_splitting

 Now I have found another way to do a facet query on multiple search terms by
 tagging and excluding

 e.g.

 http://solrurl/select?start=0&rows=10&hl=off
 &facet=on
 &facet.range.start=2013-06-06T16%3a00%3a00Z
 &facet.range.end=2013-06-07T16%3a00%3a01Z
 &facet.range.gap=%2B1HOUR
 &wt=xml
 &sort=dateField+desc
 &facet.range={!key=music+ex=movie}dateField

 &fq={!tag=music}content:music&facet.range={!key=movie+ex=music}dateField
 &fq={!tag=movie}content:movie&q=(col2:1+)

 &fq=+dateField:[2013-06-05T16:00:00Z+TO+2013-06-07T16:00:00Z]+AND+(+Col1:test+)
 &fl=col1,col2,col3


 I have tested a few search terms, and it provides the same results as separate
 queries for each search term.
 Is this the proper way (in terms of results and performance)?





Re: translating a character code to an ordinal?

2013-06-07 Thread Jack Krupansky
This won't help you unless you move to Solr 4.0, but here's an update 
processor script from the book that can take the first character of a string 
field and add it as an integer value for another field:


 <updateRequestProcessorChain name="script-add-char-code">
   <processor class="solr.StatelessScriptUpdateProcessorFactory">
     <str name="script">add-char-code.js</str>
     <lst name="params">
       <str name="fieldName">content</str>
       <str name="codeFieldName">content_code_i</str>
     </lst>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

Here is the JavaScript script that should be placed in the add-char-code.js
file in the conf directory for the Solr collection:

 function processAdd(cmd) {
   var fieldName;
   var codeFieldName;
   if (typeof params !== "undefined") {
     fieldName = params.get("fieldName");
     codeFieldName = params.get("codeFieldName");
   }
   if (fieldName == null)
     fieldName = "content";
   if (codeFieldName == null)
     codeFieldName = "content_code_i";

   // Get value for named field, no-op if empty
   var value = cmd.getSolrInputDocument().getField(fieldName);
   if (value != null){
     var str = value.getFirstValue();

     // No-op if string is empty
     if (str != null && str.length() != 0){
       // Get code for first character
       var code = str.charCodeAt(0);
       logger.info("String: \"" + str + "\" len: " + str.length() + " code: " + code);

       // Set the character code output field value
       cmd.getSolrInputDocument().addField(codeFieldName, code);
     }
   }
 }

 function processDelete() {
   // Dummy - add if needed
 }

 function processCommit() {
   // Dummy - add if needed
 }

 function processRollback() {
   // Dummy - add if needed
 }

 function processMergeIndexes() {
   // Dummy - add if needed
 }

 function finish() {
   // Dummy - add if needed
 }

Test it:

 curl "http://localhost:8983/solr/update?commit=true&update.chain=script-add-char-code" \
 -H 'Content-type:application/json' -d '
 [{"id": "doc-1", "content": "abc"},
  {"id": "doc-2", "content": "1"},
  {"id": "doc-3", "content": ""},
  {"id": "doc-4"},
  {"id": "doc-5", "content": "\u0002 abc"},
  {"id": "doc-6", "content": ["And, this", "is the end", "of this test."]}]'


Results:

 "id":"doc-1",
 "content":["abc"],
 "content_code_i":97,

 "id":"doc-2",
 "content":["1"],
 "content_code_i":49,

 "id":"doc-3",
 "content":[""],

 "id":"doc-4",

 "id":"doc-5",
 "content":["\u0002 abc"],
 "content_code_i":2,

 "id":"doc-6",
 "content":["And, this",
   "is the end",
   "of this test."],
 "content_code_i":65,

-- Jack Krupansky

-Original Message- 
From: geeky2

Sent: Friday, June 07, 2013 6:27 PM
To: solr-user@lucene.apache.org
Subject: translating a character code to an ordinal?

hello all,

environment: solr 3.5, centos

problem statement:  i have several character codes that i want to translate
to ordinal (integer) values (for sorting), while retaining the original code
field in the document.

i was thinking that i could use a copyField from my code field to my ord
field - then employ a pattern replace filter factory during indexing.

but won't the copyfield fail because the two field types are different?

ps: i also read the wiki about
http://wiki.apache.org/solr/DataImportHandler#Transformer the script
transformer and regex transformer - but was hoping to avoid this - if i
could.




thx
mark







Re: Filtering on results with more than N words.

2013-06-07 Thread Jack Krupansky
Also from the book, here's an alternative update request processor that uses 
a JavaScript script to do the counting and field creation:

 <updateRequestProcessorChain name="script-add-word-count">
   <processor class="solr.StatelessScriptUpdateProcessorFactory">
     <str name="script">add-word-count.js</str>
     <lst name="params">
       <str name="fieldName">content</str>
       <str name="wordCountFieldName">content_wc_i</str>
     </lst>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

Here is the JavaScript script that should be placed in the add-word-count.js
file in the conf directory for the Solr collection:

 function processAdd(cmd) {
   var fieldName;
   var wordCountFieldName;
   if (typeof params !== "undefined") {
     fieldName = params.get("fieldName");
     wordCountFieldName = params.get("wordCountFieldName");
   }
   if (fieldName == null)
     fieldName = "content";
   if (wordCountFieldName == null)
     wordCountFieldName = "content_wc_i";

   // Get value(s) for named field
   var values = cmd.getSolrInputDocument().getField(fieldName).getValues();

   // Combine values into one string
   var str = "";
   var n = values.size();
   for (i = 0; i < n; i++)
     str += ' ' + values.get(i);

   // Compress out hyphens and underscores to join words
   var str_no_dash = str.replace(/-|_/g, '');

   // Replace words with simply X
   var str_x_words = str_no_dash.replace(/\w+/g, 'X');

   // Remove punctuation and white space, leaving just the Xes.
   var str_final = str_x_words.replace(/[^X]+/g, '');

   // A count of the Xes is a good proxy for the word count.
   var wordCount = str_final.length;

   // Set the word count output field value
   cmd.getSolrInputDocument().addField(wordCountFieldName, wordCount);
 }

 function processDelete() {
   // Dummy - add if needed
 }

 function processCommit() {
   // Dummy - add if needed
 }

 function processRollback() {
   // Dummy - add if needed
 }

 function processMergeIndexes() {
   // Dummy - add if needed
 }

 function finish() {
   // Dummy - add if needed
 }

A test:

 curl "http://localhost:8983/solr/update?commit=true&update.chain=script-add-word-count" \
 -H 'Content-type:application/json' -d '
 [{"id": "doc-1", "content": "Hello World"},
  {"id": "doc-2", "content": ""},
  {"id": "doc-3", "content": " -- --- !"},
  {"id": "doc-4", "content": "This is some more."},
  {"id": "doc-5", "content": "The CD-ROM, (and num_events_seen.)"},
  {"id": "doc-6", "content": "Four score and seven years ago our fathers
 brought forth on this continent a new nation, conceived in liberty,
 and dedicated to the proposition that all men are created equal.
 Now we are engaged in a great civil war, testing whether that nation,
 or any nation so conceived and so dedicated, can long endure. "},
  {"id": "doc-7", "content": "401(k)"},
  {"id": "doc-8", "content": ["And, this", "is the end", "of this test."]}]'


Results:

 "id":"doc-1",
 "content":["Hello World"],
 "content_wc_i":2,

 "id":"doc-2",
 "content":[""],
 "content_wc_i":0,

 "id":"doc-3",
 "content":[" -- --- !"],
 "content_wc_i":0,

 "id":"doc-4",
 "content":["This is some more."],
 "content_wc_i":4,

 "id":"doc-5",
 "content":["The CD-ROM, (and num_events_seen.)"],
 "content_wc_i":4,

 "id":"doc-6",
 "content":["Four score and seven years ago our fathers\n
 brought forth on this continent a new nation, conceived in liberty,\n
 and dedicated to the proposition that all men are created equal.\n
 Now we are engaged in a great civil war, testing whether that nation,\n
 or any nation so conceived and so dedicated, can long endure. "],
 "content_wc_i":54,

 "id":"doc-7",
 "content":["401(k)"],
 "content_wc_i":2,

 "id":"doc-8",
 "content":["And, this",
   "is the end",
   "of this test."],
 "content_wc_i":8,




-- Jack Krupansky
-Original Message- 
From: Jack Krupansky

Sent: Thursday, June 06, 2013 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Filtering on results with more than N words.


From the book, here's an update request processor chain which will count the
words in the content field and place the count in the content_len_i field.
Then you could do a range query on that count.

<updateRequestProcessorChain name="regex-count-words">

  <!-- Start with a copy of the content field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">content</str>
    <str name="dest">content_len_i</str>
  </processor>

  <!-- Combine multivalued input into a single string -->
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="delimiter"> </str>
  </processor>

  <!-- Remove hyphens and underscores - join parts into single word -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="pattern">-|_</str>
    <str name="replacement"></str>
  </processor>

  <!-- Reduce words into a single letter X -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="pattern">\w+</str>
    <str name="replacement">X</str>
  </processor>

  <!-- Remove punctuation 

Re: translating a character code to an ordinal?

2013-06-07 Thread geeky2
hello jack,

thank you for the code ;)

what book are you referring to?  AFAICT - all of the 4.0 books are future
order.

we won't be moving to 4.0 (soon enough).

so i take it - copyfield will not work, eg - i cannot take a code like ABC
and copy it to an int field and then use the regex to turn it into an
ordinal?

thx
mark






Re: translating a character code to an ordinal?

2013-06-07 Thread Jack Krupansky
Correct, you need either an update request processor, a custom field type, 
or to preprocess your input before you give it to Solr.


You can't do analysis on a non-text field.
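
For the preprocessing route (which works on your Solr 3.5), a minimal sketch 
under stated assumptions -- the ordinal field name and the code-to-ordinal 
mapping are made up for illustration. The indexing client computes the 
ordinal and sends both fields in the plain update XML:

  <add>
    <doc>
      <field name="id">1</field>
      <field name="code">ABC</field>
      <!-- ordinal computed client-side before posting, e.g. ABC=1, DEF=2 -->
      <field name="code_ord">1</field>
    </doc>
  </add>

You would then sort on code_ord while still searching and displaying the 
original code field.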

The book is my new Solr reference/guide that I will be self-publishing. We 
hope to make an Alpha draft available later next week.


-- Jack Krupansky
-Original Message- 
From: geeky2

Sent: Friday, June 07, 2013 8:08 PM
To: solr-user@lucene.apache.org
Subject: Re: translating a character code to an ordinal?

hello jack,

thank you for the code ;)

what book are you referring to?  AFAICT - all of the 4.0 books are future
order.

we won't be moving to 4.0 (soon enough).

so i take it - copyfield will not work, eg - i cannot take a code like ABC
and copy it to an int field and then use the regex to turn it into an
ordinal?

thx
mark







Re: Lucene/Solr Filesystem tunings

2013-06-07 Thread Tim Vaillancourt
I figured as much for atime, thanks Otis!

I haven't run benchmarks just yet, but I'll be sure to share whatever I
find. I plan to try ext4 vs xfs.

I am also curious what effect disabling journaling (ext2) would have,
relying on SolrCloud to manage 'consistency' over many instances vs FS
journaling. Anyone have opinions there? If I test I'll share the results.
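
(For reference, access times are controlled with mount options; a hypothetical 
/etc/fstab line, with the device and mount point as placeholders:

  /dev/sdb1  /var/solr  ext4  defaults,noatime,nodiratime  0  2

On recent kernels noatime already implies nodiratime, so listing both is just 
belt and braces.)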

Cheers,

Tim


On 4 June 2013 16:11, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi,

 You can use noatime, nodiratime, nothing in Solr depends on that as
 far as I know.  We tend to use ext4.  Some people love xfs.  Want to
 run some benchmarks and publish the results? :)

 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/





 On Tue, Jun 4, 2013 at 6:48 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
  Hey all,
 
  Does anyone have any advice or special filesystem tuning to share for
  Lucene/Solr, and which file systems they like more?
 
  Also, does Lucene/Solr care about access times if I turn them off (I think
  it doesn't care)?
 
  A bit unrelated: What are people's opinions on reducing some consistency
  things like filesystem journaling, etc (ext2?) due to SolrCloud's
 additional
  HA with replicas? How about RAID 0 x 3 replicas or so?
 
  Thanks!
 
  Tim Vaillancourt



Re: Two instances of solr - the same datadir?

2013-06-07 Thread Tim Vaillancourt
If it makes you feel better, I also considered this approach when I was in
the same situation with a separate indexer and searcher on one physical
Linux machine.

My main concern was re-using the FS cache between both instances - If I
replicated to myself there would be two independent copies of the index,
FS-cached separately.

I like the suggestion of using autoCommit to reload the index. If I'm
reading that right, you'd set an autoCommit on 'zero docs changing', or
just 'every N seconds'? Did that work?
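
For what it's worth, a sketch of the "empty commit" variant under stated 
assumptions (URL and schedule are made up) -- an external trigger like cron 
rather than autoCommit, since autoCommit generally only fires when there are 
pending docs:

  # hypothetical crontab entry: empty commit every minute so the read-only
  # searcher reopens and sees the writer's changes
  * * * * * curl -s "http://localhost:8983/solr/update?commit=true" >/dev/null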

Best of luck!

Tim


On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:

 So here it is, for the record, how I am solving it right now:

 Write-master is started with: -Dmontysolr.warming.enabled=false
 -Dmontysolr.write.master=true -Dmontysolr.read.master=
 http://localhost:5005
 Read-master is started with: -Dmontysolr.warming.enabled=true
 -Dmontysolr.write.master=false


 solrconfig.xml changes:

 1. all index changing components have this bit,
 enable="${montysolr.master:true}" - i.e.

 <updateHandler class="solr.DirectUpdateHandler2"
  enable="${montysolr.master:true}">

 2. for cache warming de/activation

 <listener event="newSearcher"
   class="solr.QuerySenderListener"
   enable="${montysolr.enable.warming:true}">...

 3. to trigger refresh of the read-only-master (from write-master):

 <listener event="postCommit"
   class="solr.RunExecutableListener"
   enable="${montysolr.master:true}">
   <str name="exe">curl</str>
   <str name="dir">.</str>
   <bool name="wait">false</bool>
   <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
 </listener>

 This works, I still don't like the reload of the whole core, but it seems
 like the easiest thing to do now.

 -- roman


 On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Peter,
 
  Thank you, I am glad to read that this usecase is not alien.
 
  I'd like to make the second instance (searcher) completely read-only, so
 I
  have disabled all the components that can write.
 
  (being lazy ;)) I'll probably use
  http://wiki.apache.org/solr/CollectionDistribution to call the curl
 after
  commit, or write some IndexReaderFactory that checks for changes
 
  The problem with calling the 'core reload' - is that it seems lots of
 work
  for just opening a new searcher, eeekkk...somewhere I read that it is
 cheap
   to reload a core, but re-opening the index searcher must be definitely
  cheaper...
 
  roman
 
 
  On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com
 wrote:
 
  Hi,
  We use this very same scenario to great effect - 2 instances using the
  same
  dataDir with many cores - 1 is a writer (no caching), the other is a
  searcher (lots of caching).
  To get the searcher to see the index changes from the writer, you need
 the
  searcher to do an empty commit - i.e. you invoke a commit with 0
  documents.
  This will refresh the caches (including autowarming), [re]build the
  relevant searchers etc. and make any index changes visible to the RO
  instance.
   Also, make sure to use <lockType>native</lockType> in solrconfig.xml to
  ensure the two instances don't try to commit at the same time.
  There are several ways to trigger a commit:
  Call commit() periodically within your own code.
  Use autoCommit in solrconfig.xml.
  Use an RPC/IPC mechanism between the 2 instance processes to tell the
  searcher the index has changed, then call commit when called (more
 complex
  coding, but good if the index changes on an ad-hoc basis).
  Note, doing things this way isn't really suitable for an NRT
 environment.
 
  HTH,
  Peter
 
 
 
  On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Replication is fine, I am going to use it, but I wanted it for
 instances
   *distributed* across several (physical) machines - but here I have one
   physical machine, it has many cores. I want to run 2 instances of solr
   because I think it has these benefits:
  
   1) I can give less RAM to the writer (4GB), and use more RAM for the
   searcher (28GB)
   2) I can deactivate warming for the writer and keep it for the
 searcher
   (this considerably speeds up indexing - each time we commit, the
 server
  is
   rebuilding a citation network of 80M edges)
   3) saving disk space and better OS caching (OS should be able to use
  more
   RAM for the caching, which should result in faster operations - the
 two
   processes are accessing the same index)
  
   Maybe I should just forget it and go with the replication, but it
  doesn't
   'feel right' IFF it is on the same physical machine. And Lucene
   specifically has a method for discovering changes and re-opening the
  index
   (DirectoryReader.openIfChanged)
  
   Am I not seeing something?
  
   roman
  
  
  
   On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman 
   jhell...@innoventsolutions.com wrote:
  
Roman,
   
Could you be more specific as to 

Re: Two instances of solr - the same datadir?

2013-06-07 Thread Roman Chyla
I have autoCommit set to fire after 40k recs/1800 secs. I have only tested
with manual commits, but I don't see why it should work differently.
Roman
On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:

 If it makes you feel better, I also considered this approach when I was in
 the same situation with a separate indexer and searcher on one physical
 Linux machine.

 My main concern was re-using the FS cache between both instances - If I
 replicated to myself there would be two independent copies of the index,
 FS-cached separately.

 I like the suggestion of using autoCommit to reload the index. If I'm
 reading that right, you'd set an autoCommit on 'zero docs changing', or
 just 'every N seconds'? Did that work?

 Best of luck!

 Tim


 On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:

  So here it is, for the record, how I am solving it right now:
 
  Write-master is started with: -Dmontysolr.warming.enabled=false
  -Dmontysolr.write.master=true -Dmontysolr.read.master=
  http://localhost:5005
  Read-master is started with: -Dmontysolr.warming.enabled=true
  -Dmontysolr.write.master=false
 
 
  solrconfig.xml changes:
 
  1. all index changing components have this bit,
  enable="${montysolr.master:true}" - i.e.
 
  <updateHandler class="solr.DirectUpdateHandler2"
   enable="${montysolr.master:true}">
 
  2. for cache warming de/activation
 
  <listener event="newSearcher"
    class="solr.QuerySenderListener"
    enable="${montysolr.enable.warming:true}">...
 
  3. to trigger refresh of the read-only-master (from write-master):
 
  <listener event="postCommit"
    class="solr.RunExecutableListener"
    enable="${montysolr.master:true}">
    <str name="exe">curl</str>
    <str name="dir">.</str>
    <bool name="wait">false</bool>
    <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
  </listener>
 
  This works, I still don't like the reload of the whole core, but it seems
  like the easiest thing to do now.
 
  -- roman
 
 
  On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Hi Peter,
  
   Thank you, I am glad to read that this usecase is not alien.
  
   I'd like to make the second instance (searcher) completely read-only,
 so
  I
   have disabled all the components that can write.
  
   (being lazy ;)) I'll probably use
   http://wiki.apache.org/solr/CollectionDistribution to call the curl
  after
   commit, or write some IndexReaderFactory that checks for changes
  
   The problem with calling the 'core reload' - is that it seems lots of
  work
   for just opening a new searcher, eeekkk...somewhere I read that it is
  cheap
    to reload a core, but re-opening the index searcher must be definitely
   cheaper...
  
   roman
  
  
   On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com
  wrote:
  
   Hi,
   We use this very same scenario to great effect - 2 instances using the
   same
   dataDir with many cores - 1 is a writer (no caching), the other is a
   searcher (lots of caching).
   To get the searcher to see the index changes from the writer, you need
  the
   searcher to do an empty commit - i.e. you invoke a commit with 0
   documents.
   This will refresh the caches (including autowarming), [re]build the
   relevant searchers etc. and make any index changes visible to the RO
   instance.
    Also, make sure to use <lockType>native</lockType> in solrconfig.xml to
   ensure the two instances don't try to commit at the same time.
   There are several ways to trigger a commit:
   Call commit() periodically within your own code.
   Use autoCommit in solrconfig.xml.
   Use an RPC/IPC mechanism between the 2 instance processes to tell the
   searcher the index has changed, then call commit when called (more
  complex
   coding, but good if the index changes on an ad-hoc basis).
   Note, doing things this way isn't really suitable for an NRT
  environment.
  
   HTH,
   Peter
  
  
  
   On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Replication is fine, I am going to use it, but I wanted it for
  instances
*distributed* across several (physical) machines - but here I have
 one
physical machine, it has many cores. I want to run 2 instances of
 solr
because I think it has these benefits:
   
1) I can give less RAM to the writer (4GB), and use more RAM for the
searcher (28GB)
2) I can deactivate warming for the writer and keep it for the
  searcher
(this considerably speeds up indexing - each time we commit, the
  server
   is
rebuilding a citation network of 80M edges)
3) saving disk space and better OS caching (OS should be able to use
   more
RAM for the caching, which should result in faster operations - the
  two
processes are accessing the same index)
   
Maybe I should just forget it and go with the replication, but it
   doesn't
'feel right' IFF it is on the 

Re: translating a character code to an ordinal?

2013-06-07 Thread geeky2
thx,


please send me a link to the book so i can get/purchase it.


thx
mark







custom field tutorial

2013-06-07 Thread geeky2
can someone point me to a custom field tutorial.

i checked the wiki and this list - but still a little hazy on how i would do
this.

essentially - when the user issues a query, i want my class to interrogate a
string field (containing several codes - example boo, baz, bar) 

and return a single integer field that maps to the string field (containing
the code).

example: 

boo=1
baz=2
bar=3

thx
mark







Re: LotsOfCores feature

2013-06-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
We set it up like this
+ individual solr instances are setup
+ external mapping/routing to allocate users to instances. This information
can be stored in an external data store
+ all cores are created as transient and loadonstart as false
+ cores come online on demand
+ as and when users' data gets bigger (or hosts are hot) they are migrated
between less-hit hosts using the built-in replication

Keep in mind we used the same schema for all users. Currently there is no way
to upload a new schema to Solr.
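
For anyone following along, a hypothetical solr.xml fragment for that kind of
setup on Solr 4.x (the core name and cache size are placeholders):

  <cores adminPath="/admin/cores" transientCacheSize="128">
    <!-- loaded lazily on first request, evictable when the transient
         cache fills up -->
    <core name="user_12345" instanceDir="user_12345"
          transient="true" loadOnStartup="false" />
  </cores>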
On Jun 8, 2013 1:15 AM, Aleksey bitterc...@gmail.com wrote:

  Aleksey: What would you say is the average core size for your use case -
  thousands or millions of rows? And how sharded would each of your
  collections be, if at all?

 Average core/collection size wouldn't even be thousands, hundreds more
 like. And the largest would be half a million or so but that's a
  pathological case. I don't need sharding and queries that fan out to
  different machines. In fact I'd like to avoid that so I don't have to
 collate the results.


  The Wiki page was built not for Cloud Solr.
 
  We have done such a deployment where less than a tenth of cores were
 active
   at any given point in time. Though there were tens of millions of indices,
   they were split among a large no. of hosts.
 
   If you don't insist on a Cloud deployment it is possible. I'm not sure if
 it
  is possible with cloud

 By Cloud you mean specifically SolrCloud? I don't have to have it if I
 can do without it. Bottom line is I want a bunch of small cores to be
 distributed over a fleet, each core completely fitting on one server.
 Would you be willing to provide a little more details on your setup?
 In particular, how are you managing the cores?
 How do you route requests to proper server?
 If you scale the fleet up and down, does reshuffling of the cores
 happen automatically or is it an involved manual process?

 Thanks,

 Aleksey



Re: custom field tutorial

2013-06-07 Thread Walter Underwood
What are you trying to do? This seems really odd. I've been working in search 
for fifteen years and I've never heard this request.

You could always return all the fields to the client and ignore the ones you 
don't want.

wunder

On Jun 7, 2013, at 8:24 PM, geeky2 wrote:

 can someone point me to a custom field tutorial.
 
 i checked the wiki and this list - but still a little hazy on how i would do
 this.
 
 essentially - when the user issues a query, i want my class to interrogate a
 string field (containing several codes - example boo, baz, bar) 
 
 and return a single integer field that maps to the string field (containing
 the code).
 
 example: 
 
 boo=1
 baz=2
 bar=3
 
 thx
 mark
 






Re: Multitable import - uniqueKey

2013-06-07 Thread sodoo
Thank you to all the members who replied. The issue is solved.


