In Solr 3.5, optimization increases the index size to double

2013-06-13 Thread Montu v Boda
Hi,

I have replicated my index from 1.4 to 3.5, and after replication I tried to
optimize the index in 3.5 with the URL below:
http://localhost:9002/solr35/collection1/update?optimize=true&commit=true

When I optimize the index in 3.5, it doubles the index size.

In 1.4 the size of the index is 428GB, and after optimization in 3.5 it becomes
791GB.

Thanks & Regards
Montu v Boda



--
View this message in context: 
http://lucene.472066.n3.nabble.com/in-Solr-3-5-optimization-increase-the-index-size-to-double-tp4070433.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering down terms in suggest

2013-06-13 Thread Aloke Ghoshal
Thanks Barani. It could also work out this way, provided we start with a large
set of suggestions initially to increase the likelihood of getting some
matches when filtering down with the second query.
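
For concreteness, a minimal SolrJ sketch of that two-step flow; the core URL,
field names, and the hard-coded suggestion list are assumptions for
illustration only:

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class FilteredSuggest {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    // Stand-in for the first query: a generous batch of raw suggestions.
    List<String> raw = Arrays.asList("brake pads", "left disc brake", "brackets");
    for (String suggestion : raw) {
      // Second query: check the suggestion still has hits under the filter.
      SolrQuery q = new SolrQuery("allText:\"" + suggestion + "\"");
      q.addFilterQuery("category:books");
      q.setRows(0);  // only the hit count is needed
      if (server.query(q).getResults().getNumFound() > 0) {
        System.out.println("keep: " + suggestion);
      }
    }
  }
}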


On Wed, Jun 12, 2013 at 10:51 PM, bbarani  wrote:

> I would suggest you take the suggested string and create another query to
> Solr along with the filter parameter.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Filtering-down-terms-in-suggest-tp4069627p4069997.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: Solr 4.0 Optimize: query very slow in the last few minutes before the optimize ends

2013-06-13 Thread Jeffery Wang
Hi Otis,

Sorry, it did not format properly. Here it is again:

Time      queryTime(ms)  CPU %  r/s     w/s    rMB/s  wMB/s  IO %
...
7:30:24   12             89     156.44  0      16.4   0      94.06
7:30:25   18             91     157     0      15.35  0      98.1
7:30:26   9              91     194     0      19.62  0      96.1
7:30:27   14             38     352     0      38.17  0      100.1
7:30:28   30             77     205.94  16.83  20.17  4.02   98.51
7:30:30   101            88     396     0      45.99  0      90.7
7:30:31   11             90     120     0      11.34  0      97.5
7:30:32   38             89     262.38  0      28.03  0      96.24
7:30:33   11             78     68      17     4.89   4.93   99.9
7:30:34   9              29     201     0      20.16  0      100.3
7:30:35   9              87     181     0      17.27  0      94.3
7:30:52   16594          26     36      0      0.14   0      99.3
7:30:53   31             80     368     0      42.43  0      94.3
7:31:23   28575          41     35      21     0.37   2.36   95.9
7:31:27   2676           60     127     0      13.76  0      83.5
7:31:28   8              59     279     0      30.99  0      99.4
7:32:22   53399          31     81      39     0.74   2.63   99.5  !!!
7:32:23   11             54     155     0      16.46  0      99.6
7:32:24   9              47     63.37   4.95   4.18   0.02   98.42
7:32:25   9              25     34      0      0.13   0      98.8
7:32:26   8              27     30      0      0.12   0      99.9
7:33:28   60199          28     30      2      0.12   0.01   99.8  !!


But why is the query always slow in the last few minutes? I have tested it many
times: the optimize lasts about 2 hours, and almost every time the query is
quick enough (about 30ms) during those 2 hours; it is only slow in the last few
minutes (queries then cost tens of seconds, up to about 60s in the table above).

Thanks,
Jeffery
-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: June 14, 2013 12:20
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.0 Optimize: query very slow in the last few minutes before
the optimize ends

Hi,

What you pasted from the console didn't come across well.  Yes, optimizing a
static index is OK, and yes, if your index is "very unoptimized" then it will
be slower than when it is optimized. Not sure if that addresses your
concerns...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Fri, Jun 14, 2013 at 12:04 AM, Jeffery Wang  
wrote:
> Does anyone know why the query is very slow in the last few minutes before
> the optimize ends?
>
> While Solr optimizes, I run a query in a loop (curl "query url", then sleep
> one second) to check the query speed. Normally the query time is acceptable,
> but it is always very slow in the last few minutes before the optimize ends.
> The Solr index size is about 22G after optimizing.
>
> The following is the query time cost, CPU, and IO usage. Throughout the
> optimize process the IO is high, which is understandable.
> time      query time(ms)  CPU %  r/s     w/s    rMB/s  wMB/s  IO %
> 7:30:24   12              89     156.44  0      16.4   0      94.06
> 7:30:25   18              91     157     0      15.35  0      98.1
> 7:30:26   9               91     194     0      19.62  0      96.1
> 7:30:27   14              38     352     0      38.17  0      100.1
> 7:30:28   30              77     205.94  16.83  20.17  4.02   98.51
> 7:30:30   101             88     396     0      45.99  0      90.7
> 7:30:31   11              90     120     0      11.34  0      97.5
> 7:30:32   38              89     262.38  0      28.03  0      96.24
> 7:30:33   11              78     68      17     4.89   4.93   99.9
> 7:30:34   9               29     201     0      20.16  0      100.3
> 7:30:35   9               87     181     0      17.27  0      94.3
> 7:30:52   16594           26     36      0      0.14   0      99.3
> 7:30:53   31              80     368     0      42.43  0      94.3
> 7:31:23   28575           41     35      21     0.37   2.36   95.9
> 7:31:27   2676            60     127     0      13.76  0      83.5
> 7:31:28   8               59     279     0      30.99  0      99.4
> 7:32:22   53399           31     81      39     0.74   2.63   99.5
> 7:32:23   11              54     155     0      16.46  0      99.6
> 7:32:24   9               47     63.37   4.95   4.18   0.02   98.42
> 7:32:25   9               25     34      0      0.13   0      98.8
> 7:32:26   8               27     30      0      0.12   0      99.9
> 7:33:28   60199           28     30      2      0.12   0.01   99.8
>
>
> Thanks,
> __
> 
> Jeffery Wang


Re: Best way to match umlauts

2013-06-13 Thread Steve Rowe
Aditya,

Char filters are applied prior to tokenization, so they can affect 
tokenization, but I can't think of any tokenization changes that accent 
stripping would cause.

Token filters can be re-ordered to achieve certain objectives.  For example, if 
you want to use a stemmer that only recognizes lowercase terms, you could put a 
lowercasing filter in front of it.

In your case, if you use a char filter to do the accent stripping, and you use 
a stemmer, you won't be able to order it after stemming, because char filters 
always precede tokenization, which always precedes token filtering.  Stripping 
accents before stemming could be a problem, though, if your stemmer assumes 
properly accented words in order to function properly; in that case, you'd want 
to use a token filter to do the accent stripping instead, and place it after 
your stemmer.
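
To make the ordering concrete, here is a minimal Lucene 4.x-style analyzer
sketch with the folding token filter placed after the stemmer; the choice of
the German2 Snowball stemmer is only an assumption for illustration:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StemThenFoldAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_43, reader);
    TokenStream stream = new LowerCaseFilter(Version.LUCENE_43, tokenizer);
    // The stemmer still sees the original accented/umlauted terms...
    stream = new SnowballFilter(stream, "German2");
    // ...and accents are folded away only after stemming.
    stream = new ASCIIFoldingFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
  }
}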

There may be other reasons you'd want to choose one over the other that I'm not 
thinking of, but primarily it's about choosing processing order to affect 
further stages in the pipeline.  If you don't think order matters, then you 
should be fine choosing either one. 

Steve

On Jun 13, 2013, at 8:17 PM, adityab  wrote:

> This might be a dumb question, but can you please point me to some key
> differences between the ASCIIFolding filter and a character filter using a
> mapping file?
> thanks
> Aditya
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070398.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query time exceeds timeout specified by parameter timeAllowed

2013-06-13 Thread Otis Gospodnetic
Hi Christof,

In short: yes, known behaviour, you can't rely on timeAllowed as you'd
think - it is limited to only a portion of total execution.
See http://search-lucene.com/?q=timeallowed&sort=newestOnTop&fc_project=Solr
for previous answers to this Q.
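
You can at least detect when the cutoff fires, since Solr flags truncated
results in the response header; a small SolrJ sketch with a placeholder URL
and query:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedCheck {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("body:example");
    q.set("timeAllowed", 100);  // ms; only bounds the document-collection phase
    QueryResponse rsp = server.query(q);
    // When the limit trips, the response header carries partialResults=true.
    if (Boolean.TRUE.equals(rsp.getResponseHeader().get("partialResults"))) {
      System.out.println("timeAllowed tripped; results are partial");
    }
  }
}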

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Thu, Jun 13, 2013 at 11:53 AM, Christof Doll  wrote:
> Hello,
>
> I just gave the parameter timeAllowed a try and noticed that in some cases
> the actual query time exceeds the timeout specified by the timeAllowed
> parameter, e.g., having set timeAllowed to 100 the actual query time is
> 300ms. Unfortunately, the documentation of the timeAllowed parameter is
> quite short and does not explain how the timeAllowed parameter is treated
> internally. Can anyone explain this behavior?
>
> Best regards,
> Christof
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Query-time-exceeds-timeout-specified-by-parameter-timeAllowed-tp4070266.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Suggest and Filtering

2013-06-13 Thread Ing. Jorge Luis Betancourt Gonzalez
If it is query suggestions you are looking for, what we've done is store the
user queries in a separate core and pull the suggestions from there.
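
A minimal SolrJ sketch of that pattern; the core name, both field names, and
the popularity sort are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class QueryCoreSuggest {
  public static void main(String[] args) throws Exception {
    // A dedicated core holding one document per logged user query.
    HttpSolrServer queries =
        new HttpSolrServer("http://localhost:8983/solr/queries");
    SolrQuery q = new SolrQuery("query_text:br*");   // prefix the user typed
    q.setSort("popularity", SolrQuery.ORDER.desc);   // most frequent first
    q.setRows(10);
    for (SolrDocument d : queries.query(q).getResults()) {
      System.out.println(d.getFieldValue("query_text"));  // one suggestion
    }
  }
}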

- Original Message -
From: "Brendan Grainger" 
To: solr-user@lucene.apache.org
Sent: Thursday, June 13, 2013 19:43:03
Subject: Suggest and Filtering

Hi Solr Guru's

I am trying to implement auto suggest where solr would suggest several
phrases that would return results as the user types in a query (as distinct
from autocomplete). e.g. say the user starts typing 'br' and we have
documents that contain "brake pads" and "left disc brake", solr would
suggest both of those phrases with "brake pads" first. I also want to only
look at documents that match a given filter query. So say I have a bunch of
documents for a toyota cressida that contain the bi-gram "brake pads",
while the documents for a honda accord don't have any brake pad articles.
If the user is filtering on the honda accord I wouldn't want "brake pads"
as a suggestion.

Right now, I've played with the suggest component and using faceting.

Any thoughts?

Thanks
Brendan

-- 
Brendan Grainger
www.kuripai.com

http://www.uci.cu


Re: Solr 4.0 Optimize: query very slow in the last few minutes before the optimize ends

2013-06-13 Thread Otis Gospodnetic
Hi,

What you pasted from the console didn't come across well.  Yes, optimizing
a static index is OK, and yes, if your index is "very unoptimized" then
it will be slower than when it is optimized. Not sure if that addresses
your concerns...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Fri, Jun 14, 2013 at 12:04 AM, Jeffery Wang
 wrote:
> Does anyone know why the query is very slow in the last few minutes before
> the optimize ends?
>
> While Solr optimizes, I run a query in a loop (curl "query url", then sleep
> one second) to check the query speed. Normally the query time is acceptable,
> but it is always very slow in the last few minutes before the optimize ends.
> The Solr index size is about 22G after optimizing.
>
> The following is the query time cost, CPU, and IO usage. Throughout the
> optimize process the IO is high, which is understandable.
> time      query time(ms)  CPU %  r/s     w/s    rMB/s  wMB/s  IO %
> 7:30:24   12              89     156.44  0      16.4   0      94.06
> 7:30:25   18              91     157     0      15.35  0      98.1
> 7:30:26   9               91     194     0      19.62  0      96.1
> 7:30:27   14              38     352     0      38.17  0      100.1
> 7:30:28   30              77     205.94  16.83  20.17  4.02   98.51
> 7:30:30   101             88     396     0      45.99  0      90.7
> 7:30:31   11              90     120     0      11.34  0      97.5
> 7:30:32   38              89     262.38  0      28.03  0      96.24
> 7:30:33   11              78     68      17     4.89   4.93   99.9
> 7:30:34   9               29     201     0      20.16  0      100.3
> 7:30:35   9               87     181     0      17.27  0      94.3
> 7:30:52   16594           26     36      0      0.14   0      99.3
> 7:30:53   31              80     368     0      42.43  0      94.3
> 7:31:23   28575           41     35      21     0.37   2.36   95.9
> 7:31:27   2676            60     127     0      13.76  0      83.5
> 7:31:28   8               59     279     0      30.99  0      99.4
> 7:32:22   53399           31     81      39     0.74   2.63   99.5
> 7:32:23   11              54     155     0      16.46  0      99.6
> 7:32:24   9               47     63.37   4.95   4.18   0.02   98.42
> 7:32:25   9               25     34      0      0.13   0      98.8
> 7:32:26   8               27     30      0      0.12   0      99.9
> 7:33:28   60199           28     30      2      0.12   0.01   99.8
>
>
> Thanks,
> __
> Jeffery Wang


Solr 4.0 Optimize: query very slow in the last few minutes before the optimize ends

2013-06-13 Thread Jeffery Wang
Does anyone know why the query is very slow in the last few minutes before the
optimize ends?

While Solr optimizes, I run a query in a loop (curl "query url", then sleep
one second) to check the query speed. Normally the query time is acceptable,
but it is always very slow in the last few minutes before the optimize ends.
The Solr index size is about 22G after optimizing.

The following is the query time cost, CPU, and IO usage. Throughout the
optimize process the IO is high, which is understandable.
time      query time(ms)  CPU %  r/s     w/s    rMB/s  wMB/s  IO %
7:30:24   12              89     156.44  0      16.4   0      94.06
7:30:25   18              91     157     0      15.35  0      98.1
7:30:26   9               91     194     0      19.62  0      96.1
7:30:27   14              38     352     0      38.17  0      100.1
7:30:28   30              77     205.94  16.83  20.17  4.02   98.51
7:30:30   101             88     396     0      45.99  0      90.7
7:30:31   11              90     120     0      11.34  0      97.5
7:30:32   38              89     262.38  0      28.03  0      96.24
7:30:33   11              78     68      17     4.89   4.93   99.9
7:30:34   9               29     201     0      20.16  0      100.3
7:30:35   9               87     181     0      17.27  0      94.3
7:30:52   16594           26     36      0      0.14   0      99.3
7:30:53   31              80     368     0      42.43  0      94.3
7:31:23   28575           41     35      21     0.37   2.36   95.9
7:31:27   2676            60     127     0      13.76  0      83.5
7:31:28   8               59     279     0      30.99  0      99.4
7:32:22   53399           31     81      39     0.74   2.63   99.5
7:32:23   11              54     155     0      16.46  0      99.6
7:32:24   9               47     63.37   4.95   4.18   0.02   98.42
7:32:25   9               25     34      0      0.13   0      98.8
7:32:26   8               27     30      0      0.12   0      99.9
7:33:28   60199           28     30      2      0.12   0.01   99.8


Thanks,
__
Jeffery Wang


Re: Best way to match umlauts

2013-06-13 Thread Jack Krupansky

Token filter vs. character filter is a key difference.

-- Jack Krupansky

-Original Message- 
From: adityab

Sent: Thursday, June 13, 2013 8:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Best way to match umlauts

This might be a dumb question, but can you please point me to some key
differences between the ASCIIFolding filter and a character filter using a
mapping file?
thanks
Aditya



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070398.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr4 cluster setup for high performance reads

2013-06-13 Thread Shawn Heisey
On 6/13/2013 7:51 PM, Utkarsh Sengar wrote:
> Sure, I will reduce the count and see how it goes. The problem I have is,
> after such a change, I need to reindex everything again, which again is
> slow and takes time (40-60 hours).

There should be no need to reindex after changing most things in
solrconfig.xml.  Changing cache sizes does not require it.  Most of the
time, reindexing is only required after changing schema.xml, but there
are a few changes you can make to schema that don't require it.

> Some queries are really bad, like this one:
> http://explain.solr.pl/explains/bzy034qi
> How can this be improved? I understand that there is something horribly
> wrong here, but not sure what points to look at (I've been using Solr for the
> last 20 days).

You are using a *LOT* of query clauses against your allText field in
that boost query.  I assume that allText is your largest field.  I'm not
really sure, but based on what we're seeing here, I bet that a bq
parameter doesn't get cached.  With some additional RAM available, this
might not be such a big problem.

> The query is simple, although it used edismax. I have shared an explain
> query above. Other than the query, this is my performance stats:
> 
> iostat -m 5 result: http://apaste.info/hjNV
> 
> top result: http://apaste.info/jlHN

You've got a pretty well-sustained iowait around ten percent.  You are
I/O bound.  You need more total RAM.  With indexing only happening once
a day, that doesn't sound like it's a factor.  If you are also having
problems with garbage collection because your heap is a little bit too
small, that makes all the other problems worse.

> For the initial training, I will hit solr 1.3M times and request 2000
> documents in each query. By the current speed (just one machine), it will
> take me ~20 days to do the initial training.

This is really mystifying.  There is no need to send a million plus
queries to warm your index.  A few dozen or a few hundred queries should
be all you need, and you don't need 2000 docs returned per query.  Go
with ten rows, or maybe a few dozen rows at most.  Because you're using
SSD, I'm not sure you need warming queries at all.

Thanks,
Shawn



Re: Solr4 cluster setup for high performance reads

2013-06-13 Thread Otis Gospodnetic
Hi,

Changing cache sizes doesn't require reindexing.
You have high IO Wait - waiting on your disks?  Ideally your index
will be cached.  Lower those caches, possibly reduce heap size, and
leave more RAM to the OS for caching and IO Wait will hopefully go
down.  I'd try with just -Xmx4g and see.

That python there - maybe you can kill the snake so it's not using the
CPU, seems to be eating a good % of it.

Oh, just looked at your query.  It's massive.  I couldn't quite see
the whole thing.  What exactly are you trying to do with such a long
query?  Maybe describe the high-level goal you have...

Otis
--
Solr & ElasticSearch Support - http://sematext.com/





On Thu, Jun 13, 2013 at 9:51 PM, Utkarsh Sengar  wrote:
> Otis,Shawn,
>
> Thanks for reply.
> You can find my schema.xml and solrconfig.xml here:
> https://gist.github.com/utkarsh2012/5778811
>
>
> To answer your questions:
>
> Those are massive caches.  Rethink their size.  More specifically,
> plug in some monitoring tool and see what you are getting out of them.
 Just today I looked at one Sematext client's caches - 200K entries,
> 0 evictions ==> needless waste of JVM heap.  So lower those numbers
> and increase only if you are getting evictions.
>
> Sure, I will reduce the count and see how it goes. The problem I have is,
> after such a change, I need to reindex everything again, which again is
> slow and takes time (40-60 hours).
>
> &debugQuery=true output will tell you something about timings, etc.
>
> Some queries are really bad, like this one:
> http://explain.solr.pl/explains/bzy034qi
> How can this be improved? I understand that there is something horribly
> wrong here, but not sure what points to look at (I've been using Solr for the
> last 20 days).
>
> consider edismax and qf param instead of that field copy stuff, info
> on zee Wiki
> Related back to my last point, how can such a query be improved? Maybe
> using qf?
>
> back to monitoring - what is your bottleneck?  The query looks
> simplistic.  Is it IO? Memory? CPU?  Share some graphs and let's look.
> The query is simple, although it used edismax. I have shared an explain
> query above. Other than the query, this is my performance stats:
>
> iostat -m 5 result: http://apaste.info/hjNV
>
> top result: http://apaste.info/jlHN
>
>
> How often do you index and commit, and how many documents each time?
> This is done by datastax's dse. I assume it is configurable via
> solrconfig.xml. The updates to cassandra are daily but all the documents
> are not updated.
>
> What is your query rate?
> For the initial training, I will hit solr 1.3M times and request 2000
> documents in each query. By the current speed (just one machine), it will
> take me ~20 days to do the initial training.
>
>
> Thanks,
> -Utkarsh
>
>
>
> On Thu, Jun 13, 2013 at 6:25 PM, Shawn Heisey  wrote:
>
>> On 6/13/2013 5:53 PM, Utkarsh Sengar wrote:
>> > *Problems:*
>> > The initial training pulls 2000 documents from solr to find the most
>> > probable matches and calculates score (PMI/NPMI). This query is extremely
>> > slow. Also, a regular query also takes 3-4 seconds.
>> > I am running solr currently on just one VM with 12GB RAM and 8GB of Heap
>> > space is allocated to solr, the block storage is an SSD.
>>
>> Normally, I would say that you should have as much RAM as your heap size
>> plus your index size, so with your 8GB heap and 15GB index, you'd want
>> 24GB total RAM.  With SSD, that requirement should not be quite so high,
>> but you might want to try 16GB or more.  Solr works much better on bare
>> metal than it does on virtual machines.
>>
>> I suspect that what might be happening here is that your heap is just a
>> little bit too small for the combination of your index size (both
>> document count and disk space), how you use Solr, and your config, so
>> your JVM is constantly doing garbage collections.
>>
>> > What is the suggested setup for this usecase?
>> > My guess is, setting up 4 solr nodes will help, but what is the suggested
>> > RAM/heap for this kind of data?
>> > And what are the recommended configuration (solrconfig.xml) where I *need
>> > to speed up reads*?
>>
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>> http://wiki.apache.org/solr/SolrPerformanceFactors
>>
>> Heap size requirements are hard to predict.  I can tell you that it's
>> highly unlikely that you will need cache sizes as large as you have
>> configured.  Start with the defaults and only increase them (by small
>> amounts) if your hitratio is not high enough.  If increasing the size
>> doesn't increase hitratio, there may be another problem.
>>
>> > Also, is there a way I can debug what is going on with solr internally?
>> As
>> > you can see, my queries are not that complex, so I don't need to debug my
>> > queries but just debug solr and see the troubled pieces in it.
>>
>> If you add &debugQuery=true to your URL, Solr will give you a lot of
>> extra information in the response.  One of the things that would be
important here is seeing how much time is spent in various components.

Re: Solr4 cluster setup for high performance reads

2013-06-13 Thread Utkarsh Sengar
Otis,Shawn,

Thanks for reply.
You can find my schema.xml and solrconfig.xml here:
https://gist.github.com/utkarsh2012/5778811


To answer your questions:

Those are massive caches.  Rethink their size.  More specifically,
plug in some monitoring tool and see what you are getting out of them.
 Just today I looked at one Sematext client's caches - 200K entries,
0 evictions ==> needless waste of JVM heap.  So lower those numbers
and increase only if you are getting evictions.

Sure, I will reduce the count and see how it goes. The problem I have is,
after such a change, I need to reindex everything again, which again is
slow and takes time (40-60 hours).

&debugQuery=true output will tell you something about timings, etc.

Some queries are really bad, like this one:
http://explain.solr.pl/explains/bzy034qi
How can this be improved? I understand that there is something horribly
wrong here, but not sure what points to look at (I've been using Solr for the
last 20 days).

consider edismax and qf param instead of that field copy stuff, info
on zee Wiki
Related back to my last point, how can such a query be improved? Maybe
using qf?

back to monitoring - what is your bottleneck?  The query looks
simplistic.  Is it IO? Memory? CPU?  Share some graphs and let's look.
The query is simple, although it used edismax. I have shared an explain
query above. Other than the query, this is my performance stats:

iostat -m 5 result: http://apaste.info/hjNV

top result: http://apaste.info/jlHN


How often do you index and commit, and how many documents each time?
This is done by datastax's dse. I assume it is configurable via
solrconfig.xml. The updates to cassandra are daily but all the documents
are not updated.

What is your query rate?
For the initial training, I will hit solr 1.3M times and request 2000
documents in each query. By the current speed (just one machine), it will
take me ~20 days to do the initial training.


Thanks,
-Utkarsh



On Thu, Jun 13, 2013 at 6:25 PM, Shawn Heisey  wrote:

> On 6/13/2013 5:53 PM, Utkarsh Sengar wrote:
> > *Problems:*
> > The initial training pulls 2000 documents from solr to find the most
> > probable matches and calculates score (PMI/NPMI). This query is extremely
> > slow. Also, a regular query also takes 3-4 seconds.
> > I am running solr currently on just one VM with 12GB RAM and 8GB of Heap
> > space is allocated to solr, the block storage is an SSD.
>
> Normally, I would say that you should have as much RAM as your heap size
> plus your index size, so with your 8GB heap and 15GB index, you'd want
> 24GB total RAM.  With SSD, that requirement should not be quite so high,
> but you might want to try 16GB or more.  Solr works much better on bare
> metal than it does on virtual machines.
>
> I suspect that what might be happening here is that your heap is just a
> little bit too small for the combination of your index size (both
> document count and disk space), how you use Solr, and your config, so
> your JVM is constantly doing garbage collections.
>
> > What is the suggested setup for this usecase?
> > My guess is, setting up 4 solr nodes will help, but what is the suggested
> > RAM/heap for this kind of data?
> > And what are the recommended configuration (solrconfig.xml) where I *need
> > to speed up reads*?
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
> http://wiki.apache.org/solr/SolrPerformanceFactors
>
> Heap size requirements are hard to predict.  I can tell you that it's
> highly unlikely that you will need cache sizes as large as you have
> configured.  Start with the defaults and only increase them (by small
> amounts) if your hitratio is not high enough.  If increasing the size
> doesn't increase hitratio, there may be another problem.
>
> > Also, is there a way I can debug what is going on with solr internally?
> As
> > you can see, my queries are not that complex, so I don't need to debug my
> > queries but just debug solr and see the troubled pieces in it.
>
> If you add &debugQuery=true to your URL, Solr will give you a lot of
> extra information in the response.  One of the things that would be
> important here is seeing how much time is spent in various components.
>
> > Also, I am new to Solr, so is there anything else I missed sharing
> > that would help debug the problem?
>
> Sharing the entire config, schema, examples of all fields from your
> indexed documents, and examples of your full queries would help.
> http://apaste.info
>
> How often do you index and commit, and how many documents each time?
> What is your query rate?
>
> Thanks,
> Shawn
>
>


-- 
Thanks,
-Utkarsh


Re: Solr4 cluster setup for high performance reads

2013-06-13 Thread Shawn Heisey
On 6/13/2013 5:53 PM, Utkarsh Sengar wrote:
> *Problems:*
> The initial training pulls 2000 documents from solr to find the most
> probable matches and calculates score (PMI/NPMI). This query is extremely
> slow. Also, a regular query also takes 3-4 seconds.
> I am running solr currently on just one VM with 12GB RAM and 8GB of Heap
> space is allocated to solr, the block storage is an SSD.

Normally, I would say that you should have as much RAM as your heap size
plus your index size, so with your 8GB heap and 15GB index, you'd want
24GB total RAM.  With SSD, that requirement should not be quite so high,
but you might want to try 16GB or more.  Solr works much better on bare
metal than it does on virtual machines.

I suspect that what might be happening here is that your heap is just a
little bit too small for the combination of your index size (both
document count and disk space), how you use Solr, and your config, so
your JVM is constantly doing garbage collections.

> What is the suggested setup for this usecase?
> My guess is, setting up 4 solr nodes will help, but what is the suggested
> RAM/heap for this kind of data?
> And what are the recommended configuration (solrconfig.xml) where I *need
> to speed up reads*?

http://wiki.apache.org/solr/SolrPerformanceProblems
http://wiki.apache.org/solr/SolrPerformanceFactors

Heap size requirements are hard to predict.  I can tell you that it's
highly unlikely that you will need cache sizes as large as you have
configured.  Start with the defaults and only increase them (by small
amounts) if your hitratio is not high enough.  If increasing the size
doesn't increase hitratio, there may be another problem.

> Also, is there a way I can debug what is going on with solr internally? As
> you can see, my queries are not that complex, so I don't need to debug my
> queries but just debug solr and see the troubled pieces in it.

If you add &debugQuery=true to your URL, Solr will give you a lot of
extra information in the response.  One of the things that would be
important here is seeing how much time is spent in various components.

> Also, I am new to Solr, so is there anything else I missed sharing
> that would help debug the problem?

Sharing the entire config, schema, examples of all fields from your
indexed documents, and examples of your full queries would help.
http://apaste.info

How often do you index and commit, and how many documents each time?
What is your query rate?

Thanks,
Shawn



Re: Solr4 cluster setup for high performance reads

2013-06-13 Thread Otis Gospodnetic
Hi,

Hard to tell, but here are some tips:

* Those are massive caches.  Rethink their size.  More specifically,
plug in some monitoring tool and see what you are getting out of them.
 Just today I looked at one Sematext client's caches - 200K entries,
0 evictions ==> needless waste of JVM heap.  So lower those numbers
and increase only if you are getting evictions.

* &debugQuery=true output will tell you something about timings, etc.

* consider edismax and qf param instead of that field copy stuff, info
on zee Wiki

* back to monitoring - what is your bottleneck?  The query looks
simplistic.  Is it IO? Memory? CPU?  Share some graphs and let's look.

Otis
--
Solr & ElasticSearch Support - http://sematext.com/
Performance Monitoring - http://sematext.com/spm/index.html




On Thu, Jun 13, 2013 at 7:53 PM, Utkarsh Sengar  wrote:
> Hello,
>
> I am evaluating Solr for indexing about 45M product catalog records. The
> catalog mainly contains title and description, which take most of the space
> (other attributes are brand, category, price, etc.)
>
> The data is stored in cassandra and I am using datastax's solr (DSE 3.0.2)
> which handles incremental updates. The column family I am indexing is about
> 50GB in size and solr.data's size is about 15GB for now.
>
> *Points of interest in solr config/schema:*
> 1. schema.xml has a copyField called allText which merges title and
> description.
> 2. solrconfig.xml has the following config:
>
> <directoryFactory ... class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
>
> <...Cache size="512" initialSize="512" autowarmCount="0"/>
> <...Cache size="100" initialSize="100" autowarmCount="10"/>
> <...Cache size="5000" initialSize="500" autowarmCount="0"/>
>
>
>
>
> *Relevancy:*
> Now, the default "text matching" does not suite our search needs, so I have
> implemented a wrapper around the Solr API which adds boost queries to the
> default solr query. For example:
>
> Original query: ipod
> Final Query: allText:ipod^1000, allText:apple^1000, allText:music^950 etc.
>
> So as you can see, I construct new query based on related keywords and
> assign score to those keywords based on relevance. This approach looks good
> and the results look relevant.
>
>
> But I am having issues with *Solr performance*.
>
> *Problems:*
> The initial training pulls 2000 documents from solr to find the most
> probable matches and calculates score (PMI/NPMI). This query is extremely
> slow. Also, a regular query also takes 3-4 seconds.
> I am running solr currently on just one VM with 12GB RAM and 8GB of Heap
> space is allocated to solr, the block storage is an SSD.
>
> What is the suggested setup for this usecase?
> My guess is, setting up 4 solr nodes will help, but what is the suggested
> RAM/heap for this kind of data?
> And what are the recommended configuration (solrconfig.xml) where I *need
> to speed up reads*?
>
> Also, is there a way I can debug what is going on with Solr internally? As
> you can see, my queries are not that complex, so I don't need to debug my
> queries but just debug Solr and see the troubled pieces in it.
>
> Also, I am new to Solr, so is there anything else I missed sharing that
> would help debug the problem?
>
> --
> Thanks,
> -Utkarsh


Re: analyzer for Code

2013-06-13 Thread Otis Gospodnetic
Gian,

Lucene in Action has a case study from Krugle about their analysis for a
code search engine, if you want to look there.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Jun 13, 2013 at 4:19 AM, Gian Maria Ricci
wrote:

> I did a little search around and did not find anything interesting. Anyone
> know if any analyzers exist to better index source code (e.g. C#, C++, Java,
> etc.)?
>
> The standard analyzer is quite good, but I wish to know if there are some
> more specific analyzers that can do a better indexing. E.g., I did a little
> try with C# and the full class name was indexed without splitting on dots,
> so MyLib.Helpers.Myclass becomes one token, and when I search for MyClass I
> did not find matches.
>
> Thanks in advance.
>
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949


Re: Suggest and Filtering

2013-06-13 Thread Otis Gospodnetic
Hi,

I think you are talking about wanting instant search?

See https://github.com/fergiemcdowall/solrstrap

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Thu, Jun 13, 2013 at 7:43 PM, Brendan Grainger
 wrote:
> Hi Solr Guru's
>
> I am trying to implement auto suggest where solr would suggest several
> phrases that would return results as the user types in a query (as distinct
> from autocomplete). e.g. say the user starts typing 'br' and we have
> documents that contain "brake pads" and "left disc brake", solr would
> suggest both of those phrases with "brake pads" first. I also want to only
> look at documents that match a given filter query. So say I have a bunch of
> documents for a toyota cressida that contain the bi-gram "brake pads",
> while the documents for a honda accord don't have any brake pad articles.
> If the user is filtering on the honda accord I wouldn't want "brake pads"
> as a suggestion.
>
> Right now, I've played with the suggest component and using faceting.
>
> Any thoughts?
>
> Thanks
> Brendan
>
> --
> Brendan Grainger
> www.kuripai.com


Re: Best way to match umlauts

2013-06-13 Thread adityab
This might be a dumb question, but can you please point me to some key
differences between the ASCIIFolding filter and a character filter using a
mapping file?
thanks
Aditya



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070398.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr4 cluster setup for high performance reads

2013-06-13 Thread Utkarsh Sengar
Hello,

I am evaluating Solr for indexing about 45M product catalog records. The
catalog mainly contains title and description, which take most of the space
(other attributes are brand, category, price, etc.)

The data is stored in cassandra and I am using datastax's solr (DSE 3.0.2)
which handles incremental updates. The column family I am indexing is about
50GB in size and solr.data's size is about 15GB for now.

*Points of interest in solr config/schema:*
1. schema.xml has a copyField called allText which merges title and
description.
2. solrconfig.xml has the following config:

<directoryFactory ... class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>

<...Cache size="512" initialSize="512" autowarmCount="0"/>
<...Cache size="100" initialSize="100" autowarmCount="10"/>
<...Cache size="5000" initialSize="500" autowarmCount="0"/>

*Relevancy:*
Now, the default "text matching" does not suite our search needs, so I have
implemented a wrapper around the Solr API which adds boost queries to the
default solr query. For example:

Original query: ipod
Final Query: allText:ipod^1000, allText:apple^1000, allText:music^950 etc.

So as you can see, I construct new query based on related keywords and
assign score to those keywords based on relevance. This approach looks good
and the results look relevant.
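
A stripped-down sketch of what such a wrapper can look like in SolrJ; the
related-keyword map and weights below are invented for illustration, and a
real implementation would compute them:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;

public class BoostedQueryBuilder {
  public static void main(String[] args) {
    // Related keywords and weights for the user query "ipod" (made-up values).
    Map<String, Integer> related = new LinkedHashMap<String, Integer>();
    related.put("ipod", 1000);
    related.put("apple", 1000);
    related.put("music", 950);

    StringBuilder expanded = new StringBuilder();
    for (Map.Entry<String, Integer> e : related.entrySet()) {
      if (expanded.length() > 0) expanded.append(" OR ");
      expanded.append("allText:").append(e.getKey()).append('^').append(e.getValue());
    }
    // Yields: allText:ipod^1000 OR allText:apple^1000 OR allText:music^950
    SolrQuery q = new SolrQuery(expanded.toString());
    System.out.println(q.getQuery());
  }
}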


But I am having issues with *Solr performance*.

*Problems:*
The initial training pulls 2000 documents from solr to find the most
probable matches and calculates score (PMI/NPMI). This query is extremely
slow. Also, a regular query also takes 3-4 seconds.
I am running solr currently on just one VM with 12GB RAM and 8GB of Heap
space is allocated to solr, the block storage is an SSD.

What is the suggested setup for this usecase?
My guess is, setting up 4 solr nodes will help, but what is the suggested
RAM/heap for this kind of data?
And what are the recommended configuration (solrconfig.xml) where I *need
to speed up reads*?

Also, is there a way I can debug what is going on with Solr internally? As
you can see, my queries are not that complex, so I don't need to debug my
queries but just debug Solr and see the troubled pieces in it.

Also, I am new to Solr, so is there anything else I missed sharing that
would help debug the problem?

-- 
Thanks,
-Utkarsh


Suggest and Filtering

2013-06-13 Thread Brendan Grainger
Hi Solr Guru's

I am trying to implement auto suggest where solr would suggest several
phrases that would return results as the user types in a query (as distinct
from autocomplete). e.g. say the user starts typing 'br' and we have
documents that contain "brake pads" and "left disc brake", solr would
suggest both of those phrases with "brake pads" first. I also want to only
look at documents that match a given filter query. So say I have a bunch of
documents for a toyota cressida that contain the bi-gram "brake pads",
while the documents for a honda accord don't have any brake pad articles.
If the user is filtering on the honda accord I wouldn't want "brake pads"
as a suggestion.

Right now, I've played with the suggest component and using faceting.
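
One concrete shape for the faceting side is facet.prefix over a shingled
phrase field, with the vehicle filter applied as fq; a SolrJ sketch with
assumed field names:

import org.apache.solr.client.solrj.SolrQuery;

public class SuggestViaFacets {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                      // no documents, only facet counts
    q.setFacet(true);
    q.addFacetField("phrases");        // e.g. a shingled (bi-gram) copy field
    q.setFacetPrefix("br");            // what the user has typed so far
    q.setFacetLimit(10);
    q.addFilterQuery("model:accord");  // suggestions come only from matching docs
    // Each facet value returned under "phrases" is a candidate suggestion,
    // already restricted to documents that pass the filter.
    System.out.println(q);
  }
}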

Any thoughts?

Thanks
Brendan

-- 
Brendan Grainger
www.kuripai.com


Re: Debugging Solr XSL

2013-06-13 Thread Upayavira
Use command line Xalan, debug the stylesheet outside of Solr. You can
save the XML output to disk, and then transform that with Xalan.
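
For a quick offline loop, the JAXP API bundled with the JDK can run the same
transform; a sketch with placeholder file names:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class OfflineXslt {
  public static void main(String[] args) throws Exception {
    // Transform a saved Solr response outside Solr, so any XSL error
    // surfaces directly instead of as a stack trace in the Solr output.
    Transformer t = TransformerFactory.newInstance()
        .newTransformer(new StreamSource("example.xsl"));
    t.transform(new StreamSource("solr-response.xml"),
                new StreamResult("out.html"));
  }
}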

Upayavira

On Thu, Jun 13, 2013, at 10:45 PM, O. Olson wrote:
> Hi,
> 
>   I am attempting to transform the XML output of Solr using the
> XsltResponseWriter http://wiki.apache.org/solr/XsltResponseWriter to
> HTML. This works, but I am wondering if there is a way for me to debug
> my creation of XSL. If there is any problem in the XSL, you simply get
> a stack trace in the Solr output.
>
>   For example, in adding an HTML link tag to my XSL, I forgot the
> closing, i.e. I did “>” instead of “/>”. I would just get a stack
> trace, nothing to tell me what I did wrong. Another time I had a
> template match that was very specific. I expected it to have precedence
> over the more general template. It did not, and I had no clue. I
> ultimately put in a priority to get my expected value.
>
>   I am new to XSL. Is there any other free tool that would help me
> debug XSL that Solr would accept? I have Visual Studio (full version),
> which has XSLT debugging, but I have not tried it yet. Would Solr
> accept as valid what Visual Studio OKs?
>
>   I’m sorry, I am new to this. I’d be grateful for any pointers.
> 
> Thank you,
> O.O.
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Debugging-Solr-XSL-tp4070368.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sorting by field is slow

2013-06-13 Thread Shane Perry
I've dug through the code and have narrowed the delay down
to TopFieldCollector$OneComparatorNonScoringCollector.setNextReader() at
the point where the comparator's setNextReader() method is called (line 98
in the lucene_solr_4_3 branch).  That line is actually two method calls so
I'm not yet certain which path is the cause.  I'll continue to dig through
the code but am on thin ice so input would be great.

Shane

On Thu, Jun 13, 2013 at 7:56 AM, Shane Perry  wrote:

> Erick,
>
> We do have soft commits turned on.  Initially, autoCommit was set at 15000
> and autoSoftCommit at 1000.  We did up those to 120 and 60
> respectively.  However, since the core in question is a slave, we don't
> actually do writes to the core but rely on replication only to populate the
> index.  In this case wouldn't autoCommit and autoSoftCommit essentially be
> no-ops?  I thought I had pulled out all hard commits but a double check
> shows one instance where it still occurs.
>
> Thanks for your time.
>
> Shane
>
> On Thu, Jun 13, 2013 at 5:19 AM, Erick Erickson 
> wrote:
>
>> Shane:
>>
>> You've covered all the config stuff that I can think of. There's one
>> other possibility. Do you have the soft commits turned on and are
>> they very short? Although soft commits shouldn't invalidate any
>> segment-level caches (but I'm not sure whether the sorting buffers
>> are low-level or not).
>>
>> About the only other thing I can think of is that you're somehow
>> doing hard commits from, say, the client but that's really
>> stretching.
>>
>> All I can really say at this point is that this isn't a problem I've seen
>> before, so it's _likely_ some innocent-seeming config has changed.
>> I'm sure it'll be obvious once you find it ...
>>
>> Erick
>>
>> On Wed, Jun 12, 2013 at 11:51 PM, Shane Perry  wrote:
>> > Erick,
>> >
>> > I agree, it doesn't make sense.  I manually merged the solrconfig.xml
>> from
>> > the distribution example with my 3.6 solrconfig.xml, pulling out what I
>> > didn't need.  There is the possibility I removed something I shouldn't
>> have
>> > though I don't know what it would be.  Minus removing the dynamic
>> fields, a
>> > custom tokenizer class, and changing all my fields to be stored, the
>> > schema.xml file should be the same as well.  I'm not currently in the
>> > position to do so, but I'll double check those two files.  Finally, the
>> > data was re-indexed when I moved to 4.3.
>> >
>> > My statement about field values wasn't stated very well.  What I meant
>> is
>> > that the 'text' field has more unique terms than some of my other
>> fields.
>> >
>> > As for this being an edge case, I'm not sure why it would manifest
>> itself
>> > in 4.3 but not in 3.6 (short of me having a screwy configuration
>> setting).
>> >  If I get a chance, I'll see if I can duplicate the behavior with a
>> small
>> > document count in a sandboxed environment.
>> >
>> > Shane
>> >
>> > On Wed, Jun 12, 2013 at 5:14 PM, Erick Erickson <
>> erickerick...@gmail.com>wrote:
>> >
>> >> This doesn't make much sense, particularly the fact
>> >> that you added first/new searchers. I'm assuming that
>> >> these are sorting on the same field as your slow query.
>> >>
>> >> But sorting on a text field for which
>> >> "Overall, the values of the field are unique"
>> >> is a red-flag. Solr doesn't sort on fields that have
>> >> more than one term, so you might as well use a
>> >> string field and be done with it, it's possible you're
>> >> hitting some edge case.
>> >>
>> >> Did you just copy your 3.6 schema and configs to
>> >> 4.3? Did you re-index?
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Wed, Jun 12, 2013 at 5:11 PM, Shane Perry 
>> wrote:
>> >> > Thanks for the responses.
>> >> >
>> >> > Setting first/newSearcher had no noticeable effect.  I'm sorting on a
>> >> > stored/indexed field named 'text' who's fieldType is solr.TextField.
>> >> >  Overall, the values of the field are unique. The JVM is only using
>> about
>> >> > 2G of the available 12G, so no OOM/GC issue (at least on the
>> surface).
>> >>  The
>> >> > server is question is a slave with approximately 56 million
>> documents.
>> >> >  Additionally, sorting on a field of the same type but with
>> significantly
>> >> > less uniqueness results quick response times.
>> >> >
>> >> > The following is a sample of *debugQuery=true* for a query which
>> returns
>> >> 1
>> >> > document:
>> >> >
>> >> > (debugQuery timing output; element names were stripped by the archive)
>> >> > total time: 61458.0 ms; component times: 61452.0, 0.0, 0.0, 0.0, 0.0, 6.0
>> >> >
>> >> >
>> >> > -- Update --
>> >> >
>> >> > Out of desperation, I turned off replication by commenting out the
>> >> > *<lst name="slave">* element in the replication requestHandler block.
>> >> >  After
>>  After
>> >> > restarting tomcat I was surprised to find that the replication admin
>> UI
>> >> > still reported the core as rep

Re: Solr - Get DocID of search result

2013-06-13 Thread Chris Hostetter

Your phrasing of the question may be convoluting things -- you referred to 
"DocID" but it's not clear if you mean...

 * the low level internal lucene doc id
 * the uniqueKey field of your schema.xml
 * some identifier whose provenance you don't care about.

In the first case, you can use the doc transformer Jack mentioned, and you 
can even sort on the internal lucene id using "_docid_" but you can't 
really filter on it.

In the context of your specific problem the internal lucene doc id 
wouldn't help you anyway, since internal lucene doc ids can change as 
segments get merged and deletes get flushed.

You can however use either of the latter two cases, along with an fq, to 
implement "cursor" style logic to ensure you never get the same document 
more than once.  Instead of increasing the "start" param, you just 
specify an fq param that filters on your id field using a range 
query, and you continually increase (or decrease) the boundary on the 
range based on the last document fetched.

using your previous example...
: query   return
: start=0&rows=1   A
: start=1&rows=1   B
: start=2&rows=1   C

you would instead do...

start=0&rows=1&sort=id+asc  A
start=0&rows=1&sort=id+asc&fq=id:{A TO *]   B
start=0&rows=1&sort=id+asc&fq=id:{B TO *]   C

If you choose your id field such that the ids were always increasing (ie: 
time based) then you could also be certain that you were always able to 
fetch all documents (ie: you would never miss a doc because you were 
already "past" it's place in the ordered list of docs)

-Hoss


Debugging Solr XSL

2013-06-13 Thread O. Olson
Hi,

I am attempting to transform the XML output of Solr using the
XsltResponseWriter http://wiki.apache.org/solr/XsltResponseWriter to HTML.
This works, but I am wondering if there is a way for me to debug my creation
of XSL. If there is any problem in the XSL, you simply get a stack trace in
the Solr output.

For example, in adding an HTML link tag to my XSL, I forgot the closing, i.e.
I did “>” instead of “/>”. I would just get a stack trace, nothing to tell
me what I did wrong. Another time I had a template match that was very
specific. I expected it to have precedence over the more general template.
It did not, and I had no clue. I ultimately put in a priority to get my
expected value.

I am new to XSL. Is there any other free tool that would help me debug XSL
that Solr would accept? I have Visual Studio (full version), which has XSLT
debugging, but I have not tried it yet. Would Solr accept as valid what
Visual Studio OKs?

I’m sorry, I am new to this. I’d be grateful for any pointers.

Thank you,
O.O.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Debugging-Solr-XSL-tp4070368.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Update multiple documents in one query

2013-06-13 Thread Jack Krupansky
I haven't heard any mention of it, but it seems like a reasonable 
enhancement.


There have been cases where people want to do things like add a new value to 
every document.


I'll have to check into how easy it is to perform a query from an update 
processor.


-- Jack Krupansky

-Original Message- 
From: Siamak Kolahi

Sent: Thursday, June 13, 2013 4:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Update multiple documents in one query

Thanks Jack.

I currently have written a script which does that - effectively
retrieving all those documents and updating them one by one, atomically.
But I was hoping Solr had a more efficient implementation internally.

Is there any thinking about implementing such a feature in the future?

Siamak


On Thu, Jun 13, 2013 at 4:21 PM, Jack Krupansky 
wrote:



That is not a feature available in Solr.

You can update a full document or do a partial update of a single document
based on its unique key, and you can update a batch of documents using
those two techniques.

You probably could implement custom code to do it.

Maybe even using a script update processor.

Or, just do it all in client code.

The latter would probably be the most straight forward.

A script update processor might actually be semi-reasonable as well, but
you're going to have to get down and dirty to do it.

-- Jack Krupansky

-Original Message- From: Siamak Kolahi
Sent: Thursday, June 13, 2013 3:55 PM
To: solr-user@lucene.apache.org
Subject: Update multiple documents in one query


Hi folks,

I am trying to update multiple documents (assume q=id:*) and add a field to
all of them. Is this possible?
If yes, what would be the syntax?

I am using the json update interface - /update/json ...

Thanks,
Siamak





Re: Update multiple documents in one query

2013-06-13 Thread Siamak Kolahi
Thanks Jack.

I currently have written a script which does that - effectively
retrieving all those documents and updating them one by one, atomically.
But I was hoping Solr had a more efficient implementation internally.

Is there any thinking about implementing such a feature in the future?

Siamak


On Thu, Jun 13, 2013 at 4:21 PM, Jack Krupansky wrote:

> That is not a feature available in Solr.
>
> You can update a full document or do a partial update of a single document
> based on its unique key, and you can update a batch of documents using
> those two techniques.
>
> You probably could implement custom code to do it.
>
> Maybe even using a script update processor.
>
> Or, just do it all in client code.
>
> The latter would probably be the most straight forward.
>
> A script update processor might actually be semi-reasonable as well, but
> you're going to have to get down and dirty to do it.
>
> -- Jack Krupansky
>
> -Original Message- From: Siamak Kolahi
> Sent: Thursday, June 13, 2013 3:55 PM
> To: solr-user@lucene.apache.org
> Subject: Update multiple documents in one query
>
>
> Hi folks,
>
> I am trying to update multiple documents (assume q=id:*) and add a field to
> all of them. Is this possible?
> If yes, what would be the syntax?
>
> I am using the json update interface - /update/json ...
>
> Thanks,
> Siamak
>


Re: Best way to match umlauts

2013-06-13 Thread Steve Rowe
On Jun 13, 2013, at 3:48 PM, Jack Krupansky  wrote:
> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII"/>

The mapping attribute above is missing the .txt file extension:

  mapping="mapping-FoldToASCII.txt"

Steve



Re: Update multiple documents in one query

2013-06-13 Thread Jack Krupansky

That is not a feature available in Solr.

You can update a full document or do a partial update of a single document 
based on its unique key, and you can update a batch of documents using those 
two techniques.


You probably could implement custom code to do it.

Maybe even using a script update processor.

Or, just do it all in client code.

The latter would probably be the most straight forward.

A script update processor might actually be semi-reasonable as well, but 
you're going to have to get down and dirty to do it.
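
For reference, the client-code route is only a few lines of SolrJ; a sketch
assuming the updateLog is enabled and the fields are stored (the new field
name and value are placeholders, and a real version would page through the
results):

import java.util.Collections;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddFieldToMatches {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("id:*");
    q.setRows(1000);  // a single batch for brevity; page in real code
    for (SolrDocument d : server.query(q).getResults()) {
      SolrInputDocument update = new SolrInputDocument();
      update.addField("id", d.getFieldValue("id"));
      // Atomic update: "set" adds or replaces just this field on the doc.
      update.addField("newfield", Collections.singletonMap("set", "somevalue"));
      server.add(update);
    }
    server.commit();
  }
}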


-- Jack Krupansky

-Original Message- 
From: Siamak Kolahi

Sent: Thursday, June 13, 2013 3:55 PM
To: solr-user@lucene.apache.org
Subject: Update multiple documents in one query

Hi folks,

I am trying to update multiple documents (assume q=id:*) and add a field to
all of them. Is this possible?
If yes, what would be the syntax?

I am using the json update interface - /update/json ...

Thanks,
Siamak 



Update multiple documents in one query

2013-06-13 Thread Siamak Kolahi
Hi folks,

I am trying to update multiple documents (assume q=id:*) and add a field to
all of them. Is this possible?
If yes, what would be the syntax?

I am using the json update interface - /update/json ...

Thanks,
Siamak


Re: Best way to match umlauts

2013-06-13 Thread Jack Krupansky
Yes, but it's the third best choice. It's a token filter, while the issue at 
hand is a character filtering issue.


A second best choice would be to map for full ASCII folding at the character 
level:


<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII"/>


-- Jack Krupansky

-Original Message- 
From: adityab

Sent: Thursday, June 13, 2013 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Best way to match umlauts

Just to confirm: even "solr.ASCIIFoldingFilterFactory" should serve the
purpose, am I correct?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070317.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: analyzer for Code

2013-06-13 Thread Steve Rowe
Hi Gian Maria,

OpenGrok  has a bunch of JFlex-based 
computer language tokenizers for Lucene: 
.
  Not sure how much work it would be to use them in another project, though.

There's a bunch of JFlex grammars listed here, though most (almost all?) are 
not integrated with Lucene: 

  


Looks like at least the Jsyntaxpane and RSyntaxTextArea projects have multiple 
programming language lexers.

Steve

On Jun 13, 2013, at 1:40 PM, Gian Maria Ricci  wrote:

> Thanks for the suggestions, I'll try the WordDelimiterFilterFactory. My
> aim is not to have a perfect analysis, just a way to quickly search for
> words in the whole history of a codebase. :)
>  
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>
>  
>  
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Thursday, June 13, 2013 1:24 PM
> To: solr-user@lucene.apache.org; Gian Maria Ricci
> Subject: Re: analyzer for Code
>  
> Well, WordDelimiterFilterFactory would split on the punctuation, so
> you could add it to the analyzer chain along with StandardAnalyzer.
>  
> You could use one of the regex filters to break up tokens that make it
> through the analyzer as you see fit.
>  
> But in general, this will be a bunch of compromises since programming
> languages are, shall we say, not standard 
>  
> Best
> Erick
>  
> 
> On Thu, Jun 13, 2013 at 4:19 AM, Gian Maria Ricci  
> wrote:
> I did a little search around and did not find anything interesting. Anyone
> know if any analyzers exist to better index source code (e.g. C#, C++, Java,
> etc.)?
>
> The standard analyzer is quite good, but I wish to know if there are some
> more specific analyzers that can do a better indexing. E.g., I did a little
> try with C# and the full class name was indexed without splitting on dots,
> so MyLib.Helpers.Myclass becomes one token, and when I search for MyClass I
> did not find matches.
>  
> Thanks in advance.
>  
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>
>  
>  



Re: The 'threads' parameter in DIH - SOLR 4.3.0

2013-06-13 Thread Shawn Heisey

On 6/13/2013 12:08 PM, bbarani wrote:

I see that the threads parameter has been removed from DIH in all versions
starting with SOLR 4.x. Can someone let me know the best way to initiate
indexing in multi-threaded mode when using DIH now? Is there a way to do that?


That parameter was removed because it didn't work right, and there was 
no apparent way to fix it.  The change that went into a later 3.6 
version was a bandaid, not a fix.  I don't know all the details.


There's no way to get multithreading with DIH directly, but you can do 
it indirectly:


Create multiple request handlers with different names, such as 
/dataimport1, /dataimport2, etc.  Configure each handler with settings 
that will pull part of your data source.  Start them so they run 
concurrently.
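
A rough Java sketch of kicking those handlers off concurrently; the handler
names and core URL just mirror the example above:

import java.io.InputStream;
import java.net.URL;

public class ParallelDih {
  public static void main(String[] args) {
    final String base = "http://localhost:8983/solr/collection1";
    for (int i = 1; i <= 4; i++) {
      final String handler = "/dataimport" + i;
      new Thread(new Runnable() {
        public void run() {
          try {
            // Each handler is configured to import its own slice of the data.
            InputStream in =
                new URL(base + handler + "?command=full-import").openStream();
            in.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }).start();
    }
  }
}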


Depending on your environment, it may be easier to just write a 
multi-threaded indexing application using the Solr API for your language 
of choice.


Thanks,
Shawn



Re: shardkey

2013-06-13 Thread Joel Bernstein
Also you might want to check this blog post, just went up today.

http://searchhub.org/2013/06/13/solr-cloud-document-routing/


On Wed, Jun 12, 2013 at 2:18 PM, James Thomas  wrote:

> This page has some good information on custom document routing:
>
> http://docs.lucidworks.com/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
>
>
> -Original Message-
> From: Rishi Easwaran [mailto:rishi.easwa...@aol.com]
> Sent: Wednesday, June 12, 2013 1:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: shardkey
>
> From my understanding:
> In SOLR cloud the CompositeIdDocRouter uses a hash-based doc router. The
> CompositeId router is the default if your numShards>1 on collection creation.
> The CompositeId router generates a hash using the uniqueKey defined in your
> schema.xml to route your documents to a dedicated shard.
>
> You can use select?q=xyz&shard.keys=uniquekey to focus your search to hit
> only the shard that has your shard.key
>
>
>
>  Thanks,
>
> Rishi.
>
>
>
> -Original Message-
> From: Joshi, Shital 
> To: 'solr-user@lucene.apache.org' 
> Sent: Wed, Jun 12, 2013 10:01 am
> Subject: shardkey
>
>
> Hi,
>
> We are using Solr 4.3.0 SolrCloud (5 shards, 10 replicas). I have couple
> questions on shard key.
>
> 1. Looking at the admin GUI, how do I know which field is being
> used for shard key.
> 2. What is the default shard key used?
> 3. How do I override the default shard key?
>
> Thanks.
>
>
>


-- 
Joel Bernstein
Professional Services LucidWorks


Re: Best way to match umlauts

2013-06-13 Thread adityab
Just to confirm: even "solr.ASCIIFoldingFilterFactory" should solve the
purpose.
Am I correct?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070317.html
Sent from the Solr - User mailing list archive at Nabble.com.


The 'threads' parameter in DIH - SOLR 4.3.0

2013-06-13 Thread bbarani
I see that the threads parameter has been removed from DIH in all versions
starting with Solr 4.x. Can someone let me know the best way to initiate indexing
in multi-threaded mode when using DIH now? Is there a way to do that?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-threads-parameter-in-DIH-SOLR-4-3-0-tp4070315.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: analyzer for Code

2013-06-13 Thread Gian Maria Ricci
Thanks for the suggestions, I'll try the WordDelimiterFilterFactory. My
aim is not to have a perfect analysis, just a way to quickly search for words
in the whole history of a codebase. :)

 

--

Gian Maria Ricci

Mobile: +39 320 0136949

 

   
 

 

 

From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, June 13, 2013 1:24 PM
To: solr-user@lucene.apache.org; Gian Maria Ricci
Subject: Re: analyzer for Code

 

Well, WordDelimiterFilterFactory would split on the punctuation, so

you could add it to the analyzer chain along with StandardAnalyzer.

 

You could use one of the regex filters to break up tokens that make it

through the analyzer as you see fit.

 

But in general, this will be a bunch of compromises since programming

languages are, shall we say, not standard 

 

Best

Erick

 

On Thu, Jun 13, 2013 at 4:19 AM, Gian Maria Ricci <alkamp...@nablasoft.com> wrote:

I did a little search around and did not find anything interesting. Anyone
know if some analyzers exists to better index source code (es C#, C++. Java
etc)?

 

Standard analyzer is quite good, but I wish to know if there are some more
specific analyzers that can do a better indexing. Es I did a little try with
C# and the full class name was indexed without splitting by dots. So
MyLib.Helpers.Myclass becomes one token and when I search for MyClass I did
not find matches. 

 

Thanks in advance.

 

--

Gian Maria Ricci

Mobile: +39 320 0136949  

 


 

 



 



Re: Solr 5.0 NOT released - tweet is incorrect

2013-06-13 Thread Shawn Heisey

On 6/13/2013 10:02 AM, Alexandre Rafalovitch wrote:

P.s. As an aside, being relatively new, I do wonder what kind of
event/discussion will trigger version 5 branch-off. I guess it would
actually be more of a Lucene decision these days.


I foresee two likely reasons for a branch-off.  Based on what I've 
noticed over the last year or so, the latter reason is probably more 
likely.  I don't think that the decision to branch will depend much on 
whether the improvements are in Solr or Lucene, mostly because they tend 
to improve together, not separately.



1) Trunk gets really awesome features X, Y, and Z that aren't included 
in 4x, and it becomes stable enough that the committers believe the 
entire codebase is almost ready for ALPHA.  This reason is more likely 
if those features are major performance enhancements.



2) Backporting fixes from trunk to branch_4x becomes too difficult due 
to code divergence.  Right now, almost all changes backport with no 
difficulty, and even when there are conflicts, they tend to be easy to 
resolve.


Over time, various factors will cause trunk code to become very 
different from 4x, and backporting will become a nightmare.  Examples:


A) Pieces of code that get deprecated in 4x are removed completely from 
trunk.  B) Trunk requires Java 7 and 4x must work with Java 6.  C) There 
are differences that appear because some changes are too major and not 
tested enough to include in the stable branch.


Thanks,
Shawn



Re: Solr 5.0 NOT released - tweet is incorrect

2013-06-13 Thread Alexandre Rafalovitch
Should we worry about Stock Options being shorted?

Just kidding. :-)

Regards,
   Alex.
P.s. As an aside, being relatively new, I do wonder what kind of
event/discussion will trigger version 5 branch-off. I guess it would
actually be more of a Lucene decision these days.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Jun 13, 2013 at 11:58 AM, Shawn Heisey  wrote:
> Someone tweeted with the #solr hashtag that Solr 5.0 is released.
>
> https://twitter.com/nadr
>
> This is not correct.  At this time, version 4.3.0 is the current release.  I
> expect that the announcement for 4.3.1 will appear within the next couple of
> days.
>
> Right now, there is no timeframe for the 5.0 release.  The trunk 'branch' is
> very similar to 4.x.  There's no compelling reason to work on a 5.0 version
> yet.
>
> Thanks,
> Shawn


Re: Best way to match umlauts

2013-06-13 Thread jimtronic
Thanks! Sorry for the basic question, but I was having trouble finding the
results through google.

On Thu, Jun 13, 2013 at 10:39 AM, Jack Krupansky-2 [via Lucene] <
ml-node+s472066n4070262...@n3.nabble.com> wrote:

>  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>
> -- Jack Krupansky
>
> -Original Message-
> From: jimtronic
> Sent: Thursday, June 13, 2013 11:31 AM
> To: [hidden email] 
> Subject: Best way to match umlauts
>
> I'm trying to make Brüno come up in my results when the user types in
> "Bruno".
>
> What's the best way to accomplish this?
>
> Using Solr 4.2
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256p4070273.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr - Get DocID of search result

2013-06-13 Thread vrparekh
Thanks Jack, below is the actual problem.

Suppose currently 4 records are in the Solr engine: A, B, C and D.

query   return

start=0&rows=1   A
start=1&rows=1   B
start=2&rows=1   C

Now at this point, 1 new record, "AA", has been inserted into Solr, so the documents
in Solr are AA, A, B, C and D.

Now if I query

   start=3&rows=1, it will return "C" (a second time, so I have A, B, C, C),
while I am expecting "D".



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Get-DocID-of-search-result-tp4070253p4070271.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 5.0 NOT released - tweet is incorrect

2013-06-13 Thread Shawn Heisey

Someone tweeted with the #solr hashtag that Solr 5.0 is released.

https://twitter.com/nadr

This is not correct.  At this time, version 4.3.0 is the current 
release.  I expect that the announcement for 4.3.1 will appear within 
the next couple of days.


Right now, there is no timeframe for the 5.0 release.  The trunk 
'branch' is very similar to 4.x.  There's no compelling reason to work 
on a 5.0 version yet.


Thanks,
Shawn


Re: Solr - Get DocID of search result

2013-06-13 Thread Alexandre Rafalovitch
I think the problem is the desire for the idempotent search results
across paging calls. Not sure if that explains it any better than the
original poster though. :-)

Basically, if the repeated search gets different documents returned,
the offsets become somewhat problematic. Specifically, page 2 may
skip some of the documents if there were deletions that affected
page 1, or repeat documents if there were additions.

I swore there was an article explaining this whole problem and
possible solution, but I can't find it now.

I did find the pageDoc reference that seems related, but I am not sure
if it was actually implemented (JIRA is confusing on this one):
http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore
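
One common workaround (just a sketch, not something from this thread; it
assumes an indexed_at timestamp field populated at index time) is to pin
every page to the index state as of the first request with a filter query:

   q=foo&fq=indexed_at:[* TO 2013-06-13T12:00:00Z]&start=20&rows=10

where the upper bound is captured when page 1 is fetched, so documents
added later cannot shift the offsets.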

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Jun 13, 2013 at 11:34 AM, Jack Krupansky
 wrote:
> You can use a Solr transformer to get the Lucene docID in the fl parameter:
>
>   &fl=id,[docid],score,my-field,...
>
> But... you can't use the Lucene docId in a query.
>
> Relevancy and sorting, not to mention updating of existing documents, can
> change the order of results so that docId is not a good indicator of
> document "order".
>
> But, rather than focus prematurely on a solution, what exactly is the
> problem you are trying to solve? What exactly is duplicated?
>
> -- Jack Krupansky
>
> -Original Message- From: vrparekh
> Sent: Thursday, June 13, 2013 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: Solr - Get DocID of search result
>
>
> Hello All,
>
> How can I get docID of result from solr?
>
> What I am doing currently is,
>
> I do search request in solr.
>
> I get certain records (Say 10).
>
>solrurl/start=0&rows=10
>
> Now, again I do search request with below
>
>solrurl/start=10&rows=10
>
> So i get next 10 records.
>
> Now new records are inserted in solr (Say 10 records).
> and Now If I do request again by
>solrurl/start=20&rows=10
>
> So I might get repeated records.
>
> So if I have docID of than I can query by less than that docID.
>
> So is it possible to get docID?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Get-DocID-of-search-result-tp4070253.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Query time exceeds timeout specified by parameter timeAllowed

2013-06-13 Thread Christof Doll
Hello,

I just gave the parameter timeAllowed a try and noticed that in some cases
the actual query time exceeds the timeout specified by the timeAllowed
parameter; e.g., having set timeAllowed to 100, the actual query time is
300ms. Unfortunately, the documentation of the timeAllowed parameter is
quite short and does not explain how the parameter is treated
internally. Can anyone explain this behavior?

Best regards,
Christof



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-time-exceeds-timeout-specified-by-parameter-timeAllowed-tp4070266.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Got ping response for sessionid every ms at CloudSolrServer

2013-06-13 Thread Shawn Heisey

On 6/13/2013 8:19 AM, Furkan KAMACI wrote:

17:16:56.560 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:59.897 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:17:03.232 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms

it is too often. Is it usual?


This logging comes from Zookeeper -- google ClientCnxn to verify.  It's 
only pinging once every three seconds.  That's not very often.  It is 
not happening every millisecond.


You really shouldn't run with all logs at DEBUG.  It's a huge amount of 
information, just logging it can cause performance issues, and it will 
show you things like the above that might seem bad, but aren't.


INFO logging is as detailed as most people should ever need to go, 
unless they are trying to modify the source code.


If you are trying to debug something in particular, you should leave the 
default logging in the log4j config file at INFO or WARN, then turn up 
the logging on specific Solr classes using the admin UI.  If the 
specific class you want to debug isn't in the UI, you can add any class 
you want to the logging config file.
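
A rough sketch of what that looks like in log4j.properties (the class names
are only examples; substitute whatever you actually need to watch):

  log4j.rootLogger=INFO, file
  # quiet the ZooKeeper ping chatter
  log4j.logger.org.apache.zookeeper.ClientCnxn=WARN
  # turn one specific Solr class up to DEBUG
  log4j.logger.org.apache.solr.update.UpdateHandler=DEBUG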


Thanks,
Shawn



Re: Solr - Get DocID of search result

2013-06-13 Thread Jack Krupansky

You can use a Solr transformer to get the Lucene docID in the fl parameter:

  &fl=id,[docid],score,my-field,...

But... you can't use the Lucene docId in a query.

Relevancy and sorting, not to mention updating of existing documents, can 
change the order of results so that docId is not a good indicator of 
document "order".


But, rather than focus prematurely on a solution, what exactly is the 
problem you are trying to solve? What exactly is duplicated?


-- Jack Krupansky

-Original Message- 
From: vrparekh

Sent: Thursday, June 13, 2013 11:24 AM
To: solr-user@lucene.apache.org
Subject: Solr - Get DocID of search result

Hello All,

How can I get docID of result from solr?

What I am doing currently is,

I do search request in solr.

I get certain records (Say 10).

   solrurl/start=0&rows=10

Now, again I do search request with below

   solrurl/start=10&rows=10

So i get next 10 records.

Now new records are inserted in solr (Say 10 records).
and Now If I do request again by
   solrurl/start=20&rows=10

So I might get repeated records.

So if I have docID of than I can query by less than that docID.

So is it possible to get docID?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Get-DocID-of-search-result-tp4070253.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Best way to match umlauts

2013-06-13 Thread Jack Krupansky
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
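
For context, a minimal sketch of where that charFilter sits in a field type
(the type name and the rest of the chain are illustrative):

  <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Applied at both index and query time, "Brüno" and "Bruno" then normalize to
the same token.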


-- Jack Krupansky

-Original Message- 
From: jimtronic

Sent: Thursday, June 13, 2013 11:31 AM
To: solr-user@lucene.apache.org
Subject: Best way to match umlauts

I'm trying to make Brüno come up in my results when the user types in
"Bruno".

What's the best way to accomplish this?

Using Solr 4.2



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Best way to match umlauts

2013-06-13 Thread jimtronic
I'm trying to make Brüno come up in my results when the user types in
"Bruno". 

What's the best way to accomplish this?

Using Solr 4.2



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-way-to-match-umlauts-tp4070256.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr - Get DocID of search result

2013-06-13 Thread vrparekh
Hello All,

How can I get the docID of a result from Solr?

What I am doing currently is:

I do a search request in Solr.

I get a certain number of records (say 10).

solrurl/start=0&rows=10

Now, again I do a search request with the below:

solrurl/start=10&rows=10

So I get the next 10 records.

Now new records are inserted in Solr (say 10 records),
and now if I do the request again with
solrurl/start=20&rows=10

I might get repeated records.

So if I have the docID, then I can query by less than that docID.

So is it possible to get the docID?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Get-DocID-of-search-result-tp4070253.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Adding pdf/word file using JSON/XML

2013-06-13 Thread Walter Underwood
That was my thought exactly. Contribute a REST request handler. --wunder

On Jun 13, 2013, at 6:04 AM, Alexandre Rafalovitch wrote:

> And sometimes useful projects come out from the annoying, confusing
> corner situations like yours.
> 
> See if you can get permission to open-source your implementation and
> you may find more people interested in the same thing. It could also
> be a good visibility for your consultancy. Worst case, there are some
> good blog articles in that.
> 
> Regards,
>   Alex.
> 
> On Thu, Jun 13, 2013 at 3:32 AM, Roland Everaert  wrote:
>> To conclude, yesterday I discuss with the team and we decide that I will
>> provide a RESTful web service that will hide the access to the indexers
>> among other things, so even the .NET guy will be able to use it. That will
>> allow me to study REST and, I hope, make clearer questions in the future.
> 
> 
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)

--
Walter Underwood
wun...@wunderwood.org





Re: Dynamically create new fields

2013-06-13 Thread Steve Rowe
I wrote a blog post about this stuff here: 
. - Steve

On Jun 12, 2013, at 3:26 PM, Chris Hostetter  wrote:

> 
> : Dynamically adding fields to schema is yet to get released..
> : 
> : https://issues.apache.org/jira/browse/SOLR-3251
> 
> Just to clarify...
> 
> *explicitly* adding fields dynamicly based on client commands has been 
> implimented and will be included in Solr 4.4
> 
> *implicitly* adding fields dynamically based on what documents are added 
> to the index is a feature that sarowe is still currently working on...
> 
> https://issues.apache.org/jira/browse/SOLR-3250
> 
> 
> -Hoss



Re: analyzer for Code

2013-06-13 Thread Walter Underwood
It could be pretty complicated to do well.

I'm pretty sure that Krugle is based on Solr: http://opensearch.krugle.org/

You might also look at the UI for Ohloh (used to be Koders): 
http://code.ohloh.net/

wunder

On Jun 13, 2013, at 1:19 AM, Gian Maria Ricci wrote:

> I did a little search around and did not find anything interesting. Anyone 
> know if some analyzers exists to better index source code (es C#, C++. Java 
> etc)?
>  
> Standard analyzer is quite good, but I wish to know if there are some more 
> specific analyzers that can do a better indexing. Es I did a little try with 
> C# and the full class name was indexed without splitting by dots. So 
> MyLib.Helpers.Myclass becomes one token and when I search for MyClass I did 
> not find matches.
>  
> Thanks in advance.
>  
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
> 
>  
>  

--
Walter Underwood
wun...@wunderwood.org





Got ping response for sessionid every ms at CloudSolrServer

2013-06-13 Thread Furkan KAMACI
I am using CloudSolrServer in my application. When I look at the output I see a bunch
of these messages:

17:16:33.205 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:36.542 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:39.879 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:43.216 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:46.552 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:49.889 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:53.225 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:56.560 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:16:59.897 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms
17:17:03.232 [tion(localhost:9983)] DEBUG ClientCnxn - Got ping response
for sessionid: 0x13f3c94662c0026 after 0ms

It seems too frequent. Is that usual?


Re: Apache Nutch data to Solr 4.3 schema issues ?

2013-06-13 Thread Tony Mullins
OK. Thanks.

Tony.


On Thu, Jun 13, 2013 at 7:02 PM, Shawn Heisey  wrote:

> > Hi.
> >
> > I was hoping by replacing Nutch provided schema to my Solr schema ( as
> the
> > described by Nutch documentation) would solve all my problems.
> >
> > So you are suggesting I edit my existing Solr schema and just add the
> > additional information found in Nutch-Solr schema line by line
>
> I hate to tell you to do such a labor intensive process, but Jack is right.
>
> The fact that you had to add the _version_ field means that Nutch 2.2 Had
> a schema designed for a Solr release prior to 4.0, which was released last
> October. There have been four Solr releases since then and another should
> be out in The next few days.
>
> Someone on the nutch mailing list might have a schema designed to work
> with Solr 4.x, and you might also want to look in the source code
> repository for Nutch. I'm on my phone so it's difficult to interrupt this
> email in progress to locate resources for you.
>
> Thanks,
> Shawn
>
>
>


Re: Apache Nutch data to Solr 4.3 schema issues ?

2013-06-13 Thread Shawn Heisey
> Hi.
>
> I was hoping by replacing Nutch provided schema to my Solr schema ( as the
> described by Nutch documentation) would solve all my problems.
>
> So you are suggesting I edit my existing Solr schema and just add the
> additional information found in Nutch-Solr schema line by line

I hate to tell you to do such a labor-intensive process, but Jack is right.

The fact that you had to add the _version_ field means that Nutch 2.2 had
a schema designed for a Solr release prior to 4.0, which was released last
October. There have been four Solr releases since then, and another should
be out in the next few days.

Someone on the nutch mailing list might have a schema designed to work
with Solr 4.x, and you might also want to look in the source code
repository for Nutch. I'm on my phone so it's difficult to interrupt this
email in progress to locate resources for you.

Thanks,
Shawn




Re: Sorting by field is slow

2013-06-13 Thread Shane Perry
Erick,

We do have soft commits turned on.  Initially, autoCommit was set at 15000 and
autoSoftCommit at 1000.  We did up those to 120 and 60
respectively.  However, since the core in question is a slave, we don't
actually do writes to the core but rely on replication only to populate the
index.  In this case wouldn't autoCommit and autoSoftCommit essentially be
no-ops?  I thought I had pulled out all hard commits but a double check
shows one instance where it still occurs.
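
For reference, a sketch of the solrconfig.xml block in question, using the
initial values mentioned above (the openSearcher line is an assumption here,
though it is the usual companion setting):

  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>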

Thanks for your time.

Shane

On Thu, Jun 13, 2013 at 5:19 AM, Erick Erickson wrote:

> Shane:
>
> You've covered all the config stuff that I can think of. There's one
> other possibility. Do you have the soft commits turned on and are
> they very short? Although soft commits shouldn't invalidate any
> segment-level caches (but I'm not sure whether the sorting buffers
> are low-level or not).
>
> About the only other thing I can think of is that you're somehow
> doing hard commits from, say, the client but that's really
> stretching.
>
> All I can really say at this point is that this isn't a problem I've seen
> before, so it's _likely_ some innocent-seeming config has changed.
> I'm sure it'll be obvious once you find it ...
>
> Erick
>
> On Wed, Jun 12, 2013 at 11:51 PM, Shane Perry  wrote:
> > Erick,
> >
> > I agree, it doesn't make sense.  I manually merged the solrconfig.xml
> from
> > the distribution example with my 3.6 solrconfig.xml, pulling out what I
> > didn't need.  There is the possibility I removed something I shouldn't
> have
> > though I don't know what it would be.  Minus removing the dynamic
> fields, a
> > custom tokenizer class, and changing all my fields to be stored, the
> > schema.xml file should be the same as well.  I'm not currently in the
> > position to do so, but I'll double check those two files.  Finally, the
> > data was re-indexed when I moved to 4.3.
> >
> > My statement about field values wasn't stated very well.  What I meant is
> > that the 'text' field has more unique terms than some of my other fields.
> >
> > As for this being an edge case, I'm not sure why it would manifest itself
> > in 4.3 but not in 3.6 (short of me having a screwy configuration
> setting).
> >  If I get a chance, I'll see if I can duplicate the behavior with a small
> > document count in a sandboxed environment.
> >
> > Shane
> >
> > On Wed, Jun 12, 2013 at 5:14 PM, Erick Erickson  >wrote:
> >
> >> This doesn't make much sense, particularly the fact
> >> that you added first/new searchers. I'm assuming that
> >> these are sorting on the same field as your slow query.
> >>
> >> But sorting on a text field for which
> >> "Overall, the values of the field are unique"
> >> is a red-flag. Solr doesn't sort on fields that have
> >> more than one term, so you might as well use a
> >> string field and be done with it, it's possible you're
> >> hitting some edge case.
> >>
> >> Did you just copy your 3.6 schema and configs to
> >> 4.3? Did you re-index?
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Jun 12, 2013 at 5:11 PM, Shane Perry  wrote:
> >> > Thanks for the responses.
> >> >
> >> > Setting first/newSearcher had no noticeable effect.  I'm sorting on a
> >> > stored/indexed field named 'text' who's fieldType is solr.TextField.
> >> >  Overall, the values of the field are unique. The JVM is only using
> about
> >> > 2G of the available 12G, so no OOM/GC issue (at least on the surface).
> >>  The
> >> > server is question is a slave with approximately 56 million documents.
> >> >  Additionally, sorting on a field of the same type but with
> significantly
> >> > less uniqueness results quick response times.
> >> >
> >> > The following is a sample of *debugQuery=true* for a query which
> returns
> >> 1
> >> > document:
> >> >
> >> > 
> >> >   61458.0
> >> >   
> >> > 61452.0
> >> >   
> >> >   
> >> > 0.0
> >> >   
> >> >   
> >> > 0.0
> >> >   
> >> >   
> >> > 0.0
> >> >   
> >> >   
> >> > 0.0
> >> >   
> >> >   
> >> > 6.0
> >> >   
> >> > 
> >> >
> >> >
> >> > -- Update --
> >> >
> >> > Out of desperation, I turned off replication by commenting out the
>> > *<lst name="slave">* element in the replication requestHandler block.  After
> >> > restarting tomcat I was surprised to find that the replication admin
> UI
> >> > still reported the core as replicating.  Search queries were still
> slow.
> >>  I
> >> > then disabled replication via the UI and the display updated to report
> >> the
> >> > core was no longer replicating.  Queries are now fast so it appears
> that
> >> > the sorting may be a red-herring.
> >> >
> >> > It's may be of note to also mention that the slow queries don't
> appear to
> >> > be getting cached.
> >> >
> >> > Thanks again for the feed back.
> >> >
> >> > On Wed, Jun 12, 2013 at 2:33 PM, Jack Krupansky <
> j...@basetechnology.com
> >> >wrote:
> >> >
> >> >> Rerun the sorted query with &debugQuery=true and look at the module
> >> >> timings. See what stands out
> >> >>
> >> >> Ar

Re: Apache Nutch data to Solr 4.3 schema issues ?

2013-06-13 Thread Jack Krupansky
I don't know what choice you have until somebody on the Nutch project takes 
the time to do the same thing and update their schema to 4.3. They should 
keep a schema for every Solr release.


-- Jack Krupansky

-Original Message- 
From: Tony Mullins

Sent: Thursday, June 13, 2013 9:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Apache Nutch data to Solr 4.3 schema issues ?

Hi.

I was hoping by replacing Nutch provided schema to my Solr schema ( as the
described by Nutch documentation) would solve all my problems.

So you are suggesting I edit my existing Solr schema and just add the
additional information found in Nutch-Solr schema line by line .

Thanks,
Tony.


On Thu, Jun 13, 2013 at 5:06 PM, Jack Krupansky 
wrote:



Look further down in the stack trace in the Solr log for the final "Caused
By:".

And better to start with the Solr 4.3 schema and config files and then
merge in your Nutch changes one line at a time.

-- Jack Krupansky

-Original Message- From: Tony Mullins
Sent: Thursday, June 13, 2013 3:56 AM
To: solr-user@lucene.apache.org
Subject: Apache Nutch data to Solr 4.3 schema issues ?


Hi ,

I am trying to index my Solr 4.3 from Apache Nutch 2.2 data. And for that 
I

have copied the schema-solr4.xml from Nutch2.2 runtime/local/conf and
pasted it to my SolrHome solr/collection1/conf.

My Solr4.3 is hosted in Tomcat. And initially when I tried
http://localhost:8080/solr/#/**collection1
it wasn't working and on further investigation I found _version_ field was
missing so I added this field as
<field name="_version_" type="long" indexed="true" stored="true"/> and it
started working ok.

And now when I try 
http://localhost:8080/solr/**collection1/browse... 
it

shows me errors like
"HTTP Status 500 - {msg=lazy loading
error,trace=org.apache.solr.**common.SolrException: lazy loading error at
org.apache.solr.core.SolrCore$**LazyQueryResponseWriterWrapper**
.getWrappedWriter(SolrCore.**java:2260)
at
org.apache.solr.core.SolrCore$**LazyQueryResponseWriterWrapper**
.getContentType(SolrCore.java:**2279)
at
org.apache.solr.servlet.**SolrDispatchFilter.**writeResponse(**
SolrDispatchFilter.java:623)
at
org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
SolrDispatchFilter.java:372)
at
org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
SolrDispatchFilter.java:155)
at
org.apache.catalina.core.**ApplicationFilterChain.**internalDoFilter(**
ApplicationFilterChain.java:**243)
at
org.apache.catalina.core.**ApplicationFilterChain.**doFilter(**
ApplicationFilterChain.java:**210)
at "

So could you please guide me that whats missing now ? is this again due to
any schema issue or something else ?

Thanks,
Tony





Re: Apache Nutch data to Solr 4.3 schema issues ?

2013-06-13 Thread Tony Mullins
Hi.

I was hoping that replacing my Solr schema with the Nutch-provided schema (as
described by the Nutch documentation) would solve all my problems.

So you are suggesting I edit my existing Solr schema and just add the
additional information found in the Nutch Solr schema line by line.

Thanks,
Tony.


On Thu, Jun 13, 2013 at 5:06 PM, Jack Krupansky wrote:

> Look further down in the stack trace in the Solr log for the final "Caused
> By:".
>
> And better to start with the Solr 4.3 schema and config files and then
> merge in your Nutch changes one line at a time.
>
> -- Jack Krupansky
>
> -Original Message- From: Tony Mullins
> Sent: Thursday, June 13, 2013 3:56 AM
> To: solr-user@lucene.apache.org
> Subject: Apache Nutch data to Solr 4.3 schema issues ?
>
>
> Hi ,
>
> I am trying to index my Solr 4.3 from Apache Nutch 2.2 data. And for that I
> have copied the schema-solr4.xml from Nutch2.2 runtime/local/conf and
> pasted it to my SolrHome solr/collection1/conf.
>
> My Solr4.3 is hosted in Tomcat. And initially when I tried
> http://localhost:8080/solr/#/**collection1
> it wasn't working and on further investigation I found _version_ field was
> missing so I added this field as
> <field name="_version_" type="long" indexed="true" stored="true"/> and it
> started working ok.
>
> And now when I try 
> http://localhost:8080/solr/**collection1/browse...
>  it
> shows me errors like
> "HTTP Status 500 - {msg=lazy loading
> error,trace=org.apache.solr.**common.SolrException: lazy loading error at
> org.apache.solr.core.SolrCore$**LazyQueryResponseWriterWrapper**
> .getWrappedWriter(SolrCore.**java:2260)
> at
> org.apache.solr.core.SolrCore$**LazyQueryResponseWriterWrapper**
> .getContentType(SolrCore.java:**2279)
> at
> org.apache.solr.servlet.**SolrDispatchFilter.**writeResponse(**
> SolrDispatchFilter.java:623)
> at
> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
> SolrDispatchFilter.java:372)
> at
> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
> SolrDispatchFilter.java:155)
> at
> org.apache.catalina.core.**ApplicationFilterChain.**internalDoFilter(**
> ApplicationFilterChain.java:**243)
> at
> org.apache.catalina.core.**ApplicationFilterChain.**doFilter(**
> ApplicationFilterChain.java:**210)
> at "
>
> So could you please guide me that whats missing now ? is this again due to
> any schema issue or something else ?
>
> Thanks,
> Tony
>


Re: Adding pdf/word file using JSON/XML

2013-06-13 Thread Alexandre Rafalovitch
And sometimes useful projects come out from the annoying, confusing
corner situations like yours.

See if you can get permission to open-source your implementation and
you may find more people interested in the same thing. It could also
be a good visibility for your consultancy. Worst case, there are some
good blog articles in that.

Regards,
   Alex.

On Thu, Jun 13, 2013 at 3:32 AM, Roland Everaert  wrote:
> To conclude, yesterday I discuss with the team and we decide that I will
> provide a RESTful web service that will hide the access to the indexers
> among other things, so even the .NET guy will be able to use it. That will
> allow me to study REST and, I hope, make clearer questions in the future.



Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: Adding pdf/word file using JSON/XML

2013-06-13 Thread Jack Krupansky
Thanks. And I apologize for the fact that Solr doesn't have a clean and true 
REST API (like ElasticSearch!) - even though it's not my fault!


An app-specific REST API is the way to go. Solr is too much of a beast for 
average app developers to master.


Let us know of any additional, specific questions.

-- Jack Krupansky

-Original Message- 
From: Roland Everaert

Sent: Thursday, June 13, 2013 3:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Adding pdf/word file using JSON/XML

I apologize also for my obscure questions and I thanks you and the list for
your help so far and the very clear explanation you give about the
behaviour of Solr and SolrCell.

I am effectively an intermediary between the list and the dev, because our
development process is not efficient. The full story is (beware its
boring), we are a bunch of devs in a consultancy company waiting for the
next mission. In the mean time, our boss gives us something to do, but
instead of developing a big application where each dev has a module to care
of, or working each on its own machine. We have to develop the same
application with various technologies/tools/language. One is using .NET,
another is using Java and the spring framework and the 3rd one is using
JavaEE. And I am in the middle as a sysadmin/dba/investigator of tools and
API/provider of information and transparent API for everybody while
managing 3 databases, 2 application servers and 2 different indexers on the
same server and take into consideration that at some points in time the
devs will interchange their tools (rdbms and/or indexers) *now you can
breath*.

Top that with the fact that, one of the dev is experienced in REST and web
technologies (the IDIOT ;)) and that I have misread the first line of the
Solr feature page (Solr is a standalone enterprise search server with a
REST-like API), I actually communicate that Solr provides a RESTful API.

So I think I am a bit overwhelmed by the task at hand.

To conclude, yesterday I discuss with the team and we decide that I will
provide a RESTful web service that will hide the access to the indexers
among other things, so even the .NET guy will be able to use it. That will
allow me to study REST and, I hope, make clearer questions in the future.

Thanks again for your help and your patience,


Roland Everaert.




On Wed, Jun 12, 2013 at 4:18 PM, Jack Krupansky 
wrote:



I'm sorry if I came across as aggressive or insulting - I'm only trying to
dig down to what your actual difficulty is - and you have been making that
extremely difficult for all of us. You need to help us all out here by more
clearly expressing what your actual problem is. You will have to excuse the
rest of us if we are unable to read your mind!

It sounds as if you are an intermediary between your devs and this list.
That's NOT a very effective communications strategy! You need to either
have your devs communicate directly on this list, or you need to do a much
better job of understanding what their actual problem is and then
communicate that actual problem to this list, plainly and clearly.

TRYING to read your mind (and indirectly your devs' minds as well - not an
easy task!), and reading between the lines, it is starting to sound as if
you (or/and your devs) are not clear on how Solr works as a "database".

Core Solr does have full CRUD (Add or Create, Read or Query, Update, and
Delete), although not in a strict, pure REST sense, that is true.

A "full" update in Solr is the same as an Add - add a new, fresh document,
and then delete the old document. Some people call this an "Upsert"
(combination of Update or Insert).

There are really two forms of update (a difficulty in REST): 1) full
update or "replace" - equal to a delete and an add, and 2) partial or
incremental update. True REST only has the latter

Core Solr does have support for partial or incremental Update with Atomic
Updates. Solr will in fact retain the existing data and only update any new
field values that are supplied on the update request.
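
A quick sketch of what such an atomic update looks like on the wire (the
URL, document id, and field name are illustrative):

  curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[{"id":"doc1","popularity":{"set":42}}]'

Only the fields named in the request change; the rest of doc1 is carried
over, which requires the other fields to be stored.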

SolrCell (Extracting RequestHandler or "/update/extract") is not a core
part of Solr. It is an add on "contrib" module. It does not have full CRUD
- no delete, and no partial update, but it does support add and full 
update.


As someone else already suggested, you can do the work of SolrCell
yourself by calling Tika directly in your app layer and then sending normal
Solr CRUD requests.


-- Jack Krupansky

-Original Message- From: Roland Everaert
Sent: Wednesday, June 12, 2013 5:21 AM

To: solr-user@lucene.apache.org
Subject: Re: Adding pdf/word file using JSON/XML

1) Being aggressive and insulting is not a way to help people understand
such complex tool or to help people in general.

2) I read again the feature page of Solr and it is stated that the
interface is REST-like and not RESTful as I though in the first place, and
communicate to the devs. And as the devs told me a RESTful interface
doesn't use parameters in the URI/URL, so

Re: Solr 3.5 Optimization takes index file size almost double

2013-06-13 Thread Rafał Kuć
Hello!

Do you have some backup-after-commit setting in your configuration? It would
also be good to see how your index directory looks; can you list
it?
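
For reference, the kind of setting meant here is the ReplicationHandler
backup trigger in solrconfig.xml; a sketch (if something like this is
present, snapshot copies can account for the extra space on disk):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="backupAfter">commit</str>
    </lst>
  </requestHandler>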

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch

> Thanks Rafal for reply...

> I agree with you. But Actually After optimization , it does not reduce size
> and it remains double. so is there any thing we missed or need to do for
> achieving index size reduction ?

> Is there any special setting we need to configure for replication?




> On 13 June 2013 16:53, Rafał Kuć  wrote:

>> Hello!
>>
>> Optimize command needs to rewrite the segments, so while it is
>> still working you may see the index size to be doubled. However after
>> it is finished the index size will be usually lowered comparing to the
>> index size before optimize.
>>
>> --
>> Regards,
>>  Rafał Kuć
>>  Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch
>>
>> > Hi,
>> > I have solr server 1.4.1 with index file size 428GB.Now When I upgrade
>> solr
>> > Server 1.4.1 to Solr 3.5.0 by replication method. Size remains same.
>> > But when optimize index for Solr 3.5.0 instance its size reaches 791GB.so
>> > what is solutions for size remains same or lesser.
>> > I optimize Solr 3.5 with Query:
>> > /update?optimize=true&commit=true
>>
>> > Thanks & regards
>> > Viresh Modi
>>
>>



Re: Apache Nutch data to Solr 4.3 schema issues ?

2013-06-13 Thread Jack Krupansky
Look further down in the stack trace in the Solr log for the final "Caused 
By:".


And better to start with the Solr 4.3 schema and config files and then merge 
in your Nutch changes one line at a time.


-- Jack Krupansky

-Original Message- 
From: Tony Mullins

Sent: Thursday, June 13, 2013 3:56 AM
To: solr-user@lucene.apache.org
Subject: Apache Nutch data to Solr 4.3 schema issues ?

Hi ,

I am trying to index my Solr 4.3 from Apache Nutch 2.2 data. And for that I
have copied the schema-solr4.xml from Nutch2.2 runtime/local/conf and
pasted it to my SolrHome solr/collection1/conf.

My Solr4.3 is hosted in Tomcat. And initially when I tried
http://localhost:8080/solr/#/collection1
it wasn't working and on further investigation I found _version_ field was
missing so I added this field as
<field name="_version_" type="long" indexed="true" stored="true"/> and it
started working ok.

And now when I try http://localhost:8080/solr/collection1/browse ... it
shows me errors like
"HTTP Status 500 - {msg=lazy loading
error,trace=org.apache.solr.common.SolrException: lazy loading error at
org.apache.solr.core.SolrCore$LazyQueryResponseWriterWrapper.getWrappedWriter(SolrCore.java:2260)
at
org.apache.solr.core.SolrCore$LazyQueryResponseWriterWrapper.getContentType(SolrCore.java:2279)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:623)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:372)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at "

So could you please guide me that whats missing now ? is this again due to
any schema issue or something else ?

Thanks,
Tony 



Re: Solr 3.5 Optimization takes index file size almost double

2013-06-13 Thread Viresh Modi
Thanks Rafal for the reply...

I agree with you. But actually, after optimization it does not reduce the size;
it remains double. So is there anything we missed or need to do to
achieve the index size reduction?

Is there any special setting we need to configure for replication?




On 13 June 2013 16:53, Rafał Kuć  wrote:

> Hello!
>
> Optimize command needs to rewrite the segments, so while it is
> still working you may see the index size to be doubled. However after
> it is finished the index size will be usually lowered comparing to the
> index size before optimize.
>
> --
> Regards,
>  Rafał Kuć
>  Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch
>
> > Hi,
> > I have solr server 1.4.1 with index file size 428GB.Now When I upgrade
> solr
> > Server 1.4.1 to Solr 3.5.0 by replication method. Size remains same.
> > But when optimize index for Solr 3.5.0 instance its size reaches 791GB.so
> > what is solutions for size remains same or lesser.
> > I optimize Solr 3.5 with Query:
> > /update?optimize=true&commit=true
>
> > Thanks & regards
> > Viresh Modi
>
>

-- 

--
This email and its attachments are intended for the above named only and 
may be confidential. If they have come to you in error you must take no 
action based on them, nor must you copy or show them to anyone; please 
reply to this email and highlight the error.


Re: Solr 3.5 Optimization takes index file size almost double

2013-06-13 Thread Rafał Kuć
Hello!

Optimize command needs to rewrite the segments, so while it is
still working you may see the index size to be doubled. However after
it is finished the index size will be usually lowered comparing to the
index size before optimize.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch

> Hi,
> I have solr server 1.4.1 with index file size 428GB.Now When I upgrade solr
> Server 1.4.1 to Solr 3.5.0 by replication method. Size remains same.
> But when optimize index for Solr 3.5.0 instance its size reaches 791GB.so
> what is solutions for size remains same or lesser.
> I optimize Solr 3.5 with Query:
> /update?optimize=true&commit=true

> Thanks & regards
> Viresh Modi



Re: analyzer for Code

2013-06-13 Thread Erick Erickson
Well, WordDelimiterFilterFactory would split on the punctuation, so
you could add it to the analyzer chain along with StandardAnalyzer.

You could use one of the regex filters to break up tokens that make it
through the analyzer as you see fit.
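
A rough sketch of such a chain (untested; the field type name and filter
parameters are illustrative and should be tuned against your own code
samples on the admin Analysis page):

  <fieldType name="text_code" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" splitOnCaseChange="1"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With this, MyLib.Helpers.Myclass is at least split on the dots and case
changes, so searches on the individual parts (helpers, lib, ...) can match.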

But in general, this will be a bunch of compromises since programming
languages are, shall we say, not standard 

Best
Erick


On Thu, Jun 13, 2013 at 4:19 AM, Gian Maria Ricci
wrote:

> I did a little search around and did not find anything interesting. Anyone
> know if some analyzers exists to better index source code (es C#, C++. Java
> etc)?
>
> ** **
>
> Standard analyzer is quite good, but I wish to know if there are some more
> specific analyzers that can do a better indexing. Es I did a little try
> with C# and the full class name was indexed without splitting by dots. So
> MyLib.Helpers.Myclass becomes one token and when I search for MyClass I did
> not find matches. 
>
> ** **
>
> Thanks in advance.
>
> ** **
>
> --
>
> Gian Maria Ricci
>
> Mobile: +39 320 0136949
>
>


Re: Sorting by field is slow

2013-06-13 Thread Erick Erickson
Shane:

You've covered all the config stuff that I can think of. There's one
other possibility. Do you have the soft commits turned on and are
they very short? Although soft commits shouldn't invalidate any
segment-level caches (but I'm not sure whether the sorting buffers
are low-level or not).

About the only other thing I can think of is that you're somehow
doing hard commits from, say, the client but that's really
stretching.

All I can really say at this point is that this isn't a problem I've seen
before, so it's _likely_ some innocent-seeming config has changed.
I'm sure it'll be obvious once you find it ...

Erick

On Wed, Jun 12, 2013 at 11:51 PM, Shane Perry  wrote:
> Erick,
>
> I agree, it doesn't make sense.  I manually merged the solrconfig.xml from
> the distribution example with my 3.6 solrconfig.xml, pulling out what I
> didn't need.  There is the possibility I removed something I shouldn't have
> though I don't know what it would be.  Minus removing the dynamic fields, a
> custom tokenizer class, and changing all my fields to be stored, the
> schema.xml file should be the same as well.  I'm not currently in the
> position to do so, but I'll double check those two files.  Finally, the
> data was re-indexed when I moved to 4.3.
>
> My statement about field values wasn't stated very well.  What I meant is
> that the 'text' field has more unique terms than some of my other fields.
>
> As for this being an edge case, I'm not sure why it would manifest itself
> in 4.3 but not in 3.6 (short of me having a screwy configuration setting).
>  If I get a chance, I'll see if I can duplicate the behavior with a small
> document count in a sandboxed environment.
>
> Shane
>
> On Wed, Jun 12, 2013 at 5:14 PM, Erick Erickson 
> wrote:
>
>> This doesn't make much sense, particularly the fact
>> that you added first/new searchers. I'm assuming that
>> these are sorting on the same field as your slow query.
>>
>> But sorting on a text field for which
>> "Overall, the values of the field are unique"
>> is a red-flag. Solr doesn't sort on fields that have
>> more than one term, so you might as well use a
>> string field and be done with it, it's possible you're
>> hitting some edge case.
>>
>> Did you just copy your 3.6 schema and configs to
>> 4.3? Did you re-index?
>>
>> Best
>> Erick
>>
>> On Wed, Jun 12, 2013 at 5:11 PM, Shane Perry  wrote:
>> > Thanks for the responses.
>> >
>> > Setting first/newSearcher had no noticeable effect.  I'm sorting on a
>> > stored/indexed field named 'text' who's fieldType is solr.TextField.
>> >  Overall, the values of the field are unique. The JVM is only using about
>> > 2G of the available 12G, so no OOM/GC issue (at least on the surface).
>>  The
>> > server is question is a slave with approximately 56 million documents.
>> >  Additionally, sorting on a field of the same type but with significantly
>> > less uniqueness results quick response times.
>> >
>> > The following is a sample of *debugQuery=true* for a query which returns
>> 1
>> > document:
>> >
>> > 
>> >   61458.0
>> >   
>> > 61452.0
>> >   
>> >   
>> > 0.0
>> >   
>> >   
>> > 0.0
>> >   
>> >   
>> > 0.0
>> >   
>> >   
>> > 0.0
>> >   
>> >   
>> > 6.0
>> >   
>> > 
>> >
>> >
>> > -- Update --
>> >
>> > Out of desperation, I turned off replication by commenting out the *<lst name="slave">* element in the replication requestHandler block.  After
>> > restarting tomcat I was surprised to find that the replication admin UI
>> > still reported the core as replicating.  Search queries were still slow.
>>  I
>> > then disabled replication via the UI and the display updated to report
>> the
>> > core was no longer replicating.  Queries are now fast so it appears that
>> > the sorting may be a red-herring.
>> >
>> > It's may be of note to also mention that the slow queries don't appear to
>> > be getting cached.
>> >
>> > Thanks again for the feed back.
>> >
>> > On Wed, Jun 12, 2013 at 2:33 PM, Jack Krupansky > >wrote:
>> >
>> >> Rerun the sorted query with &debugQuery=true and look at the module
>> >> timings. See what stands out
>> >>
>> >> Are you actually sorting on a "text" field, as opposed to a "string"
>> field?
>> >>
>> >> Of course, it's always possible that maybe you're hitting some odd
>> OOM/GC
>> >> condition as a result of Solr growing  between releases.
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -Original Message- From: Shane Perry
>> >> Sent: Wednesday, June 12, 2013 3:00 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Sorting by field is slow
>> >>
>> >>
>> >> In upgrading from Solr 3.6.1 to 4.3.0, our query response time has
>> >> increased exponentially.  After testing in 4.3.0 it appears the same
>> query
>> >> (with 1 matching document) returns after 100 ms without sorting but
>> takes 1
>> >> minute when sorting by a text field.  I've looked around but haven't yet
>> >> found a reason for the degradation.  Can someone give me some insight or
>> >> point

Solr 3.5 Optimization takes index file size almost double

2013-06-13 Thread Viresh Modi
Hi,
I have a Solr 1.4.1 server with an index file size of 428GB. Now when I upgrade the
Solr 1.4.1 server to Solr 3.5.0 by the replication method, the size remains the same.
But when I optimize the index on the Solr 3.5.0 instance, its size reaches 791GB. So
what is the solution for keeping the size the same or smaller?
I optimize Solr 3.5 with the query:
/update?optimize=true&commit=true

Thanks & regards
Viresh Modi

-- 

--
This email and its attachments are intended for the above named only and 
may be confidential. If they have come to you in error you must take no 
action based on them, nor must you copy or show them to anyone; please 
reply to this email and highlight the error.


Re: Configuring Solr to connect to a SQL server instance

2013-06-13 Thread Shalin Shekhar Mangar
Daniel, DIH JdbcDataSource does not support integrated security. You must
provide a username and password for it to work.
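
For example (the user and password values are placeholders):

  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://SomeNameHere\instanceName;databaseName=SomeOtherdatabase"
              user="yourUser"
              password="yourPassword"/>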


On Wed, Jun 12, 2013 at 11:06 PM, Daniel Mosesson  wrote:

> I currently have the following:
>
> I am running the example-DIH instance of solr, and it works fine.
> I then change the data-db-config.xml file to make the dataSource the
> following:
>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>   url="jdbc:sqlserver://SomeNameHere\instanceName "
>   integratedSecurity="true"
>   database="SomeOtherdatabase"
> />
>
> As far as I can tell from the SQL profiler, it is never able to log in, or
> even attempt to connect.
>
> I did get the jdbc  .jar file and sqljdbc_auth.dll file, and loaded them
> into example-DIH\solr\db\lib
>
> The error I am getting from the attempted import is as follows:
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
> execute query: select * from temp_ip_solr_test Processing Document # 1
>
> What could I be doing wrong?
> Solr version 4.3
>
> 
>
> **
> This e-mail message and any attachments are confidential. Dissemination,
> distribution or copying of this e-mail or any attachments by anyone other
> than the intended recipient is prohibited. If you are not the intended
> recipient, please notify Ipreo immediately by replying to this e-mail, and
> destroy all copies of this e-mail and any attachments. Thank you!
> **
>



-- 
Regards,
Shalin Shekhar Mangar.


DIH Update question

2013-06-13 Thread PeriS
> What would be the process to update a new record in an existing db using DIH?

Thanks



AW: SOLR-4641: Schema now throws exception on illegal field parameters.

2013-06-13 Thread uwe72
Erick, I think he didn't add the validate=false to a field, but globally to
the schema.xml/solrconfig.xml (I don't remember where exactly to define this
globally).

 

Von: Erick Erickson [via Lucene]
[mailto:ml-node+s472066n4070067...@n3.nabble.com] 
Gesendet: Donnerstag, 13. Juni 2013 00:51
An: uwe72
Betreff: Re: SOLR-4641: Schema now throws exception on illegal field
parameters.

 

bbarani: 

Where did you see this? I haven't seen it before and I get an error on 
startup if I add validate="false" to a  definition 

Thanks, 
Erick 

On Tue, Jun 11, 2013 at 12:33 PM, bbarani <[hidden email]> wrote: 


> I think if you use validate=false in schema.xml, field or dynamicField
level, 
> Solr will not disable validation. 
> 
> I think this only works in solr 4.3 and above.. 
> 
> 
> 
> -- 
> View this message in context:
http://lucene.472066.n3.nabble.com/SOLR-4641-Schema-now-throws-exception-on-illegal-field-parameters-tp4069622p4069688.html
> Sent from the Solr - User mailing list archive at Nabble.com. 

 

 NAML 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4641-Schema-now-throws-exception-on-illegal-field-parameters-tp4069622p4070159.html
Sent from the Solr - User mailing list archive at Nabble.com.

AW: SOLR-4641: Schema now throws exception on illegal field parameters.

2013-06-13 Thread uwe72
How can I load these custom properties with SolrJ?

 

Von: Erick Erickson [via Lucene]
[mailto:ml-node+s472066n4070068...@n3.nabble.com] 
Gesendet: Donnerstag, 13. Juni 2013 00:53
An: uwe72
Betreff: Re: SOLR-4641: Schema now throws exception on illegal field
parameters.

 

But see Steve Rowe's comments at 
https://issues.apache.org/jira/browse/SOLR-4641 and use custom child 
properties as: 

 
  VALUE   

  ... 
 

Best 
Erick 

On Wed, Jun 12, 2013 at 6:49 PM, Erick Erickson <[hidden email]> wrote: 


> bbarani: 
> 
> Where did you see this? I haven't seen it before and I get an error on 
> startup if I add validate="false" to a  definition 
> 
> Thanks, 
> Erick 
> 
> On Tue, Jun 11, 2013 at 12:33 PM, bbarani <[hidden email]> wrote: 
>> I think if you use validate=false in schema.xml, field or dynamicField
level, 
>> Solr will not disable validation. 
>> 
>> I think this only works in solr 4.3 and above.. 
>> 
>> 
>> 
>> -- 
>> View this message in context:
http://lucene.472066.n3.nabble.com/SOLR-4641-Schema-now-throws-exception-on-illegal-field-parameters-tp4069622p4069688.html
>> Sent from the Solr - User mailing list archive at Nabble.com. 

 

 NAML 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-4641-Schema-now-throws-exception-on-illegal-field-parameters-tp4069622p4070160.html
Sent from the Solr - User mailing list archive at Nabble.com.

analyzer for Code

2013-06-13 Thread Gian Maria Ricci
I did a little search around and did not find anything interesting. Does
anyone know if any analyzers exist to better index source code (e.g. C#, C++,
Java, etc.)?

 

The standard analyzer is quite good, but I wish to know if there are more
specific analyzers that can do better indexing. E.g., I did a little test with
C#, and the full class name was indexed without splitting on dots. So
MyLib.Helpers.MyClass becomes one token, and when I search for MyClass I do
not find matches.
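
One possible direction, sketched as a schema.xml fieldType (the name
text_code and the exact pattern are assumptions, not a stock analyzer): the
pattern tokenizer splits identifiers on dots and other punctuation, and the
word delimiter filter additionally splits camelCase, so MyLib.Helpers.MyClass
produces tokens that match a search for MyClass:

<fieldType name="text_code" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on anything that is not an identifier character, e.g. dots -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^A-Za-z0-9_]+"/>
    <!-- also split camelCase parts, keeping the original token -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" preserveOriginal="1"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>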

 

Thanks in advance.

 

--

Gian Maria Ricci

Mobile: +39 320 0136949



Apache Nutch data to Solr 4.3 schema issues?

2013-06-13 Thread Tony Mullins
Hi ,

I am trying to index Apache Nutch 2.2 data into Solr 4.3. For that I have
copied the schema-solr4.xml from the Nutch 2.2 runtime/local/conf and pasted
it into my Solr home at solr/collection1/conf.

My Solr 4.3 is hosted in Tomcat. Initially, when I tried
http://localhost:8080/solr/#/collection1
it wasn't working, and on further investigation I found the _version_ field
was missing, so I added it as
<field name="_version_" type="long" indexed="true" stored="true"/> and it
started working OK.

And now when I try http://localhost:8080/solr/collection1/browse ... it
shows me errors like
"HTTP Status 500 - {msg=lazy loading
error,trace=org.apache.solr.common.SolrException: lazy loading error at
org.apache.solr.core.SolrCore$LazyQueryResponseWriterWrapper.getWrappedWriter(SolrCore.java:2260)
at
org.apache.solr.core.SolrCore$LazyQueryResponseWriterWrapper.getContentType(SolrCore.java:2279)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:623)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:372)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at "

So could you please guide me on what's missing now? Is this again due to a
schema issue or something else?

Thanks,
Tony
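
For what it's worth, a lazy loading error on /browse often means the lazily
loaded VelocityResponseWriter could not find its jars. A sketch of the lib
directives from the stock example solrconfig.xml that load them; the relative
paths are assumptions and must be adjusted to the actual Tomcat layout:

<!-- jars needed by the velocity-based /browse handler -->
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />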


Re: Adding pdf/word file using JSON/XML

2013-06-13 Thread Roland Everaert
I also apologize for my obscure questions, and I thank you and the list for
your help so far and the very clear explanations you give about the behaviour
of Solr and SolrCell.

I am effectively an intermediary between the list and the devs, because our
development process is not efficient. The full story is (beware, it's boring):
we are a bunch of devs in a consultancy company waiting for the next mission.
In the meantime, our boss gives us something to do, but instead of developing
a big application where each dev has a module to take care of, or each working
on his own machine, we have to develop the same application with various
technologies/tools/languages. One is using .NET, another is using Java and the
Spring framework, and the third one is using Java EE. I am in the middle as a
sysadmin/DBA/investigator of tools and APIs/provider of information and
transparent APIs for everybody, while managing 3 databases, 2 application
servers and 2 different indexers on the same server, taking into consideration
that at some point in time the devs will interchange their tools (RDBMS and/or
indexers) *now you can breathe*.

Top that with the fact that one of the devs is experienced in REST and web
technologies (the IDIOT ;)) and that I had misread the first line of the Solr
feature page (Solr is a standalone enterprise search server with a REST-like
API), so I actually communicated that Solr provides a RESTful API.
So I think I am a bit overwhelmed by the task at hand.

To conclude, yesterday I discussed it with the team and we decided that I will
provide a RESTful web service that will hide the access to the indexers,
among other things, so even the .NET guy will be able to use it. That will
allow me to study REST and, I hope, ask clearer questions in the future.

Thanks again for your help and your patience,


Roland Everaert.




On Wed, Jun 12, 2013 at 4:18 PM, Jack Krupansky wrote:

> I'm sorry if I came across as aggressive or insulting - I'm only trying to
> dig down to what your actual difficulty is - and you have been making that
> extremely difficult for all of us. You need to help us all out here by more
> clearly expressing what your actual problem is. You will have to excuse the
> rest of us if we are unable to read your mind!
>
> It sounds as if you are an intermediary between your devs and this list.
> That's NOT a very effective communications strategy! You need to either
> have your devs communicate directly on this list, or you need to do a much
> better job of understanding what their actual problem is and then
> communicate that actual problem to this list, plainly and clearly.
>
> TRYING to read your mind (and indirectly your devs' minds as well - not an
> easy task!), and reading between the lines, it is starting to sound as if
> you (or/and your devs) are not clear on how Solr works as a "database".
>
> Core Solr does have full CRUD (Add or Create, Read or Query, Update, and
> Delete), although not in a strict, pure REST sense, that is true.
>
> A "full" update in Solr is the same as an Add - add a new, fresh document,
> and then delete the old document. Some people call this an "Upsert"
> (combination of Update or Insert).
>
> There are really two forms of update (a difficulty in REST): 1) full
> update or "replace" - equal to a delete and an add, and 2) partial or
> incremental update. True REST only has the latter.
>
> Core Solr does have support for partial or incremental Update with Atomic
> Updates. Solr will in fact retain the existing data and only update any new
> field values that are supplied on the update request.
>
> SolrCell (Extracting RequestHandler or "/update/extract") is not a core
> part of Solr. It is an add on "contrib" module. It does not have full CRUD
> - no delete, and no partial update, but it does support add and full update.
>
> As someone else already suggested, you can do the work of SolrCell
> yourself by calling Tika directly in your app layer and then sending normal
> Solr CRUD requests.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Roland Everaert
> Sent: Wednesday, June 12, 2013 5:21 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Adding pdf/word file using JSON/XML
>
> 1) Being aggressive and insulting is not a way to help people understand
> such complex tool or to help people in general.
>
> 2) I read the feature page of Solr again, and it is stated that the
> interface is REST-like and not RESTful as I thought in the first place and
> communicated to the devs. And as the devs told me, a RESTful interface
> doesn't use parameters in the URI/URL, so it is my mistake. Hence we have
> no problem with the interface as it is.
>
> Anyway, I still have a question regarding the /extract interface. It seems
> that every time a file is updated in Solr, the Lucene document is recreated
> from scratch, which means that any extra information we want to be
> indexed/stored along with the file is erased if the request doesn't contain
> it. Is there a parameter that allow 
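
The truncated question above seems to be about preserving extra fields across
/update/extract calls. SolrCell's literal.* parameters attach such fields,
but they must be re-sent on every update because the document is rebuilt from
scratch; a sketch, assuming the schema has fields named id and department:

curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&literal.department=sales&commit=true" \
  -F "myfile=@report.pdf"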

Re: Need help with search in multiple indexes

2013-06-13 Thread Toke Eskildsen
On Wed, 2013-06-12 at 23:05 +0200, smanad wrote:
> Is this a limitation of solr/lucene, should I be considering using other
> option like using Elasticsearch (which is also based on lucene)? 
> But I am sure search in multiple indexes is kind of a common problem.

You try to treat separate sources as a single index and that is tricky.
Assuming you need relevance ranking, the sources need to be homogeneous
in order for the scores to be somewhat comparable. That seems not to be
the case for you, so even if you align your schemas to get "formal"
compatibility, your ranking will be shot with Solr.

ElasticSearch has elaborate handling of this problem
http://www.elasticsearch.org/guide/reference/api/search/search-type/
and seems to be a better fit for you in this regard.
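
For completeness, Solr's own way to query several indexes at once is
distributed search with the shards parameter, which carries exactly the
score-comparability caveat described above; a sketch, assuming two cores on
one host:

http://localhost:8983/solr/core1/select?q=foo&shards=localhost:8983/solr/core1,localhost:8983/solr/core2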

- Toke Eskildsen, State and University Library, Denmark



Re: What is Difference Between Down and Gone At Admin Cloud Page?

2013-06-13 Thread Furkan KAMACI
Thanks Stefan, that is what I want.

2013/6/12 Stefan Matheis 

> The ticket for the legend is SOLR-3915, the definition came up in
> SOLR-3174:
>
>
> https://issues.apache.org/jira/browse/SOLR-3174?focusedCommentId=13255923&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13255923
>
>
> On Wednesday, June 12, 2013 at 3:54 PM, Mark Miller wrote:
>
> >
> > On Jun 12, 2013, at 3:19 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> >
> > > What is Difference Between Down and Gone At Admin Cloud Page?
> >
> > If I remember right, Down can mean the node is still actively working
> towards something - eg, without action by you, it might go into recovering
> or active state. Gone means it has given up or disappeared. It's not likely
> to make another state change without your intervention.
> >
> > - Mark
>
>