Re: useFastVectorHighlighter doesn't work

2012-05-28 Thread Ahmet Arslan
 I had a schema defined as <field name="text" type="text"
 indexed="true" stored="false" termVectors="true"
 termPositions="true" termOffsets="true"/>

You need to mark your text field as stored="true" to use
hl.useFastVectorHighlighter=true.
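For reference, a field definition satisfying that requirement might look like the sketch below (keep your own field and type names; this is illustrative, not a drop-in config):

```xml
<!-- stored="true" is required so the highlighter has text to fragment;
     the termVector attributes let FastVectorHighlighter use stored
     term vectors instead of re-analyzing the text -->
<field name="text" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Note that term vectors are written at index time, so the documents have to be re-indexed after this schema change before the highlighter can use them.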


Which time-consuming processes are executed during Solr startup?

2012-05-28 Thread Ivan Hrytsyuk
For example, we know that cache warming is executed during startup.
Are any other processes executed during Solr startup?

Thank you, Ivan


Re: Is there any relationship between size of index and indexing performance?

2012-05-28 Thread Aditya
Hi Ivan,

It depends on the number of terms it has to load. If you index a small
amount of data but store a large amount, your index size may be big while
the actual number of terms is small.

It is not directly proportional.

Regards
Aditya
www.findbestopensource.com


On Mon, May 28, 2012 at 3:00 PM, Ivan Hrytsyuk
ihryts...@softserveinc.comwrote:

 Let's assume we are indexing 1GB of data. Does size of index have any
 impact on indexing performance? I.e. will we have any difference in case of
 empty index vs 50 GB index?

 Thank you, Ivan



boost date for fresh result query

2012-05-28 Thread Jonty Rhods
Hi

I am facing a problem boosting on a date field.
I have the following field in my schema:
   <field name="release_date" type="date" indexed="true" stored="true"/>

Solr version: 3.4
I don't want to sort by date, but I want to give a 50 to 60% boost to results
which have the latest date...

The following is the query:

http://localhost:8083/solr/movie/select/?defType=dismax&q=titanic&fq=title,tags&qf=tags^2.0,title^1.0&mm=1&bf=recip(rord(release_date),1,1000,1000)

but it seems the boost has no effect, because old-dated results are still
on top, or maybe I'm getting the query syntax wrong. Is there anything
missing in the query?
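For what it's worth, the shape of that bf can be checked with a little arithmetic (a sketch: Solr's `recip(x,m,a,b)` computes `a / (m*x + b)`, and `rord(release_date)` is 1 for the newest date, larger for older documents):

```python
def recip(x, m=1, a=1000, b=1000):
    # Solr's recip(x,m,a,b) computes a / (m*x + b)
    return a / (m * x + b)

# rord(release_date) is 1 for the newest date, larger for older docs,
# so the boost decays toward 0 as documents get older
for rank in (1, 1000, 100_000):
    print(rank, recip(rank))
```

One thing to keep in mind: with dismax, `bf` is *added* to the main query score, so when the relevance scores are much larger than 1.0, a boost in the 0..1 range is easily swamped; that is a common reason a date boost appears to do nothing.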

Please help me get this function working.

thanks

regards
Jonty


Re: Is there any relationship between size of index and indexing performance?

2012-05-28 Thread bilal dadanlar
Indexing performance mostly depends on the number of docs,
but when you are optimizing, a large index takes a bit more time.

On Mon, May 28, 2012 at 12:48 PM, Aditya findbestopensou...@gmail.comwrote:

 Hi Ivan,

 It depends on the number of terms it has to load. If you index a small
 amount of data but store a large amount, your index size may be big while
 the actual number of terms is small.

 It is not directly proportional.

 Regards
 Aditya
 www.findbestopensource.com


 On Mon, May 28, 2012 at 3:00 PM, Ivan Hrytsyuk
 ihryts...@softserveinc.comwrote:

  Let's assume we are indexing 1GB of data. Does size of index have any
  impact on indexing performance? I.e. will we have any difference in case
 of
  empty index vs 50 GB index?
 
  Thank you, Ivan
 




-- 
Bilal Dadanlar
fizy.com | Yazılım Mühendisi


Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Raphaël
On Sun, May 27, 2012 at 11:54:02PM -0400, Jack Krupansky wrote:
 You can create your own update processor that gets control between the 
 output of Tika and the indexing of the document.
 
 See:
 http://wiki.apache.org/solr/UpdateRequestProcessor

Seems to be exactly what I was looking for, thanks a lot !

I just started an (almost working) implementation, but I have one note:

Let's get a field's valueS:
 Collection<Object> v = doc.getFieldValues("author");
( in my `processAdd(AddUpdateCommand cmd)` )

and push a doc, say using:
 `curl -F "content=@my.pdf" -F "literal.author=a" -F "literal.author=b" -F
 "literal.author=c d"`

Then `log.warn("author: " + v + ":" + v.size());` logs:
 WARN: author: [pdfauthor, a b c d] : 2

It's not (yet) a blocker in my personal case, but I fear it's important
enough to be noted: using a custom UpdateRequestProcessor, the access to
individual literal fields seems (currently) very limited, as they appear
to be flattened. I'm quite sure there is already a bug report about this
hidden somewhere.


Other than that, and unless I hit some other unexpected issue, this way
of customizing the request processor suits my needs perfectly.


thanks !


indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hi all.

I am in the process of setting up Solr for my application,
which is full text search on a bunch of tweets from twitter.

I am afraid I am missing something.
From the book I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).
A section named "Indexing documents with Solr Cell" seems to address my problem,
but it also shows that before getting to Solr, I might need to use
another Apache tool called Tika.

Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I need to also put Tika in my pipeline?

Best regards,
Giovanni Gherdovich


Re: indexing unstructured text (tweets)

2012-05-28 Thread Dmitry Kan
Hi,

You want to use Tika, if you have your data in some binary format, like pdf
or excel. It extracts text from the binary for you. If you just want to
index the text contents of tweets (including web links etc), using just
off-the-shelf Solr is enough. You'll have to wrap your text input (per each
tweet I would assume) into an xml or other supported structured format in
accordance with the schema that you have defined. So at minimum, you would
have two fields: a unique id of a document and its textual contents (a
tweet). So design your schema first, create (e.g.) XML with the documents
to add, and post them to Solr.
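As a rough sketch of what this wrapping could look like (the field names `id` and `tweet_text` are invented here and must match whatever your schema.xml actually defines):

```python
from xml.sax.saxutils import escape

def tweets_to_solr_xml(tweets):
    """Wrap (id, text) pairs in Solr's <add><doc> update format,
    escaping XML special characters in the tweet text."""
    docs = []
    for tweet_id, text in tweets:
        docs.append(
            f'<doc>'
            f'<field name="id">{escape(tweet_id)}</field>'
            f'<field name="tweet_text">{escape(text)}</field>'
            f'</doc>'
        )
    return "<add>" + "".join(docs) + "</add>"

xml = tweets_to_solr_xml([("1", "Solr & tweets <3")])
```

The resulting document can then be POSTed to Solr's /update handler (followed by a commit), e.g. with curl or the post.jar tool that ships with the Solr examples.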

Dmitry

On Mon, May 28, 2012 at 2:37 PM, Giovanni Gherdovich g.gherdov...@gmail.com
 wrote:

 Hi all.

 I am in the process of setting up Solr for my application,
 which is full text search on a bunch of tweets from twitter.

 I am afraid I am missing something.
 From the book I am reading, "Apache Solr 3 Enterprise Search Server",
 it looks like Solr works with structured input, like XML or CSV,
 while I have the most wild and unstructured input ever (tweets).
 A section named "Indexing documents with Solr Cell" seems to address my problem,
 but it also shows that before getting to Solr, I might need to use
 another Apache tool called Tika.

 Can anybody provide a brief explanation of the general picture?
 Can I index my tweets with Solr?
 Or do I need to also put Tika in my pipeline?

 Best regards,
 Giovanni Gherdovich




-- 
Regards,

Dmitry Kan


Re: indexing unstructured text (tweets)

2012-05-28 Thread David Radunz

Hey,

I think you might be over-thinking this. Tweets are structured. You 
have the content (tweet), the user who tweeted it and various other meta 
data. So your 'document' might look like this:


<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet">I bought some apples</field>
<field name="user">JohnnyBoy</field>
</doc>
</add>

To get this structure, you can use any programming language you're 
comfortable with and load it into Solr via various means. Obviously you 
can add more 'meta' fields that you get from twitter if you want as well.


David

On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:

Hi all.

I am in the process of setting up Solr for my application,
which is full text search on a bunch of tweets from twitter.

I am afraid I am missing something.
 From the book I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).
A section named "Indexing documents with Solr Cell" seems to address my problem,
but it also shows that before getting to Solr, I might need to use
another Apache tool called Tika.

Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I need to also put Tika in my pipeline?

Best regards,
Giovanni Gherdovich




Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hello Dmitry and David,

2012/5/28 Dmitry Kan dmitry@gmail.com:
 [...] If you just want to
 index the text contents of tweets (including web links etc), using just
 off-the-shelf Solr is enough. You'll have to wrap your text input (per each
 tweet I would assume) into an xml [...]
  So design your schema first, create (e.g.) XML with the documents
 to add, and post them to Solr.

2012/5/28 David Radunz da...@boxen.net:
 Hey,

I think you might be over-thinking this. [...]  So
 your 'document', might look like this: [...]

Thank you for your feedbacks. I'll take it easy and do as you suggest.

Cheers,
Giovanni


Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-28 Thread Nagendra Nagarajayya
It is a single node. I am trying to find out if the performance can be 
referenced.


Regarding information on Solr with RankingAlgorithm, you can find all 
the information here:


http://solr-ra.tgels.org

On RankingAlgorithm:

http://rankingalgorithm.tgels.org

Regards,
- NN

On 5/27/2012 4:50 PM, Li Li wrote:

yes, I am also interested in good performance with 2 billion docs. how
many search nodes do you use? what's the average response time and qps?

another question: where can I find a related paper or resources about your
algorithm that explain it in detail? why is it better than Google's?
(better than Lucene is not very interesting, because Lucene was not
originally designed to provide a search function like Google's)

On Mon, May 28, 2012 at 1:06 AM, Darren Govonidar...@ontrenet.com  wrote:

I think people on this list would be more interested in your approach to
scaling 2 billion documents than modifying solr/lucene scoring (which is
already top notch). So given that, can you share any references or
otherwise substantiate good performance with 2 billion documents?

Thanks.

On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:

Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
docs. With RankingAlgorithm 1.4.3, using the parameters
age=latest&docs=number feature, you can retrieve the NRT inserted
documents in milliseconds from such a huge index improving query and
faceting performance and using very little resources ...

Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and
the NRT insert performance with Solr 4.0 is about 70,000 docs / sec.
RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org



On 5/27/2012 7:32 AM, Darren Govoni wrote:

Hi,
Have you tested this with a billion documents?

Darren

On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:

Hi!

I am very excited to announce the availability of Solr 3.6 with
RankingAlgorithm 1.4.2.

This NRT support now works with both RankingAlgorithm and Lucene. The
insert/update performance should be about 5000 docs in about 490 ms with
the MbArtists Index.

RankingAlgorithm 1.4.2 has multiple algorithms, improved performance
over the earlier releases, supports the entire Lucene Query Syntax, ±
and/or boolean queries and can scale to more than a billion documents.

You can get more information about NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

ps. MbArtists index is the example index used in the Solr 1.4 Enterprise
Book












Negative value in numFound

2012-05-28 Thread tosenthu
Hi 

I have an index of size 1 TB. I prepared this by setting up a background
script to index records. The index was fine for the last 2 days, and I have not
disturbed the process. Suddenly when I queried the index I got this
response, where the value of numFound is negative. Can anyone say why/how
this occurs, and also the cure?

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">*:*</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="-2125433775" start="0"/>
</response>

Regards
Senthil Kumar M R

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative value in numFound

2012-05-28 Thread ku3ia
Hi!
Can you please show your hardware parameters, the version of Solr you're
using, and your schema.xml file?

thanks.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986408.html


Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-28 Thread Darren Govoni
I don't recall anyone being able to get acceptable performance with a
single index that large with solr/lucene. The conventional wisdom is
that parallel searching across cores (or shards in SolrCloud) is the
best way to handle index sizes in the billions. So it's of great
interest how you did it.

Anyone else gotten an index(es) with billions of documents to perform
well? I'm greatly interested in how.

On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote:
 It is a single node. I am trying to find out if the performance can be 
 referenced.
 
 Regarding information on Solr with RankingAlgorithm, you can find all 
 the information here:
 
 http://solr-ra.tgels.org
 
 On RankingAlgorithm:
 
 http://rankingalgorithm.tgels.org
 
 Regards,
 - NN
 
 On 5/27/2012 4:50 PM, Li Li wrote:
  yes, I am also interested in good performance with 2 billion docs. how
  many search nodes do you use? what's the average response time and qps?
 
  another question: where can I find a related paper or resources about your
  algorithm that explain it in detail? why is it better than Google's?
  (better than Lucene is not very interesting, because Lucene was not
  originally designed to provide a search function like Google's)
 
  On Mon, May 28, 2012 at 1:06 AM, Darren Govonidar...@ontrenet.com  wrote:
  I think people on this list would be more interested in your approach to
  scaling 2 billion documents than modifying solr/lucene scoring (which is
  already top notch). So given that, can you share any references or
  otherwise substantiate good performance with 2 billion documents?
 
  Thanks.
 
  On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
  Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
  docs. With RankingAlgorithm 1.4.3, using the parameters
  age=latest&docs=number feature, you can retrieve the NRT inserted
  documents in milliseconds from such a huge index improving query and
  faceting performance and using very little resources ...
 
  Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and
  the NRT insert performance with Solr 4.0 is about 70,000 docs / sec.
  RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.
 
  Regards,
 
  Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.org
 
 
 
  On 5/27/2012 7:32 AM, Darren Govoni wrote:
  Hi,
  Have you tested this with a billion documents?
 
  Darren
 
  On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
  Hi!
 
  I am very excited to announce the availability of Solr 3.6 with
  RankingAlgorithm 1.4.2.
 
  This NRT support now works with both RankingAlgorithm and Lucene. The
  insert/update performance should be about 5000 docs in about 490 ms with
  the MbArtists Index.
 
  RankingAlgorithm 1.4.2 has multiple algorithms, improved performance
  over the earlier releases, supports the entire Lucene Query Syntax, ±
  and/or boolean queries and can scale to more than a billion documents.
 
  You can get more information about NRT performance from here:
  http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
 
  You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
  http://solr-ra.tgels.org
 
  Please download and give the new version a try.
 
  Regards,
 
  Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.org
 
  ps. MbArtists index is the example index used in the Solr 1.4 Enterprise
  Book
 
 
 
 
 
 




Re: Negative value in numFound

2012-05-28 Thread tosenthu
The details are below

Solr : 3.5
Using a schema file with 53 fields, 8 of which are indexed.
OS : CentOS 5.4 64 Bit
Java : 1.6.0 64 Bit
Apache Tomcat : 7.0.22

Intel(R) Xeon(R) CPU L5518  @ 2.13GHz (16 Processors)

/dev/mapper/index 5.9T  1.9T  4.0T  33% /Index

I had around 2 billion records when I queried it last time (2 days back).

Do I need to run the checkIndex tool?

Regards
Senthil Kumar M R

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986421.html


Re: indexing unstructured text (tweets)

2012-05-28 Thread Jack Krupansky
Other obvious metadata from the Twitter API to index would be hashtags, user 
mentions (both the user id/screen name and user name), date/time, urls 
mentioned (expanded if a URL shortener is used), and possibly coordinates 
for spatial search.


You would have to add all these fields and values yourself in your Solr 
input document. Tika can't help you there.


Although, I imagine quite a few people have already done this quite a few 
times before, so maybe somebody could contribute their Twitter Solr schema. 
Anybody?


-- Jack Krupansky

-Original Message- 
From: David Radunz

Sent: Monday, May 28, 2012 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)

Hey,

I think you might be over-thinking this. Tweets are structured. You
have the content (tweet), the user who tweeted it and various other meta
data. So your 'document', might look like this:

<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet">I bought some apples</field>
<field name="user">JohnnyBoy</field>
</doc>
</add>

To get this structure, you can use any programming language you're
comfortable with and load it into Solr via various means. Obviously you
can add more 'meta' fields that you get from twitter if you want as well.

David

On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:

Hi all.

I am in the process of setting up Solr for my application,
which is full text search on a bunch of tweets from twitter.

I am afraid I am missing something.
 From the book I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).
A section named "Indexing documents with Solr Cell" seems to address my problem,
but it also shows that before getting to Solr, I might need to use
another Apache tool called Tika.

Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I need to also put Tika in my pipeline?

Best regards,
Giovanni Gherdovich 




Re: Accent Characters

2012-05-28 Thread couto.vicente
Hi, Jack.
First of all, thank you for your help.
Well, I tried again and realized that my problem is not really with Solr.
I ran this query against Solr after starting it up with the command java
-jar start.jar:
http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9sentaspellcheck=truespellcheck.collate=truerows=0spellcheck.count=10

It gives me the result:
<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">31</int>
 </lst>
 <result name="response" numFound="0" start="0"/>
 <lst name="spellcheck">
  <lst name="suggestions">
   <lst name="présenta">
    <int name="numFound">10</int>
    <int name="startOffset">8</int>
    <int name="endOffset">16</int>
    <arr name="suggestion">
     <str>présente</str>
     <str>présent</str>
     <str>présenté</str>
     <str>présents</str>
     <str>présentant</str>
     <str>présentera</str>
     <str>présentait</str>
     <str>présentes</str>
     <str>présenter</str>
     <str>présentée</str>
    </arr>
   </lst>
   <str name="collation">content:présente</str>
  </lst>
 </lst>
</response>

And I ran exactly the same query after deploying solr.war in Tomcat 7. Here
is my result:
<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">16</int>
 </lst>
 <result name="response" numFound="0" start="0"/>
 <lst name="spellcheck">
  <lst name="suggestions">
   <lst name="présenta">
    <int name="numFound">10</int>
    <int name="startOffset">8</int>
    <int name="endOffset">16</int>
    <arr name="suggestion">
     <str>present</str>
     <str>prbsent</str>
     <str>presentant</str>
     <str>presentait</str>
     <str>puisent</str>
     <str>pasent</str>
     <str>pensent</str>
     <str>posent</str>
     <str>dresent</str>
     <str>resenti</str>
    </arr>
   </lst>
   <str name="collation">content:present</str>
  </lst>
 </lst>
</response>

As my application is running under Tomcat, it means that I have some issue
with Tomcat. The weird part is that I already googled for a fix and found
out that we have to set a parameter in the server.xml Tomcat config file:

<Connector port="5443" protocol="HTTP/1.1"
           connectionTimeout="2"
           redirectPort="8443"
           URIEncoding="UTF-8" />

But it's not working, as you can see.
I'm feeling a little stupid because it doesn't look like a big problem. For
sure people around the world are using Solr with accented queries running
under Tomcat properly!
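The symptom is consistent with the container decoding the UTF-8 percent-escapes as ISO-8859-1 (Tomcat's default URIEncoding), so the analyzed query term no longer matches the indexed terms. A quick sketch of the round trip (illustrative only; it doesn't prove this is what your Tomcat is doing):

```python
from urllib.parse import quote

# a client sends "présenta" as UTF-8 percent-escapes
encoded = quote("présenta")

# if the container decodes those bytes as ISO-8859-1 instead of UTF-8,
# the é (0xC3 0xA9 in UTF-8) turns into two mojibake characters
mangled = "présenta".encode("utf-8").decode("iso-8859-1")
print(encoded, mangled)
```

Since URIEncoding only affects the query string of GET requests, it is also worth checking on which connector (port) the request actually arrives and whether the client really sends UTF-8.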

Thank you

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986423.html


Re: indexing unstructured text (tweets)

2012-05-28 Thread Anuj Kumar
This is a bit old but provides good information for schema design-
http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php

Found this link as well- https://gist.github.com/702360

The types of the field may depend on the search requirements.

Regards,
Anuj

On Mon, May 28, 2012 at 7:21 PM, Jack Krupansky j...@basetechnology.comwrote:

 Other obvious metadata from the Twitter API to index would be hashtags,
 user mentions (both the user id/screen name and user name), date/time, urls
 mentioned (expanded if a URL shortener is used), and possibly coordinates
 for spatial search.

 You would have to add all these fields and values yourself in your Solr
 input document. Tika can't help you there.

 Although, I imagine quite a few people have already done this quite a few
 times before, so maybe somebody could contribute their Twitter Solr schema.
 Anybody?

 -- Jack Krupansky

 -Original Message- From: David Radunz
 Sent: Monday, May 28, 2012 8:00 AM
 To: solr-user@lucene.apache.org
 Subject: Re: indexing unstructured text (tweets)


 Hey,

I think you might be over-thinking this. Tweets are structured. You
 have the content (tweet), the user who tweeted it and various other meta
 data. So your 'document', might look like this:

 <add>
 <doc>
 <field name="tweetId">ABCD1234</field>
 <field name="tweet">I bought some apples</field>
 <field name="user">JohnnyBoy</field>
 </doc>
 </add>

 To get this structure, you can use any programming language you're
 comfortable with and load it into Solr via various means. Obviously you
 can add more 'meta' fields that you get from twitter if you want as well.

 David

 On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:

 Hi all.

 I am in the process of setting up Solr for my application,
 which is full text search on a bunch of tweets from twitter.

 I am afraid I am missing something.
  From the book I am reading, "Apache Solr 3 Enterprise Search Server",
 it looks like Solr works with structured input, like XML or CSV,
 while I have the most wild and unstructured input ever (tweets).
 A section named "Indexing documents with Solr Cell" seems to address my problem,
 but it also shows that before getting to Solr, I might need to use
 another Apache tool called Tika.

 Can anybody provide a brief explanation of the general picture?
 Can I index my tweets with Solr?
 Or do I need to also put Tika in my pipeline?

 Best regards,
 Giovanni Gherdovich





Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hello Jack, hi all,

2012/5/28 Jack Krupansky j...@basetechnology.com:
 Other obvious metadata from the Twitter API to index would be hashtags, user
 mentions (both the user id/screen name and user name), date/time, urls
 mentioned (expanded if a URL shortener is used), and possibly coordinates
 for spatial search.

You raise good points here.

Just to understand better how it works in Solr:
say that we have a tweet that makes use of a hashtag and
mentions another user. I don't know how this would actually
appear coming from the Twitter Streaming API,
and I am assuming that, at least the tweet itself
(excluding date/time and stuff), is just raw text,
like

Hey @alex1987, thank you for telling me how cool is #rubyonrails

So: in order to make Solr understand that here we have
a mention to a user (@alex1987) and a hashtag (#rubyonrails)
I have to format it myself and include that info
in my own XML schema, and preprocess the tweet in
order to get to

<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet_text">Hey @alex1987, thank you for telling me how
cool is #rubyonrails</field>
<field name="user">happyRubyist</field>
<field name="mentions">alex1987</field>
<field name="hashtags">rubyonrails</field>
</doc>
</add>

Correct?
I have to preprocess and make those fields explicit if I want
them to be indexed as metadata, right?
I am asking since I am new to Solr.

 Although, I imagine quite a few people have already done this quite a few
 times before, so maybe somebody could contribute their Twitter Solr schema.
 Anybody?

Oh that would be nice :-)

Cheers,
Giovanni


Re: indexing unstructured text (tweets)

2012-05-28 Thread Jack Krupansky
The Twitter API extracts hash tags and user mentions for you, in addition to 
giving you the full raw text. You'll have to read up on the Twitter API.
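For illustration, a trimmed sketch of a tweet payload with its "entities" block, and how the fields could be pulled out for Solr (the field names follow the documented entities structure, but the values are invented; consult the Twitter API docs for the full shape):

```python
import json

tweet = json.loads("""
{
  "id_str": "ABCD1234",
  "text": "Hey @alex1987, thank you for telling me how cool is #rubyonrails",
  "user": {"screen_name": "happyRubyist"},
  "entities": {
    "hashtags": [{"text": "rubyonrails"}],
    "user_mentions": [{"screen_name": "alex1987"}]
  }
}
""")

# pull the pre-extracted entities instead of re-parsing the raw text
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
```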


-- Jack Krupansky

-Original Message- 
From: Giovanni Gherdovich

Sent: Monday, May 28, 2012 10:09 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)

Hello Jack, hi all,

2012/5/28 Jack Krupansky j...@basetechnology.com:
Other obvious metadata from the Twitter API to index would be hashtags, user
mentions (both the user id/screen name and user name), date/time, urls
mentioned (expanded if a URL shortener is used), and possibly coordinates
for spatial search.


You raise good points here.

Just to understand better how it works in Solr:
say that we have a tweet that makes use of a hashtag and
mentions another user. I don't know how this would actually
appear coming from the Twitter Streaming API,
and I am assuming that, at least the tweet itself
(excluding date/time and stuff), is just raw text,
like

Hey @alex1987, thank you for telling me how cool is #rubyonrails

So: in order to make Solr understand that here we have
a mention to a user (@alex1987) and a hashtag (#rubyonrails)
I have to format it myself and include that info
in my own XML schema, and preprocess the tweet in
order to get to

<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet_text">Hey @alex1987, thank you for telling me how
cool is #rubyonrails</field>
<field name="user">happyRubyist</field>
<field name="mentions">alex1987</field>
<field name="hashtags">rubyonrails</field>
</doc>
</add>

Correct?
I have to preprocess and make those fields explicit if I want
them to be indexed as metadata, right?
I am asking since I am new to Solr.


Although, I imagine quite a few people have already done this quite a few
times before, so maybe somebody could contribute their Twitter Solr schema.

Anybody?


Oh that would be nice :-)

Cheers,
Giovanni 



Re: Negative value in numFound

2012-05-28 Thread ku3ia
Hm... Have you any errors in logs? During search, during indexing?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986426.html


Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Jack Krupansky
... the access to individual literal fields seems (currently) very limited 
as they appear to be flattened.


That is a feature of SolrCell: flattening multiple values for a 
non-multi-valued field into a string concatenation of the values.


All you need to do is add multiValued="true" to the author field in your 
schema.xml:


<field name="author" type="text_general" indexed="true" stored="true"/>

becomes

<field name="author" type="text_general" indexed="true" stored="true" 
multiValued="true"/>


-- Jack Krupansky

-Original Message- 
From: Raphaël

Sent: Monday, May 28, 2012 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: UpdateRequestProcessor : flattened values

On Sun, May 27, 2012 at 11:54:02PM -0400, Jack Krupansky wrote:

You can create your own update processor that gets control between the
output of Tika and the indexing of the document.

See:
http://wiki.apache.org/solr/UpdateRequestProcessor


Seems to be exactly what I was looking for, thanks a lot !

I just started an (almost working) implementation, but I have one note:

Let's get a field valueS:

Collection<Object> v = doc.getFieldValues("author");

( in my `processAdd(AddUpdateCommand cmd)` )

and push a doc, say using:
`curl -F "content=@my.pdf" -F "literal.author=a" -F "literal.author=b" -F 
"literal.author=c d"`


Then `log.warn("author: " + v + ":" + v.size());` logs:

WARN: author: [pdfauthor, a b c d] : 2


It's not (yet) a blocker in my personal case but I fear it's important
enough to be noted: using a custom UpdateRequestProcessor, the access to
individual literal fields seems (currently) very limited, as they appear
to be flattened. I'm quite sure there is already a bug report about this
hidden somewhere.


Other than that, and unless I hit some other unexpected issue, this way
of customizing the request processor suits my needs perfectly.


thanks ! 



Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hello Jack and Anuj,

2012/5/28 Jack Krupansky j...@basetechnology.com:
 The Twitter API extracts hash tags and user mentions for you, in addition to
 giving you the full raw text. You'll have to read up on the Twitter API.

That's what I thought just after hitting "send" on the message above ;-)
I am pretty sure the Twitter API format maps very nicely to a suitable
input format for Solr, if not already being good for direct
feeding into Solr.

I am a bit unlucky here because I have been provided with
only the raw text for about 1.5 million tweets; so I would have
to write a few lines of code to restore at least user mentions,
hashtags and URLs.


2012/5/28 Anuj Kumar anujs...@gmail.com:
 This is a bit old but provides good information for schema design-
 http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php

 Found this link as well- https://gist.github.com/702360

 The types of the field may depend on the search requirements.

Anuj, you provide very interesting links here, thanks,
even though those kinds of specifics might already be present
in the Twitter API docs.
After I'm done with my first Solr setup, I might
set up the whole pipeline (getting the Twitter feeds myself)
on my machines, so that I can exploit the whole
information content provided by Twitter.

Cheers,
Giovanni


Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky

Is this for a single-shard or multi-shard index?

There is a 2^31-1 limit for a single Lucene index, since document numbers are 
ints (32-bit signed in Java) in Lucene, but with Solr shards you can have a 
multiple of that, based on the number of shards.


If you are multi-shard, maybe one of the shards grew too large.
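A back-of-the-envelope check supports the overflow theory: reinterpreting the reported negative numFound as an unsigned 32-bit count lands right around the ~2 billion records mentioned elsewhere in the thread (a sketch, not proof of what happened inside Lucene):

```python
import struct

def as_java_int(n):
    """Reinterpret an unsigned 32-bit count as a Java signed int."""
    return struct.unpack(">i", struct.pack(">I", n & 0xFFFFFFFF))[0]

# a doc count just past 2**31 - 1 wraps negative when read back as signed...
print(as_java_int(2**31 - 1), as_java_int(2**31))

# ...and the reported numFound of -2125433775 corresponds to ~2.17 billion docs
print(as_java_int(2_169_533_521))
```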

-- Jack Krupansky

-Original Message- 
From: tosenthu

Sent: Monday, May 28, 2012 8:15 AM
To: solr-user@lucene.apache.org
Subject: Negative value in numFound

Hi

I have an index of size 1 TB. I prepared this by setting up a background
script to index records. The index was fine for the last 2 days, and I have not
disturbed the process. Suddenly when I queried the index I got this
response, where the value of numFound is negative. Can anyone say why/how
this occurs, and also the cure?

response
   lst name=responseHeader
   int name=status0/int
   int name=QTime0/int
   lst name=params
   str name=indenton/str
   str name=start0/str
   str name=q*:*/str
   str name=version2.2/str
   str name=rows10/str
   /lst
   /lst
   result name=response numFound=-2125433775 start=0/
/response

Regards
Senthil Kumar M R

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: indexing unstructured text (tweets)

2012-05-28 Thread Jack Krupansky
Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user 
names and hash tags:


http://saturnboy.com/2010/02/parsing-twitter-with-regexp/

-- Jack Krupansky

-Original Message- 
From: Giovanni Gherdovich

Sent: Monday, May 28, 2012 10:35 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)

Hello Jack and Anuj,

2012/5/28 Jack Krupansky j...@basetechnology.com:
The Twitter API extracts hash tag and user mentions for you, in addition 
to

giving you the full raw text. You'll have to read up on the Twitter API.


That's what I thought just after hitting send on the message above ;-)
I am pretty sure the Twitter API format maps very nicely to a suitable
input format for Solr, if not even being already good for direct
feeding into Solr.

I am a bit unlucky here because I have been provided with
only the raw text for about 1.5 million tweets; so I would have
to write a few lines of code to restore at least user mentions,
hashtags and URLs.


2012/5/28 Anuj Kumar anujs...@gmail.com:

This is a bit old but provides good information for schema design-
http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php

Found this link as well- https://gist.github.com/702360

The types of the field may depend on the search requirements.


Anuj, you provide very interesting links here, thanks,
even though those kinds of specifics might already be present
in the Twitter API doc.
After I'll be done with my first Solr setup, I might
setup the whole pipeline (getting the Twitter feeds myself)
on my machines, so that I can exploit the whole
information content provided by Twitter.

Cheers,
Giovanni 



Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
2012/5/28 Jack Krupansky j...@basetechnology.com:
 Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user
 names and hash tags:

 http://saturnboy.com/2010/02/parsing-twitter-with-regexp/

Awesome!

thank you very much Jack.

GGhh
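For readers without PHP, a minimal sketch of the same idea in Python. The patterns here are deliberately simplified (real Twitter entity extraction handles many more edge cases, as the post linked above discusses), and the sample tweet is made up:

```python
import re

# Simplified patterns; real Twitter entity rules cover dotted usernames,
# internationalized hashtags, t.co links, etc.
MENTION = re.compile(r'@(\w+)')
HASHTAG = re.compile(r'#(\w+)')
URL = re.compile(r'https?://\S+')

def parse_tweet(text):
    """Return (hashtags, mentions, urls) extracted from raw tweet text."""
    return HASHTAG.findall(text), MENTION.findall(text), URL.findall(text)

print(parse_tweet(
    "Trying #solr for search, thanks @apache! http://lucene.apache.org"))
# (['solr'], ['apache'], ['http://lucene.apache.org'])
```

The extracted lists can then be sent to Solr as separate multi-valued fields alongside the raw text.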


Re: indexing unstructured text (tweets)

2012-05-28 Thread Gora Mohanty
On 28 May 2012 20:12, Jack Krupansky j...@basetechnology.com wrote:
 Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user
 names and hash tags:

 http://saturnboy.com/2010/02/parsing-twitter-with-regexp/
[...]

One could also use the Solr DataImportHandler, and
RegexTransformer to do the job:
http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

Regards,
Gora


Re: Negative value in numFound

2012-05-28 Thread tosenthu
There was an Out Of Memory error, but the indexing was still proceeding.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986437.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative value in numFound

2012-05-28 Thread ku3ia
In some cases multi-shard architecture might significantly slow down the
search process at this index size...

By the way, how much RAM do you use?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986438.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative value in numFound

2012-05-28 Thread tosenthu
Hi

It is a multicore setup, but even when I searched with the shards query I
still got this response:

<result name="response" numFound="-390662429" start="0"/>

which is again a negative value.

The total number of records may be > 2147483647 (2^31-1), but is this
limitation documented anywhere? What is the strategy to overcome this
situation? My application is expected to hold 12 billion records, so
please suggest a strategy for my situation.

Regards
Senthil Kumar M R
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986439.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky

OOM is a problem.

You need more RAM and more machines, and maybe more  shards.

-- Jack Krupansky

-Original Message- 
From: tosenthu

Sent: Monday, May 28, 2012 11:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Negative value in numFound

There was an Out Of Memory error, but the indexing was still proceeding.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986437.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky

numFound="-390662429"

That suggests that you have at least two shards which each have > 2G docs
(2^31-1).


How many shards do you have and how big do you think they should be in terms 
of number of documents?


Are you being careful to distribute your update requests between shards so 
that no shard grows too large? That gets back to the preceding question.


-- Jack Krupansky

-Original Message- 
From: tosenthu

Sent: Monday, May 28, 2012 11:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Negative value in numFound

Hi

It is a multicore setup, but even when I searched with the shards query I
still got this response:

<result name="response" numFound="-390662429" start="0"/>

which is again a negative value.

The total number of records may be > 2147483647 (2^31-1), but is this
limitation documented anywhere? What is the strategy to overcome this
situation? My application is expected to hold 12 billion records, so
please suggest a strategy for my situation.

Regards
Senthil Kumar M R



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986439.html
Sent from the Solr - User mailing list archive at Nabble.com. 
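To make the wraparound concrete: numFound is a Java int, so once a combined count passes 2^31-1 it silently goes negative. The totals below are hypothetical, chosen only because they reproduce the two negative values reported in this thread (illustrative Python, not Solr code):

```python
def to_java_int(n):
    """Wrap an arbitrary integer into Java's signed 32-bit int range,
    the way an overflowed document count silently wraps."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# Hypothetical combined shard counts; the real per-shard counts are unknown.
print(to_java_int(2_169_533_521))  # -2125433775 (the single-query result)
print(to_java_int(3_904_304_867))  # -390662429  (the shards-query result)
```

Working backwards this way suggests the index really holds roughly 2.2 and 3.9 billion documents respectively, well past the signed 32-bit ceiling.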



Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Raphaël
On Mon, May 28, 2012 at 10:30:03AM -0400, Jack Krupansky wrote:
 ... the access to individual literal fields seems (currently) very limited 
 as they appear to be flattened.
 
 That is a feature of SolrCell, to flatten multiple values for a 
 non-multi-valued field into a string concatenation of the values.
 
 All you need to do is add multiValued="true" to the author field in your 
 schema.xml:

Indeed, it both works and makes sense, though I'd have thought flattening
happened later in the process.

Again, thanks for your precious help.


Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Jack Krupansky
And it might make sense to have a multi-value flattening attribute for 
Solr itself rather than in SolrCell.


-- Jack Krupansky

-Original Message- 
From: Raphaël

Sent: Monday, May 28, 2012 12:56 PM
To: solr-user@lucene.apache.org
Subject: Re: UpdateRequestProcessor : flattened values

On Mon, May 28, 2012 at 10:30:03AM -0400, Jack Krupansky wrote:
... the access to individual literal fields seems (currently) very 
limited

as they appear to be flattened.

That is a feature of SolrCell, to flatten multiple values for a
non-multi-valued field into a string concatenation of the values.

All you need to do is add multiValued="true" to the author field in
your

schema.xml:


Indeed, it both works and makes sense, though I'd have thought flattening
happened later in the process.

Again, thanks for your precious help. 



Re: Negative value in numFound

2012-05-28 Thread tosenthu
The RAM is about 14.5 GB, allocated to Tomcat.

I now have 2 shards. I was under the impression I could handle it with a
couple of shards, but in this case I need shards which can each only grow
to 2^31-1 records, and many such shards to support 12 billion records.

I will try to have more cores and distribute updates between them. Then
comes my next question: is there any configuration to make a core reject
updates based on the number of records? And is there a possibility to
split an index into 2 or more based on a query?

In any case, my network will have 2 Solr servers participating in indexing
and search. Probably I need at least 6 cores distributed across these
machines to support 12 billion records. What do you say?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986453.html
Sent from the Solr - User mailing list archive at Nabble.com.


xpathentityprocessor not import all documents

2012-05-28 Thread Sagar Joshi
I have XML files that need to be imported into Solr.
The XML looks like this:

<root>
  <doc>
    <id>1</id>
    <name>albert</name>
    <add>LA</add>
  </doc>
  <doc>
    <id>2</id>
    <name>john</name>
    <add>NY</add>
  </doc>
</root>

The XML file path is in a SQL database, so I have created the
DataImportHandler config as below:

<dataConfig>
  <dataSource name="ds1" ... />
  <dataSource type="FileDataSource" name="FD"/>
  <document name="Emp">
    <entity name="FilePath" query="SelectFilePathFromDB">
      <entity name="xmlEntity" onError="continue" rootEntity="true"
              processor="XPathEntityProcessor" forEach="/root/doc"
              url="${FilePath.Path}" dataSource="FD">
        <field xpath="root/doc/id" column="id"/>
        <field xpath="root/doc/name" column="name"/>
        <field xpath="root/doc/add" column="add"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Now when I do a full import, it adds only one document; but when I remove
the FilePath entity and give a static path in url, it imports all
documents properly.

Where am I making a mistake?


Sagar Joshi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/xpathentityprocessor-not-import-all-documents-tp3986441.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky
I think 100 million documents is a realistic number for a single shard. 
Maybe 250 million depending on your data. But I would say that beyond that 
is being unrealistic. In some cases, even 50 million might be too much for a 
single shard, depending on the data and query usage. Sure, maybe depending 
on your data 2 billion documents might work, but I wouldn't bet on it. And 
even if you manage to index 500 million or more documents on a single shard, 
memory and performance for production query loads would be questionable. 
Query capacity also depends on things like number of faceted fields (i.e., 
the field cache), string field size, number of unique terms in each field, 
solr query cache, and highlighting of large fields. Not to mention wanting 
to have enough capacity so that the number of documents can grow over time.


As an experiment, index 250 million documents in one shard and see how 
typical queries perform, and how much JVM memory you use and still have 
available. Make sure to try quite a few queries (using a script), especially 
if any fields are faceted or highlighted. Then you can decide whether you 
feel comfortable trying a larger shard size or if a smaller size is needed.


-- Jack Krupansky

-Original Message- 
From: tosenthu

Sent: Monday, May 28, 2012 1:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Negative value in numFound

The RAM is about 14.5 GB, allocated to Tomcat.

I now have 2 shards. I was under the impression I could handle it with a
couple of shards, but in this case I need shards which can each only grow
to 2^31-1 records, and many such shards to support 12 billion records.

I will try to have more cores and distribute updates between them. Then
comes my next question: is there any configuration to make a core reject
updates based on the number of records? And is there a possibility to
split an index into 2 or more based on a query?

In any case, my network will have 2 Solr servers participating in indexing
and search. Probably I need at least 6 cores distributed across these
machines to support 12 billion records. What do you say?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986453.html
Sent from the Solr - User mailing list archive at Nabble.com. 
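For concreteness, the arithmetic behind the two bounds discussed in this thread. The 250M-per-shard figure is the suggestion above, not a hard rule; the hard ceiling is Lucene's signed 32-bit document number (illustrative Python):

```python
import math

TOTAL_DOCS = 12_000_000_000          # expected corpus size from this thread
RECOMMENDED_PER_SHARD = 250_000_000  # suggested practical upper bound
HARD_LIMIT_PER_SHARD = 2**31 - 1     # Lucene's int document-number ceiling

# At the absolute limit, 6 shards would each sit right at ~2.1B docs.
print(math.ceil(TOTAL_DOCS / HARD_LIMIT_PER_SHARD))   # 6
# At the suggested size, the same corpus needs far more shards.
print(math.ceil(TOTAL_DOCS / RECOMMENDED_PER_SHARD))  # 48
```

So the proposed 6 cores would run each shard at the Lucene limit with no growth headroom, which is exactly the situation that produced the negative numFound above.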



Re: Negative value in numFound

2012-05-28 Thread William Bell
You went over the max limit for number of docs.

On Monday, May 28, 2012, tosenthu wrote:

 Hi

 I have an index of size 1 TB, which I prepared by setting up a
 background
 script to index records. The index was fine for the last 2 days, and I
 have not disturbed the process. Suddenly, when I queried the index, I got
 this response, where the value of numFound is negative. Can anyone say
 why/how this occurs, and also the cure.

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
     <lst name="params">
       <str name="indent">on</str>
       <str name="start">0</str>
       <str name="q">*:*</str>
       <str name="version">2.2</str>
       <str name="rows">10</str>
     </lst>
   </lst>
   <result name="response" numFound="-2125433775" start="0"/>
 </response>

 Regards
 Senthil Kumar M R

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


suggestions developing a multi-version concurrency control (MVCC) mechanism

2012-05-28 Thread Nicholas Ball

Hello all,

For the first step of the distributed snapshot isolation system I'm
developing for Solr, I'm going to need to have a MVCC mechanism as opposed
to the single-version concurrency control mechanism already developed
(DistributedUpdateProcessor class). I'm trying to find the very best way to
develop this into Solr 4.x (trunk) and so any help would be greatly
appreciated!

Essentially I need to be able to store multiple versions of a document so
that when you look up a document with a given timestamp, you're given the
correct version (anything the same or older, not fresher). The older
versioned documents need to be stored in the index itself to ensure they
are durable and can be manipulated as other Solr data can be.

One way to do this is to store the old versioned Solr documents within the
latest Solr document, but I'm not sure this is even possible?
Alternatively, I could have the latest versioned document store the unique
keys which point to other, older documents. The problem with this is that
it complicates things, having various partial objects which all combine
into one logical document.

Are there any suggestions as to the best way to develop this feature?

Thank you in advance for any help you can spare!

Nicholas


Re: xpathentityprocessor not import all documents

2012-05-28 Thread Jack Krupansky
Try adding rootEntity=false to the FilePath entity. The DIH code ends up 
ignoring your rootEntity=true on the XPathEntityProcessor entity if the 
parent does not have rootEntity=false. I'm not sure if that is really 
correct, but that's the way the code is.


-- Jack Krupansky

-Original Message- 
From: Sagar Joshi

Sent: Monday, May 28, 2012 11:47 AM
To: solr-user@lucene.apache.org
Subject: xpathentityprocessor not import all documents

I have XML files that need to be imported into Solr.
The XML looks like this:

<root>
  <doc>
    <id>1</id>
    <name>albert</name>
    <add>LA</add>
  </doc>
  <doc>
    <id>2</id>
    <name>john</name>
    <add>NY</add>
  </doc>
</root>

The XML file path is in a SQL database, so I have created the
DataImportHandler config as below:

<dataConfig>
  <dataSource name="ds1" ... />
  <dataSource type="FileDataSource" name="FD"/>
  <document name="Emp">
    <entity name="FilePath" query="SelectFilePathFromDB">
      <entity name="xmlEntity" onError="continue" rootEntity="true"
              processor="XPathEntityProcessor" forEach="/root/doc"
              url="${FilePath.Path}" dataSource="FD">
        <field xpath="root/doc/id" column="id"/>
        <field xpath="root/doc/name" column="name"/>
        <field xpath="root/doc/add" column="add"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Now when I do a full import, it adds only one document; but when I remove
the FilePath entity and give a static path in url, it imports all
documents properly.

Where am I making a mistake?


Sagar Joshi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/xpathentityprocessor-not-import-all-documents-tp3986441.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: suggestions developing a multi-version concurrency control (MVCC) mechanism

2012-05-28 Thread Lance Norskog
You can use the document id and timestamp as a compound unique id.
Then the search would also sort by id, then by timestamp. Result
grouping might let you pick the most recent document from each of the
sorted docs.

On Mon, May 28, 2012 at 3:15 PM, Nicholas Ball
nicholas.b...@nodelay.com wrote:

 Hello all,

 For the first step of the distributed snapshot isolation system I'm
 developing for Solr, I'm going to need to have a MVCC mechanism as opposed
 to the single-version concurrency control mechanism already developed
 (DistributedUpdateProcessor class). I'm trying to find the very best way to
 develop this into Solr 4.x (trunk) and so any help would be greatly
 appreciated!

 Essentially I need to be able to store multiple versions of a document so
 that when you look up a document with a given timestamp, you're given the
 correct version (anything the same or older, not fresher). The older
 versioned documents need to be stored in the index itself to ensure they
 are durable and can be manipulated as other Solr data can be.

 One way to do this is to store the old versioned Solr documents within the
 latest Solr document, but I'm not sure this is even possible?
 Alternatively, I could have the latest versioned document store the unique
 keys which point to other, older documents. The problem with this is that
 it complicates things, having various partial objects which all combine
 into one logical document.

 Are there any suggestions as to the best way to develop this feature?

 Thank you in advance for any help you can spare!

 Nicholas



-- 
Lance Norskog
goks...@gmail.com
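A minimal sketch of the read side of Lance's compound-key idea, outside Solr: keep each logical document's versions sorted by timestamp, and a snapshot read picks the newest version at or before the snapshot time. Illustrative Python with made-up data, not Solr code:

```python
from bisect import bisect_right

def snapshot_read(timestamps, docs, snapshot_ts):
    """Given parallel lists of version timestamps (sorted ascending) and
    document payloads for one logical key, return the newest version at or
    before snapshot_ts, or None if the document did not yet exist."""
    i = bisect_right(timestamps, snapshot_ts)
    return docs[i - 1] if i else None

timestamps = [100, 200, 300]         # version timestamps for one logical id
docs = ["v1", "v2", "v3"]            # the corresponding stored versions

print(snapshot_read(timestamps, docs, 250))  # v2: newest version <= 250
print(snapshot_read(timestamps, docs, 50))   # None: nothing existed yet
```

In Solr terms this maps to sorting on the timestamp component of the compound id and grouping on the logical id, as suggested above.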


Re: boost date for fresh result query

2012-05-28 Thread Jonty Rhods
Please suggest something; I am stuck here.

On Mon, May 28, 2012 at 3:21 PM, Jonty Rhods jonty.rh...@gmail.com wrote:

 Hi

 I am facing a problem boosting on a date field.
 I have the following field in the schema:
    <field name="release_date" type="date" indexed="true" stored="true"/>

 Solr version 3.4.
 I don't want to sort by date, but I want to give a 50 to 60% boost to
 those results which have the latest date...

 The query is:


 http://localhost:8083/solr/movie/select/?defType=dismax&q=titanic&fq=title,tags&qf=tags^2.0,title^1.0&mm=1&bf=recip(rord(release_date),1,1000,1000)

 but it seems the boost has no effect, because the old-date results are
 still on top; or maybe I have the query syntax wrong. Is there anything
 missing in the query?

 Please help me make this boost function work.

 thanks

 regards
 Jonty



Re: boost date for fresh result query

2012-05-28 Thread Jack Krupansky
Add debugQuery=true to your query and look at the scores of the older vs. 
newer docs compared to the boost. Maybe the boost needs to be increased.


-- Jack Krupansky

-Original Message- 
From: Jonty Rhods

Sent: Monday, May 28, 2012 5:51 AM
To: solr-user@lucene.apache.org
Subject: boost date for fresh result query

Hi

I am facing a problem boosting on a date field.
I have the following field in the schema:
  <field name="release_date" type="date" indexed="true" stored="true"/>

Solr version 3.4.
I don't want to sort by date, but I want to give a 50 to 60% boost to
those results which have the latest date...

The query is:

http://localhost:8083/solr/movie/select/?defType=dismax&q=titanic&fq=title,tags&qf=tags^2.0,title^1.0&mm=1&bf=recip(rord(release_date),1,1000,1000)

but it seems the boost has no effect, because the old-date results are
still on top; or maybe I have the query syntax wrong. Is there anything
missing in the query?

Please help me make this boost function work.

thanks

regards
Jonty 
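For reference, a small sketch of what that bf actually contributes. Solr's recip(x, m, a, b) computes a/(m*x + b), so with recip(rord(release_date),1,1000,1000) the freshest document (reverse ordinal 1) gets at most about 1.0 added to its score, which typical relevance scores can easily dwarf; that is why checking debugQuery, as suggested above, is worthwhile (illustrative Python):

```python
def recip(x, m, a, b):
    # Solr's recip function query: a / (m*x + b)
    return a / (m * x + b)

# bf=recip(rord(release_date),1,1000,1000): the additive boost by
# reverse date rank. Roughly: rank 1 -> ~1.0, rank 1000 -> 0.5,
# rank 10000 -> ~0.09.
for rank in (1, 10, 100, 1000, 10000):
    print(rank, recip(rank, 1, 1000, 1000))
```

If the contribution is too small relative to the tf-idf scores in the debug output, increasing it (for example via a multiplicative boost, or scaling the function) is the usual fix.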



RE: useFastVectorHighlighter doesn't work

2012-05-28 Thread ZHANG Liang F
Hi,
The reason I use useFastVectorHighlighter is that I want to set
stored="false", with additional settings like termVectors="true"
termPositions="true" termOffsets="true". If stored="true", what is the
difference between the normal highlighter and useFastVectorHighlighter?
What is the right situation for using useFastVectorHighlighter?

Thanks!

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: May 28, 2012 16:40
To: solr-user@lucene.apache.org
Subject: Re: useFastVectorHighlighter doesn't work

 I had a schema defined as <field name="text" type="text"
 indexed="true" stored="false" termVectors="true"
 termPositions="true" termOffsets="true"/>

You need to mark your text field as stored="true" to use
hl.useFastVectorHighlighter=true


Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky

... is this limitation documented anywhere...

Kind of, but not very well, at least at the Lucene level.

The Lucene File Formats page says Lucene uses a Java int to refer to 
document numbers, and the index file format uses an Int32 on-disk to store 
document numbers. This is a limitation of both the index file format and the 
current implementation. It also says that The first document added to an 
index is numbered zero. Since Java Integer.MAX_VALUE is 2^31-1, that means 
the maximum number of documents in a single Lucene (or Solr) index is 2^31.


See:
http://lucene.apache.org/core/3_6_0/fileformats.html

And the Lucene IndexSearcher API uses int for document number and number 
of documents in index.

See:
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/IndexSearcher.html

There is a little discussion of the limit issue here:
https://issues.apache.org/jira/browse/LUCENE-2420

I am not aware of any explicit mention of the single-index Lucene document 
limit at the Solr level.


-- Jack Krupansky

-Original Message- 
From: tosenthu

Sent: Monday, May 28, 2012 11:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Negative value in numFound

Hi

It is a multicore setup, but even when I searched with the shards query I
still got this response:

<result name="response" numFound="-390662429" start="0"/>

which is again a negative value.

The total number of records may be > 2147483647 (2^31-1), but is this
limitation documented anywhere? What is the strategy to overcome this
situation? My application is expected to hold 12 billion records, so
please suggest a strategy for my situation.

Regards
Senthil Kumar M R



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986439.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Accent Characters

2012-05-28 Thread Jack Krupansky
The query seems fine - as far as the URL being UTF-8. It seems that the 
documents are not being passed to Solr with UTF-8 encoding. The document is 
not part of the URL. It is HTTP POST data.


Try an explicit curl command to add a document and see if it is indexed with 
the accents.


-- Jack Krupansky

-Original Message- 
From: couto.vicente

Sent: Monday, May 28, 2012 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Accent Characters

Hi, Jack.
First of all, thank you for your help.
Well, I tried again and then realized that my problem is not really with Solr.
I ran this query against Solr after starting it up with the command java
-jar start.jar:
http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9senta&spellcheck=true&spellcheck.collate=true&rows=0&spellcheck.count=10

It gives me the result:
<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">31</int>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="présenta">
        <int name="numFound">10</int>
        <int name="startOffset">8</int>
        <int name="endOffset">16</int>
        <arr name="suggestion">
          <str>présente</str>
          <str>présent</str>
          <str>présenté</str>
          <str>présents</str>
          <str>présentant</str>
          <str>présentera</str>
          <str>présentait</str>
          <str>présentes</str>
          <str>présenter</str>
          <str>présentée</str>
        </arr>
      </lst>
      <str name="collation">content:présente</str>
    </lst>
  </lst>
</response>

And I ran exactly the same query after deploying solr.war in Tomcat 7. Here
is my result:
<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">16</int>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="présenta">
        <int name="numFound">10</int>
        <int name="startOffset">8</int>
        <int name="endOffset">16</int>
        <arr name="suggestion">
          <str>present</str>
          <str>prbsent</str>
          <str>presentant</str>
          <str>presentait</str>
          <str>puisent</str>
          <str>pasent</str>
          <str>pensent</str>
          <str>posent</str>
          <str>dresent</str>
          <str>resenti</str>
        </arr>
      </lst>
      <str name="collation">content:present</str>
    </lst>
  </lst>
</response>

Since my application is running under Tomcat, it means I have some issue
with Tomcat. The weird thing is that I already googled for a fix and found
out that we have to set a parameter in the server.xml Tomcat config file:

<Connector port="5443" protocol="HTTP/1.1"
           connectionTimeout="2"
           redirectPort="8443"
           URIEncoding="UTF-8" />

But it's not working, as you can see.
I'm feeling a little stupid because it doesn't look like a big problem.
Surely people around the world are using Solr with accented queries under
Tomcat properly!

Thank you

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986423.html
Sent from the Solr - User mailing list archive at Nabble.com. 
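Incidentally, the degraded suggestions in the Tomcat response are consistent with a charset mismatch somewhere in the request path: UTF-8 bytes being decoded as ISO-8859-1, which is Tomcat's historical default. A tiny illustrative Python sketch of that failure mode:

```python
s = "présenta"
# What a server sees when UTF-8 request bytes are decoded as ISO-8859-1:
# each two-byte UTF-8 sequence turns into two Latin-1 characters.
garbled = s.encode("utf-8").decode("iso-8859-1")
print(garbled)  # prÃ©senta
```

If the spellcheck query reaches Solr in this garbled form, the accented suggestions can no longer match, so checking every hop (connector URIEncoding, POST body encoding, any proxy in between) for UTF-8 is worthwhile.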



Multicore Issue - Server Restart

2012-05-28 Thread Sujatha Arun
Hello,

We have a multicore webapp for every 50 cores. Currently there are 3
multicore webapps and 150 cores distributed across the 3 webapps.

When we restarted the server [Tomcat], we noticed that the solr.xml was
wiped out: we could not see any cores in webapp1 and webapp3, and only a
few cores in webapp2.

The solr.xml has persistent="true". Given this, what could have possibly
happened?

The solution was to add all the cores manually to solr.xml and restart the
server, but I am unsure what would have caused this, and there is no
indication of any issue in the logs either.

Regards
Sujatha