Solr UpdateJSON - extra fields

2011-09-25 Thread msingla
If the JSON being posted to the 'http://localhost:8983/solr/update/json' URL has
extra fields that are not defined in the index schema, will those be
silently ignored or will an error be thrown?



Re: Update ingest rate drops suddenly

2011-09-25 Thread eks dev
Thanks Otis,
We will look into these issues again, slightly deeper. Network
problems are not likely, but the DB, I do not know; it is a huge select...
We will try to scan the DB without indexing, just to see if it can
sustain the rate. But gut feeling says: nope, this is not the one.

IO saturation would surprise me, but you never know. It might very
well be that the SSD is somehow having problems with this sustained
throughput.

8 cores... no, this was a single update thread.

We left the default index settings (do not tweak it if it works :)
<ramBufferSizeMB>32</ramBufferSizeMB>

32 MB holds a lot of our documents (100 bytes average on-disk size).
Assuming RAM efficiency of 50% (?), we land at ~100k buffered
documents. Yes, this is kind of smallish, as we fill up the RAM buffer
every ~3 seconds (our analyzers surprised me with 30k+ records per
second).

256 will do the job; ~24 seconds should be plenty of idle time for
IO/OS/JVM to sort out MMAP issues, if any (Windows was never an MMAP
performance champion when used from Java, but once you dance
around it, it works OK)...
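For reference, this setting lives in the index settings section of solrconfig.xml; a minimal sketch of the change being discussed:

  <ramBufferSizeMB>256</ramBufferSizeMB>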


Max JVM heap on this test was 768m; memory never went above 500m.
Using -XX:-UseParallelGC... this is definitely not a GC problem.

cheers,
eks


On Sun, Sep 25, 2011 at 6:20 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 eks,

 This is clear as day - you're using Winblows!  Kidding.

 I'd:
 * watch IO with something like vmstat 2 and see if the rate drops correlate 
 to increased disk IO or IO wait time
 * monitor the DB from which you were pulling the data - maybe the DB or the 
 server that runs it had issues
 * monitor the network over which you pull data from DB

 If none of the above reveals the problem I'd still:
 * grab all data you need to index and copy it locally
 * index everything locally

 Out of curiosity, how big is your ramBufferSizeMB and your -Xmx?
 And on that 8-core box you have ~8 indexing threads going?

 Otis
 
 Sematext is Hiring -- http://sematext.com/about/jobs.html





From: eks dev eks...@yahoo.co.uk
To: solr-user solr-user@lucene.apache.org
Sent: Saturday, September 24, 2011 3:18 PM
Subject: Update ingest rate drops suddenly

Just looking for hints on where to look...

We were testing single-threaded ingest rate on Solr (trunk version) on
an atypical collection (a lot of small documents), and we noticed
something we are not able to explain.

Setup:
We use defaults for index settings; Windows 64-bit, JDK 7 u2, on SSD,
a machine with enough memory and 8 cores. The schema has 5 stored fields,
4 of them indexed with no positions and no norms.
Average net document size (optimized index size / number of documents)
is around 100 bytes.
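(For illustration, a field declared that way in schema.xml might look like the following; the field name here is made up.)

  <field name="code" type="string" indexed="true" stored="true"
         omitNorms="true" omitTermFreqAndPositions="true"/>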

On a test with 40 Mio documents:
- we had an update ingest rate on the first 4.4 Mio documents of an incredible
34k records/second...
- then it dropped suddenly to 20k records per second, and this rate
remained stable (variance 1k) until...
- we hit 13 Mio, where the ingest rate dropped really hard again, from one
instant to the next, to 10k records per second.

It stayed there until we reached the end at 40 Mio (slightly reducing to
ca. 9k, but this was not long enough to see a trend).

Nothing unusual was happening with JVM memory (sawtooth 200-450M, fully
regular). CPU, in turn, was following the ingest rate trend, indicating
that we were waiting on something. No searches, no commits, nothing.

autoCommit was turned off. Updates were streaming directly from the database.

-
I did not expect something like this, knowing Lucene merges in the
background. Also, having such sudden drops in ingest rate indicates
that we are not leaking something (a leak would cause a much more
gradual drop). It is some cache, but why two really significant
drops? 33k/sec to 20k and then to 10k... We would love to keep it
@ 34k/second :)

I am not really acquainted with the new MergePolicy and flushing
settings, but I suspect there is something there we could tweak.

Could it be that Windows is somehow, hmm, quirky with the Solr default
directory on win64/JVM (I think it is MMAP by default)? We did not
saturate IO with such small documents, I guess; it is just a couple
of GB over 1-2 hours.

All in all, it works well, but are such hard drops in the update ingest
rate normal?

Thanks,
eks.





Re: matching response and request

2011-09-25 Thread Roland Tollenaar

Hi Otis,

this is absolutely brilliant! I did not think it was possible.

It opens up a new possibility.

If I insert device IDs in this manner (as in, a unique identifier of the
device sending the request), might it be possible to control (at least
block or permit) the permissions of the user?


It seems like something of the sort is possible, but I only come up with
this:


http://search-lucene.com/m/Yuib11zCeYN

But there is no pointer to where the permissions can be set (in the schema) or to how
requests are identified as coming from a particular user/device.


Thanks for your help.

Kind regards,

Roland


Otis Gospodnetic wrote:

Hi Roland,

Check this:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">solr</str>
    <str name="foo">1</str>   <=== from foo=1
    <str name="version">2.2</str>
    <str name="rows">10</str>
  </lst>
</lst>
 
I added foo=1 to the request to Solr and got the above back.
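For reference, the request that produced that header would have looked something like this (the host and handler are assumed to be the defaults):

  http://localhost:8983/solr/select?indent=on&start=0&q=solr&foo=1&version=2.2&rows=10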


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/






From: Roland Tollenaar rwatollen...@gmail.com
To: solr-user@lucene.apache.org
Sent: Saturday, September 24, 2011 4:07 AM
Subject: matching response and request

Hi,

sorry for this question but I am hoping it has a quick solution.

I am sending multiple GET request queries to Solr, but Solr is not returning the
responses in the sequence in which I send the requests.

The shortest responses arrive back first.

I am wondering whether I can add a tag to the request which will be given back
to me in the response, so that when the response comes I can connect it to the
original request and handle it in the appropriate manner.

If this is possible, how?

Help appreciated!

Regards,

Roland.





How to apply filters to stored data

2011-09-25 Thread drogon
Is it possible to apply filters to stored data the way we can apply filters when
indexing? For example, I use KeepWordFilter on a field during indexing, but I
don't want the filtered-out data to even be stored, i.e. I want the content in the
index and the stored content for this field to be the same.
Also, when retrieving data (querying Solr) I find that the content retrieved
is the stored data. Is it possible to get the data that is indexed, as
opposed to the stored one?
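(For context, a minimal sketch of such an index-time analyzer in schema.xml; the field type name and the keepwords.txt file are illustrative.)

  <fieldType name="text_keep" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>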




Re: matching response and request

2011-09-25 Thread Roland Tollenaar

Hi,

actually you are right in the sense that this should be sorted out a 
layer lower, i.e. at the server-client connection level. Done that as well.


Thanks for the response.

Regards,

Roland

rkuris wrote:

I don't think you can do this.

If you are sending multiple GET requests, you are doing it across different
HTTP connections.  The web service has no way of knowing these are related.

One solution would be to pass a spare, unused parameter to your request,
like sequenceId=NNN and get the response to echo that back.  Then at least
you can tell which one is coming back and fix the order up in your program.
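For example (the parameter name and value are just illustrative, and this assumes the default echoParams behavior shown earlier in the thread), a request like

  http://localhost:8983/solr/select?q=solr&sequenceId=42

comes back with the extra parameter echoed in the response header:

  <lst name="responseHeader">
    <lst name="params">
      <str name="q">solr</str>
      <str name="sequenceId">42</str>
    </lst>
  </lst>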





Multiple servers support

2011-09-25 Thread Raja Ghulam Rasool
Hi,

I am new to Solr and am currently studying it. We are planning to
implement Solr in our production setup. We have 15 servers where we are
getting the data. The data is huge: we are supposed to keep 150 terabytes
of data (in terms of documents it will be around 2,592,000 documents
per server) across all servers combined. We have the
necessary storage capacity. Can anyone let me know whether Solr will be a
good solution for our text search needs? We are required to provide text
searches on a certain limited number of fields.

1- Does Solr support such an architecture, i.e. multiple servers? What
specific areas of Solr do I need to explore (shards, cores, etc.)?
2- Any idea whether we will really benefit from a Solr implementation for text
searches, versus, let us say, Oracle Text Search? Currently our Oracle Text
search is giving very bad performance and we are looking to somehow
improve our text search performance.
Any high-level pointers or help will be greatly appreciated.

thanks in advance guys

-- 
Regards,
Raja


escaping HTML tags within XML file

2011-09-25 Thread okayndc
Hello,

I was wondering if it is necessary to escape HTML tags within an XML file for
indexing? If so, it seems like large XML files with tons of HTML tags could
get really messy (using CDATA).
Has this been your experience? Do you escape the HTML tags? If so, what
technique do you use? Or do you leave the HTML tags in place without
escaping them?

Thanks!


Re: Solr UpdateJSON - extra fields

2011-09-25 Thread Erick Erickson
This is really easy to try, and you get the same kind
of error you get with undefined fields in XML.

Best
Erick

On Sat, Sep 24, 2011 at 11:29 PM, msingla msin...@hotmail.com wrote:
 If JSON being posted to ''http://localhost:8983/solr/update/json' URL has
 extra fields that are not defined in the index schema definition, will those
 be silently ignored or an error thrown.




Re: How to apply filters to stored data

2011-09-25 Thread Erick Erickson
No and no.

Hmmm, that's a bit terse. The split between stored and indexed
happens quite early in the update process; there's no way I know
of to use the tokenized stream as the input to your stored data.

And there's no out-of-the-box way to get the indexed tokens back.
For anything except very small fields, this would be quite costly.

What problem are you trying to solve? Perhaps this is an XY problem.
See: http://people.apache.org/~hossman/#xyproblem

Best
Erick

On Sun, Sep 25, 2011 at 1:54 AM, drogon jithin1...@gmail.com wrote:
 Is it possible to apply filters to stored data like we can apply filter when
 indexing. For example I use KeepWordFilter on a field during indexing. But I
 don't want filtered data to be even stored ie I want the content in index
 and store for this field to be same.
 Also when retrieving data(querying solr) I find that the content retrieved
 is the stored data. Is it possible to get the data that is indexed as
 against the stored one?





Re: term vector parser in solr.NET

2011-09-25 Thread Mauricio Scheffer
TermVectorComponent support is a pending issue:
http://code.google.com/p/solrnet/issues/detail?id=68

Please use the SolrNet mailing list for specific questions about it:
http://groups.google.com/group/solrnet

Cheers,
Mauricio


On Mon, Sep 19, 2011 at 7:33 AM, jame vaalet jamevaa...@gmail.com wrote:

 hi,
 I was wondering if there is any method to get back the term vector list
 from Solr through SolrNet?
 From the source code for SolrNet I couldn't notice any term vector parser
 in SolrNet.

 --

 -JAME



Re: Multiple servers support

2011-09-25 Thread Erick Erickson
Well, this is not a neutral forum <G>...

A common use-case for Solr is exactly to replace
database searches because, as you say, search
performance in a database is often slow and limited.
RDBMSs do very complex stuff very well, but they
are not designed for text searching.

Scaling is accomplished by either replication or
sharding. Replication is used when the entire index
fits on a single machine and you can get
reasonable responses. I've seen 40-50M docs fit
quite comfortably on one machine. But 150TB
*probably* indicates that this isn't reasonable in your
case.

If you can't fit the entire index on one machine, then
you shard, which splits up the single logical index
into multiple slices and Solr automatically will query
all the shards and assemble the parts into a single
response.
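(A sketch of what a sharded request looks like once the index is split; the host names here are illustrative.)

  http://host1:8983/solr/select?q=some+text&shards=host1:8983/solr,host2:8983/solr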

But you absolutely cannot guess the hardware
requirements ahead of time. It's like answering
"How big is a Java program?" There are too
many variables. But Solr is free, right? So you
absolutely have to get a copy and put your 2.5M
docs on it and test (Solrmeter or jMeter are
good options). If you get adequate throughput, add
another 1M docs to the machine. Keep on until
your QPS rate drops and you'll have a good idea how
many documents you can put on a single machine.
There's really no other way to answer that question.

Best
Erick

On Sun, Sep 25, 2011 at 5:55 AM, Raja Ghulam Rasool the.r...@gmail.com wrote:
 Hi,

 I am new to Solr, and I am studying it currently. We are planning to
 implement Solr in our production setup. We have 15 servers where we are
 getting the data. The data is huge, like we are supposed to keep 150 Tera
 bytes of data (in terms of documents it will be around  2592000 documents
 per server), across all servers (combined). We have the
 necessary storage capacity. Can anyone let me know whether Solr will be a
 good solution for our text search needs ? We are required to provide text
 searches or certain limited number of fields.

 1- Does Solr support such architecture, i.e. multiple servers ? what
 specific area in Solr do i need to explore (shards, cores etc, ???)
 2- Any idea whether we will really benefit from Solr implementation for text
 searches, vs let us say Oracle Text Search ? Currently our Oracle Text
 search is giving a very bad performance and we are looking to some how
 improve our text search performance
 any high level pointers or help will be greatly appreciated.

 thanks in advance guys

 --
 Regards,
 Raja



Re: Production Issue: SolrJ client throwing this error even though field type is not defined in schema

2011-09-25 Thread pulkitsinghal
If I had to give a gentle nudge, I would ask you to validate your schema XML
file. You can do so by finding any W3C XML validator website and just
copy-pasting the text there to find out where it's malformed.

Sent from my iPhone

On Sep 24, 2011, at 2:01 PM, Erick Erickson erickerick...@gmail.com wrote:

 You might want to review:
 
 http://wiki.apache.org/solr/UsingMailingLists
 
 There's really not much to go on here.
 
 Best
 Erick
 
 On Wed, Sep 21, 2011 at 12:13 PM, roz dev rozde...@gmail.com wrote:
 Hi All
 
 We are getting this error in our Production Solr Setup.
 
 Message: Element type "t_sort" must be followed by either attribute
 specifications, ">" or "/>".
 Solr version is 1.4.1
 
 The stack trace indicates that Solr is returning a malformed document.
 
 
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error
 executing query
at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
 com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232)
... 15 more
 Caused by: org.apache.solr.common.SolrException: parsing error
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140)
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101)
at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481)
at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
... 17 more
 Caused by: javax.xml.stream.XMLStreamException: ParseError at
 [row,col]:[3,136974]
  Message: Element type "t_sort" must be followed by either attribute
  specifications, ">" or "/>".
at 
 com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594)
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282)
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410)
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360)
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241)
at 
 org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125)
... 21 more
 


Re: How to apply filters to stored data

2011-09-25 Thread Jithin
Hi Erick, the problem I am trying to solve is filtering out invalid entities.
Users might misspell or enter a new entity name. These new/invalid entities
need to pass through a KeepWordFilter so that they won't pollute our
autocomplete results.

I was looking into Luke, and it does seem to solve my use case, but is Luke
something I can use in a production setup?
Also, when does copyField happen? Is the data being copied the result of
applying all the filters, or the unmodified input?



Re: How to apply filters to stored data

2011-09-25 Thread Erick Erickson
See below:

On Sun, Sep 25, 2011 at 9:53 AM, Jithin jithin1...@gmail.com wrote:
 Hi Erick, The problem I am trying to solve is to filter invalid entities.
 Users might mispell or enter a new entity name. This new/invalid entities
 need to pass through a KeepWordFilter so that it won't pollute our
 autocomplete result.


Right. But if you have a KeepWordFilter, that implies that you have a list
of known good words. Couldn't you use that file as your base for the
autosuggest component?

 I was looking into Luke. And it does seem to solve my use case, but is Luke
 something I can use in a production setup?

You'll find the performance unacceptably slow if you try to do something
similar in production. The nature of an inverted index makes reconstructing
a document from the various terms costly.


 Also when does copyField happens? Is the data being copied a result of
 application of all filters or unmodified one?

copyField happens to the raw input, not the result of your
analysis chain. And you can't chain copyField directives,
i.e.
<copyField source="field1" dest="field2"/>
<copyField source="field2" dest="field3"/>

would not put the contents of field1 into field3.

Best
Erick





Re: How to apply filters to stored data

2011-09-25 Thread Jithin

Erick Erickson wrote:
 
 See below:
 
  On Sun, Sep 25, 2011 at 9:53 AM, Jithin <jithin1...@gmail.com>
 wrote:
 Hi Erick, The problem I am trying to solve is to filter invalid entities.
 Users might mispell or enter a new entity name. This new/invalid entities
 need to pass through a KeepWordFilter so that it won't pollute our
 autocomplete result.

 
 Right. But if you have a KeepWordFilter, that implies that you have a list
 of known good words. Couldn't you use that file as your base for the
 autosuggest component?
 

I think that is possible.
But is there any other mechanism within Solr/Lucene to preprocess stored
data?



Re: escaping HTML tags within XML file

2011-09-25 Thread pulkitsinghal
Assuming that the XML has the HTML as values inside fully formed tags, like so:
<node><HTML></HTML></node>, then I think that using the HTML field type in
schema.xml for indexing/storing will allow you to do meaningful searches on the
content of the HTML without getting confused by the HTML syntax itself.

If you have absolutely no need for the entire stored HTML when presenting 
results to the user then stripping out the syntax at index time makes sense. 
This will adversely affect highlighting of  that document field as well so just 
know your requirements.

If you don't want to present anything at all then don't store, just index and 
use the right field type (HTML) such that search results find the right 
document. Just because a field is helpful in finding the doc, doesn't mean 
folks always want to present it or store it.

With the Data Import Handler, an HTML-stripping transformer is available so that the
markup is removed before the indexer gets its hands on things. I can't be sure if that
is how you get your data into Solr.

- Pulkit

Sent from my iPhone

On Sep 25, 2011, at 8:00 AM, okayndc bodymo...@gmail.com wrote:

 Hello,
 
 Was wondering if it is necessary to escape HTML tags within an XML file for
 indexing?  If so, seems like a large XML files with tons of HTML tags could
 get really messy (using CDATA).
 Has this been your experience?  Do you escape the HTML tags? If so, what
 technique do you use? Or do you leave the HTML tags in place without
 escaping them?
 
 Thanks!


Seek your wisdom for implementing 12 million docs..

2011-09-25 Thread Ikhsvaku S
Hi List,

We are pretty new to Solr & Lucene and have just started indexing a few 10K
documents using Solr. Before we attempt anything bigger, we want to see what
the best approach should be.

Documents: We have close to ~12 million XML docs of varying sizes, average
size 20 KB. These documents have 150 fields, which should be searchable &
indexed. Over 80% of them are fixed-length string fields, and a few strings are
multivalued (e.g. title, headline, id, submitter, reviewers,
suggested-titles, etc.); another 15% are date-specific (added-on,
reviewed-on, etc.). The rest are multivalued text fields (e.g.
description, summary, comments, notes, etc.). Some of the documents have a
large number of these text fields (so we are leaning against storing these
in the index). Approximately ~6000 such documents are updated & 400-800 new ones
are added each day.

Queries: A typical query would mainly be on string fields (~60% of queries),
e.g. a simple one would be: find document IDs of documents whose author is
XYZ & submitted between [X-Z] & whose status is reviewed or pending review
& title has this string, etc. The results of these are exact in nature
(found 300 docs). The rest of the searches would include the text fields, where
users search quoted snippets or phrases... Almost all queries have multiple
operators. Also, each one would want to grab as many result rows as possible
(we are limiting this to 2000). The output shall contain only 1-5 fields.
(No highlighting etc. needed.)

Available hardware:
The existing hardware we could find consists of ~300 GB of SAN on each of
4 boxes with ~96 GB RAM each. We also have a couple of older HP DL380s (mainly
intended for offline indexing). All of this is on 10G Ethernet.

Questions:
Our priority is to provide results fast, and new or updated documents
should be indexed within 2 hours. Users are also known to use complex queries
for data mining. Given all this, are there any recommendations for indexing the
data and fields?
How do we scale, and what architecture should we follow here? Master/slave
servers? Any possible issues we may hit?

Thanks


Re: How to apply filters to stored data

2011-09-25 Thread Erick Erickson
Not that I know of...

On Sun, Sep 25, 2011 at 11:15 AM, Jithin jithin1...@gmail.com wrote:

 Erick Erickson wrote:

 See below:

  On Sun, Sep 25, 2011 at 9:53 AM, Jithin <jithin1...@gmail.com>
 wrote:
 Hi Erick, The problem I am trying to solve is to filter invalid entities.
 Users might mispell or enter a new entity name. This new/invalid entities
 need to pass through a KeepWordFilter so that it won't pollute our
 autocomplete result.


 Right. But if you have a KeepWordFilter, that implies that you have a list
 of known good words. Couldn't you use that file as your base for the
 autosuggest component?


 I think that is possible.
 But is there any  other mechanism within solr/lucene to preprocess stored
 data.




Re: Seek your wisdom for implementing 12 million docs..

2011-09-25 Thread Erick Erickson
Round N + 1 of "it depends" <G>. This isn't a very big
index as Solr indexes go; my first guess would be that
you can easily fit this on the machines you're talking
about. But, as always, how you implement things may
prove me wrong.

Really, about the only thing you can do is try it. Be
aware that the size of the index is a tricky concept.
For instance, if you store your data (stored=true), the
files in your index directory will NOT reflect the total memory
requirements since verbatim copies of your fields are
held in the *.fdt files and really don't affect searching speed.

Here's what I claim:
1> You can index these 12M documents in a reasonable time. I
 index 1.9M documents (a Wikipedia dump) on my MacBook Pro
 in just a few minutes (< 10 as I remember). So you can
 just try things.
2> Use a Master/Slave architecture. You can control how fast
 the updates are available by the polling interval on the slave
 and how fast you commit. 2 hours is easy; 10 minutes is
 a reasonable goal here. (There is a config sketch after this list.)
3> Consider edismax-style handlers. The point here is that
 they allow you to tune relevance much more finely than
 a "bag of words" approach in which you index many fields
 into a single text field.
4> You only really need to store the fields you intend to display
 as part of your search results. Assuming you're going to
 your system-of-record for the full document, your stored
 data may be very small.
5> Be aware that the first few queries will often be much slower
 than later queries, as there are certain caches that need to
 be filled up. See the various warming parameters on the
 caches and the firstSearcher and newSearcher entries
 in the config files.
6> Create a mix of queries and use something like jMeter or
 SolrMeter to determine where your target hardware
 falls down. You have to take some care to create a reasonable
 query set, not just the same query over and over, or you'll
 just get cached results. Fire enough queries at the searcher
 that it starts to perform poorly and tweak from there.
7> Really, really get familiar with two things:
 a> the admin/analysis page, for understanding the analysis
   process.
 b> adding &debugQuery=on to your queries when you don't
  understand what's happening. In particular, that will show
  you the parsed queries; you can defer digging into the
  scoring explanations until later.
8> String types aren't what you want very often. They're really
 suitable for things like IDs, serial numbers, etc. But they are
 NOT tokenized. So if your input is "some stuff" and you search
 for "stuff", you won't get a match. This often confuses people.
 For tokenized processing, you'll probably want one of the
 text variants. String types are even case sensitive...
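As mentioned in point 2>, here is a sketch of the slave-side replication
config in solrconfig.xml (the master URL and the 10-minute poll interval
are illustrative):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:10:00</str>
    </lst>
  </requestHandler>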

But all in all, I don't see what you've described as particularly
difficult, although you'll doubtlessly run into things you don't
expect.

Hope that helps
Erick

On Sun, Sep 25, 2011 at 1:00 PM, Ikhsvaku S ikhsv...@gmail.com wrote:
 Hi List,

 We are pretty new to Solr  Lucene and have just starting indexing few 10K
 documents using Solr. Before we attempt anything bigger we want to see what
 should be the best approach..

 Documents: We have close to ~12 million XML docs, of varying sizes average
 size 20 KB. These documents have 150 fields, which should be searchable 
 indexed. Of which over 80% fixed length string fields and few strings are
 multivalued ones (e.g. title, headline, id, submitter, reviewers,
 suggested-titles etc), there other 15% who are date specific (added-on,
 reviewed-on etc). Rest are multivalued text documents, (E,g,
 description, summary, comments, notes etc). Some of the documents do have
 large number of these text fields (so we are leaning against storing these
 in index). Approximately ~6000 such documents are updated  400-800 new ones
 are added each day

 Queries: A typical query would mainly be on string fields ~ 60% of queries
 e.g. a simple one would be find document ids of documents whose author is
 XYZ  submitted between [X-Z]  whose status is reviewed or pending review
  title has this string etc... the results of which are exacting nature
 (found 300 docs). Rest of searches would include the text fields, where they
 search quoted snippets or phrases... Almost all queries have multiple
 operators. Also each one would want to grab as many result rows as possible
 (we are limiting this to 2000). The output shall contain only 1-5 fields.
 (No highlighting etc needed)

 Available hardware:
 Some of existing hardware we could find consists of existing ~300GB SAN each
 on 4 Boxes with ~96Gig each. We do couple of older HP DL380s (mainly want to
 use for offline indexing). All of this is on 10G Ethernet.

 Questions:
 Our priority is to provide results fast, and the new or updated documents
 should be indexed within 2 hour. Users are also known to use complex queries
 for data mining. 

Re: escaping HTML tags within XML file

2011-09-25 Thread okayndc
Here is a representation of the XML file...

<root>
<commenter>
<comment><p>Text here</p><img src="image.gif" /><p>More text
here</p></comment>
</commenter>
</root>

I want to keep the HTML tags because they keep the formatting (paragraph
tags, etc.) intact for the output. It seems like you're saying that the HTML
can be kept intact with the use of an HTML field type, without having to
escape the HTML tags?
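(If the HTML stays embedded in the XML you post, one way to avoid escaping every tag is to wrap it in CDATA; a sketch based on the snippet above.)

  <comment><![CDATA[<p>Text here</p><img src="image.gif" /><p>More text here</p>]]></comment>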

On Sun, Sep 25, 2011 at 2:52 PM, pulkitsing...@gmail.com wrote:

 Assuming that the XML has the HTML as values inside fully formed tags like
 so:
 <node><HTML></HTML></node>, then I think that using the HTML field type in
 schema.xml for indexing/storing will allow you to do meaningful searches on
 the content of the HTML without getting confused by the HTML syntax itself.

 If you have absolutely no need for the entire stored HTML when presenting
 results to the user then stripping out the syntax at index time makes sense.
 This will adversely affect highlighting of  that document field as well so
 just know your requirements.

 If you don't want to present anything at all then don't store, just index
 and use the right field type (HTML) such that search results find the right
 document. Just because a field is helpful in finding the doc, doesn't mean
 folks always want to present it or store it.

 With Data Import Handler a HTML stripping transformer is present so that it
 is removed before the indexer gets it's hands on things. I can't be sure if
 that is how you get your data into Solr.

 - Pulkit

 Sent from my iPhone

 On Sep 25, 2011, at 8:00 AM, okayndc bodymo...@gmail.com wrote:

  Hello,
 
  Was wondering if it is necessary to escape HTML tags within an XML file
 for
  indexing?  If so, seems like a large XML files with tons of HTML tags
 could
  get really messy (using CDATA).
  Has this been your experience?  Do you escape the HTML tags? If so, what
  technique do you use? Or do you leave the HTML tags in place without
  escaping them?
 
  Thanks!



Re: How to apply filters to stored data

2011-09-25 Thread Erik Hatcher
Well ... DIH can. 

And update processors can. 

And of course client-side indexers.  

But yeah... elbow grease required. 
 
Erik

On Sep 25, 2011, at 16:32, Erick Erickson erickerick...@gmail.com wrote:

 Not that I know of...
 
 On Sun, Sep 25, 2011 at 11:15 AM, Jithin jithin1...@gmail.com wrote:
 
 Erick Erickson wrote:
 
 See below:
 
  On Sun, Sep 25, 2011 at 9:53 AM, Jithin <jithin1...@gmail.com>
 wrote:
 Hi Erick, The problem I am trying to solve is to filter invalid entities.
 Users might mispell or enter a new entity name. This new/invalid entities
 need to pass through a KeepWordFilter so that it won't pollute our
 autocomplete result.
 
 
 Right. But if you have a KeepWordFilter, that implies that you have a list
 of known good words. Couldn't you use that file as your base for the
 autosuggest component?
 
 
 I think that is possible.
 But is there any  other mechanism within solr/lucene to preprocess stored
 data.
 
 


Re: escaping HTML tags within XML file

2011-09-25 Thread Michael Sokolov
Yes - you can index the HTML text only, while keeping the tags in place in
the stored field, using HTMLCharFilter (or possibly XMLCharFilter).  But
you will find that embedding HTML inside XML can be problematic, since
HTML tags don't have to follow the well-formedness constraints that XML
requires.  For example, old-style paragraph tags in HTML were often not
closed, just <p> with no </p>.  If you have stuff like that, you won't
be able to embed it in XML without quoting the < character.  You never said
why you are embedding HTML in XML though.
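(For reference, a minimal field type sketch along these lines; solr.HTMLStripCharFilterFactory is the stock char filter, and the field type name here is made up.)

  <fieldType name="text_html" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>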


-Mike

On 9/25/2011 5:06 PM, okayndc wrote:

Here is a representation of the XML file...

<root>
<commenter>
<comment><p>Text here</p><img src="image.gif" /><p>More text
here</p></comment>
</commenter>
</root>

I want to keep the HTML tags because it keeps the formatting (paragraph
tags, etc) intact for the output.  Seems like you're saying that the HTML
can be kept intact with the use of a HTML field type without having to
escape the HTML tags?





Re: escaping HTML tags within XML file

2011-09-25 Thread pulkitsinghal
Yes sir!

Sent from my iPhone

On Sep 25, 2011, at 4:06 PM, okayndc bodymo...@gmail.com wrote:

 Here is a representation of the XML file...
 
  <root>
  <commenter>
  <comment><p>Text here</p><img src="image.gif" /><p>More text
  here</p></comment>
  </commenter>
  </root>
 
 I want to keep the HTML tags because it keeps the formatting (paragraph
 tags, etc) intact for the output.  Seems like you're saying that the HTML
 can be kept intact with the use of a HTML field type without having to
 escape the HTML tags?
 
 On Sun, Sep 25, 2011 at 2:52 PM, pulkitsing...@gmail.com wrote:
 
 Assuming that the XML has the HTML as values inside fully formed tags like
 so:
  <node><HTML></HTML></node>, then I think that using the HTML field type in
 schema.xml for indexing/storing will allow you to do meaningful searches on
 the content of the HTML without getting confused by the HTML syntax itself.
 
 If you have absolutely no need for the entire stored HTML when presenting
 results to the user then stripping out the syntax at index time makes sense.
 This will adversely affect highlighting of  that document field as well so
 just know your requirements.
 
 If you don't want to present anything at all then don't store, just index
 and use the right field type (HTML) such that search results find the right
 document. Just because a field is helpful in finding the doc, doesn't mean
 folks always want to present it or store it.
 
 With Data Import Handler a HTML stripping transformer is present so that it
 is removed before the indexer gets it's hands on things. I can't be sure if
 that is how you get your data into Solr.
 
 - Pulkit
 
 Sent from my iPhone
 
 On Sep 25, 2011, at 8:00 AM, okayndc bodymo...@gmail.com wrote:
 
 Hello,
 
 Was wondering if it is necessary to escape HTML tags within an XML file
 for
 indexing?  If so, seems like a large XML files with tons of HTML tags
 could
 get really messy (using CDATA).
 Has this been your experience?  Do you escape the HTML tags? If so, what
 technique do you use? Or do you leave the HTML tags in place without
 escaping them?
 
 Thanks!
 


Re: Sending pdf files to Solr for indexing

2011-09-25 Thread Darx Oman
Hi there,

you can use DIH with Tika.
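(A sketch of a DIH configuration along those lines, using FileListEntityProcessor plus TikaEntityProcessor; the paths and field names are illustrative.)

  <dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
      <entity name="files" processor="FileListEntityProcessor"
              baseDir="/path/to/pdfs" fileName=".*\.pdf" recursive="true" rootEntity="false">
        <entity name="tika" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}" format="text">
          <field column="text" name="content"/>
        </entity>
      </entity>
    </document>
  </dataConfig>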