RE: Facet sort numeric values
Oh brilliant, didn't think of it being possible to configure it that way. I had made my own untokenized type, so I guess it would be better for me to control the datatype this way.

Bonus question (hehe): what if these field values also contain alphanumeric values, e.g. Alpha, Bravo, Omega, ...? How would this affect the sorting? I guess the TrieIntField is not applicable then.

Aleksander Akerø @ Gurusoft AS
Mobil: 944 89 054

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: 14. august 2012 17:45
To: solr-user@lucene.apache.org
Subject: Re: Facet sort numeric values

: I'm having a problem with sorting facets. I am using the facet.sort=index
: parameter and it works fine for most of the values.
	...
: Example, when sorting 15, 6, 23, 7, 10, 90 it sorts like this: 10, 15,
: 23, 6, 7, 90, but what I wanted was 6, 7, 10, 15, 23, 90.

What field type are you using? If you use one of the Trie___Field types then the facet values should sort exactly as you describe.

  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

-Hoss
Fwd: Solr 3.5 result grouping is failing
Hi,

I'm trying to group (field collapse) my search results on a field called "site". The schema says it is indexed: <field name="site" type="string" stored="false" indexed="true"/>. But when I query with group.field=site&group.limit=100, I see only 1 group of results being returned, and the group value is null.

This seems to work on another Solr instance which only has a few documents indexed; it seems to fail on bigger indexes. Help is appreciated.

Thanks
Chethan

Sent this message again as it seemed to bounce the first time.
Re: offsets issues with multiword synonyms since LUCENE_33
I don't know whether this was discussed previously, but you can tell the SynonymFilter not to break up your synonyms (breaking them up might be the default); otherwise the parts of multi-word synonyms get new word positions. You can use a KeywordTokenizer to avoid that behaviour:

  <filter class="solr.SynonymFilterFactory" synonyms="Synonyms.txt" ignoreCase="true" expand="false" tokenizerFactory="solr.KeywordTokenizerFactory"/>

with regards, konrad.

On 14.08.2012 18:51, Marc Sturlese wrote:

Well, an example would be:

synonyms.txt: huge,big size

Then I have the docs:
1- The huge fox attacks first
2- The big size fox attacks first

Then if I query for "huge", the highlights for each document are:
1- The <strong>huge</strong> <strong>fox</strong> attacks first
2- The <strong>big size</strong> fox attacks first

The analyzer looks like this:

  <fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
    </analyzer>
  </fieldType>

This was working with a previous version of Solr (couldn't make it work with 3.6, 4-alpha nor 4-beta).
Re: Query regarding dataimporthandler
There is no way to do it within DataImportHandler, but you can configure autoCommit in solrconfig.xml to automatically commit pending updates by time or number of documents.

On Tue, Aug 14, 2012 at 4:11 PM, ravicv ravichandra...@gmail.com wrote:

Hi,

Is there any way to do intermediate commits while indexing data using DataImportHandler? I am using Solr 1.4.

My problem is: sometimes while indexing huge data (about 4 GB), if any user searches while the commit process is still going on, Solr sometimes throws a heap space error. My data before the commit operation is nearly 8 GB, but after both commit and optimize are done it reduces to 4 GB. I am using the full-import option.

Any ideas?

Thanks,
ravichandra

--
Regards,
Shalin Shekhar Mangar.
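For reference, a minimal autoCommit block in solrconfig.xml looks roughly like this; the thresholds are illustrative, not recommendations:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
    </autoCommit>
  </updateHandler>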
Re: scanned pdf with solr cell
When I send a scanned pdf to the extraction request handler, the icon below appears in my Dock.

http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6

I found that text-extractable pdf files trigger the weird icon too.

curl "http://localhost:8983/solr/update/extract?literal.id=solr-word&commit=true" -F myfile=@solr-word.pdf

I wrote a standalone Java program using Tika. When extracting text from all kinds of pdf files, that weird icon pops up :) I will ask the tika ML about this.

  AutoDetectParser _autoParser = new AutoDetectParser();
  File file = new File("solr-word.pdf");
  BodyContentHandler textHandler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  ParseContext context = new ParseContext();
  InputStream input = new FileInputStream(file);
  _autoParser.parse(input, textHandler, metadata, context);
  System.out.println("text : " + textHandler.toString());
  input.close();
  while (true) { }  // keep the JVM alive so the Dock icon can be observed
Re: scanned pdf with solr cell
Ahmet,

the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using headless mode, but this is likely to trigger an exception. Same if your user is not UI-logged-in.

hope it helps.

Paul

On 15 August 2012 at 01:30, Ahmet Arslan wrote:

Hi All,

I have a set of rich documents; some of them are scanned pdf files. When I send a scanned pdf to the extraction request handler, the icon below appears in my Dock.

http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6

Does anyone know what this is?

curl "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true" -F myfile=@ticaret_sicil_gazetesi.pdf

No exception is seen in the Solr logs. The doc is indexed; the content field is:

xmpTPg:NPages 4
Creation-Date 2011-08-24T13:03:16Z
stream_source_info myfile
created Wed Aug 24 16:03:16 EEST 2011
stream_content_type application/octet-stream
stream_size 2302337
producer Image Recognition Integrated Systems, Autoformat5,0,0,229
stream_name ticaret_sicil_gazetesi.pdf
Content-Type application/pdf
creator I.R.I.S.
page page page page

Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode), jetty. Same thing happens with Solr 4.0-beta and Tomcat too.

Thanks,
Re: scanned pdf with solr cell
> the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using headless mode, but this is likely to trigger an exception. Same if your user is not UI-logged-in.

Hi Paul, thanks for the explanation. So is it nothing to worry about?
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
Hi Erick,

You are so right on the memory calculations. I am happy that I now know that I was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and I want to give an option to browse your system for the latest files. So in order to remove dups (same filename) I used grouping.

Also, when you say sharding: is it okay if I do multiple cores, and does it mean that each core needs a separate Tomcat? I meant to say, can I use the same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box then it won't be great, because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory. Let's say your average file name is 16 bytes long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, I'm presuming Solr 3.x, and 16*2 bytes for the chars). So we're up to 90 bytes/string * (number of distinct file names). Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query... And grouping can also be expensive.

I don't think you really want to group in this case. I'd simply use a filter query, something like:

fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort, which doesn't make much sense -- do you really want individual results returned for _each_ file name? What it looks like to me is you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is? Perhaps there are less memory-consumptive solutions possible.

Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Editing the query... remove smb:. I don't know where it came from while I did copy/paste.

Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi,

I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses SOLR 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, JDK 1.7.0_03 x64.

Data index dir size: 400GB. Metadata of files is stored in it. I have around 15 schema fields. Total number of items: 150 million approx.

I have a scenario which I will try to explain to the best of my knowledge here. Let us consider the fields I am interested in:

Url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
mtm: modified time of the file
Jid: job ID
conv_sort: a string field where the filename is stored

I run a job where the following gets inserted. Total items: 2

Url: C:\personal\A1.txt
mtm: 08/14/2012 12:00:00
Jid: 1
Conv_sort: A1.txt
---
Url: C:\personal\B1.txt
mtm: 08/14/2012 12:01:00
Jid: 1
Conv_sort: B1.txt

In the second run only one item changes:

Url: C:\personal\A1.txt
mtm: 08/15/2012 1:00:00
Jid: 2
Conv_sort: A1.txt

When queried, I would like to return the latest A1.txt and B1.txt back to the end user. I am trying to use grouping with no luck. It keeps throwing OOM… can someone please help… as it is critical for my project.

The query I am trying: under a folder there are 1000 files, and I am putting a filter query param too, asking it to group by filename or url, and none of them work… what am I doing wrong here?

http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307&group=true&group.limit=1&group.field=conv_sort&group.ngroups=true

The stack trace:

SEVERE: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
        at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
        at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
        at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
        at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
        at
Re: scanned pdf with solr cell
On 15 August 2012 at 13:03, Ahmet Arslan wrote:

> Hi Paul, thanks for the explanation. So is it nothing to worry about?

It is nothing to worry about, except to remember that you can't run this step in a daemon-like process. (On Linux, I had to set up a VNC server for similar tasks.)

paul
Re: Switch from Sphinx to Solr - some basics please
> Because I have posted this on Stack Overflow, I don't want duplicate questions there. Can you please read this post: http://stackoverflow.com/questions/11956608/sphinx-user-is-switching-to-solr

Your questions require Sphinx knowledge. I suggest you read these book(s):

http://lucene.apache.org/solr/books.html
http://www.manning.com/hatcher3/

> I have in Sphinx: min_word_len ... How to use this in Solr?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/#solr.LengthFilterFactory
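For a rough equivalent of Sphinx's min_word_len, a LengthFilterFactory in the analyzer chain might look like this (the min/max values here are illustrative):

  <filter class="solr.LengthFilterFactory" min="4" max="255"/>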
Re: Switch from Sphinx to Solr - some basics please
Hi iorixxx,

thanks for the reply. Well, you don't need Sphinx knowledge to answer my questions. I have written what I want:

1. I need to have 2 separate indexes. On Stack Overflow I got the answer that I need to start 2 cores, for example. How many cores can I run in Solr? I have, for example, over 100 different indexes that should be seen as separate data. These indexes should be reindexed at different times, and their data should not be mixed with each other.

You need to understand the following situation: I have, for example, jobs from country A, jobs from country B, and so on up to 100 countries. I need a separate index for each country, because if someone searches for jobs in country A I need to query only the index for country A. How do I solve this problem? Is there a good tutorial? In the Solr wiki it is explained very badly.

2. When I get new data, for example: should I rotate the whole index again, or can I include the new rows and delete the old rows? What is your suggestion?

Thanks
Nik
How to design index for related versioned database records
Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables, and in each table the records are versioned - this means that one real record can exist multiple times in a table, each with different validFrom/validUntil dates. Therefore it is possible to query the valid version of a record for a given point in time.

The relations of the data are something like this:
Employee - LinkTable (=Employment) - Employer - LinkTable (=offered services) - Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on the services they offer at a given point in time. Therefore I have built an index of all employees and employers with their services as a subentity. So I have one index entry for every version of every employee/employer, and each version collects the offered services for the given timeframe of the employee/employer version.

Problem: the offered services of an employee/employer can change during its validity period. That means I need to take into account not only the version timespan of the employee/employer but also the version timespans of the services and the link tables.

***Question***
I think I could continue with my strategy of having an index entry of an employee/employer with its services for any given point in time. But there are many more entries than now, since every involved validFrom/validUntil period (if they overlap) produces more entries. I am not sure if this is a good strategy, or if it would be better to index the whole data structure in some other way.

Are there any recommendations for how to handle such a case?

Thanks for any help
Stephan
Re: Switch from Sphinx to Solr - some basics please
> 1. I need to have 2 separate indexes. On Stack Overflow I got the answer that I need to start 2 cores, for example. How many cores can I run in Solr?

Please see: http://search-lucene.com/m/6rYti2ehFZ82

> I have, for example, jobs from country A, jobs from country B, and so on up to 100 countries. I need a separate index for each country, because if someone searches for jobs in country A I need to query only the index for country A. How do I solve this problem? Is there a good tutorial? In the Solr wiki it is explained very badly.

http://wiki.apache.org/solr/MultipleIndexes talks about different solutions. One big index with fq is an option too.

> 2. When I get new data, for example: should I rotate the whole index again, or can I include the new rows and delete the old rows? What is your suggestion?

I don't understand this. What do you mean by "rotate the whole index"?
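For illustration, a multi-core solr.xml along these lines defines independent per-country indexes in one Solr instance (the core names are made up):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="jobs_country_a" instanceDir="jobs_country_a"/>
      <core name="jobs_country_b" instanceDir="jobs_country_b"/>
      <!-- ...one core per country, each with its own conf/ and data/ -->
    </cores>
  </solr>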
Re: RAMDirectoryFactory bug
Hi, Lance,

Thanks for your reply! It seems as if RAMDirectoryFactory is being passed the correct path to the index, as it's being logged correctly. It just doesn't recognize it as an index.

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Tue, Aug 14, 2012 at 9:57 PM, Lance Norskog goks...@gmail.com wrote:

I can't remember the property name, but there is a Solr Java property that tells where to hunt for the data/ directory. You might be able to work around this bug using that property.

On Tue, Aug 14, 2012 at 1:34 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Hi everyone,

It looks like I found a bug with RAMDirectoryFactory (I know, I know...) It doesn't seem to be able to load files off the disk. Every time it starts up, it logs:

WARNING: [] Solr index directory 'solr/./data/index' doesn't exist. Creating new index...

even if that filesystem path exists and there's a valid index there (verified by switching back to StandardDirectoryFactory).

I experienced this first on our infrastructure on AWS, but I confirmed it by downloading the Solr 3.6.1 distribution fresh, indexing the exampledocs, stopping Jetty, reconfiguring for RAMDirectoryFactory, and restarting Jetty. The statement above gets logged, but otherwise the core comes up OK, but empty.

Should I file a bug?

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

--
Lance Norskog
goks...@gmail.com
Re: How to design index for related versioned database records
The date checking can be implemented using a range query as a filter query, such as:

fq=startDate:[* TO NOW] AND endDate:[NOW TO *]

(You can also use an frange query.)

Then you will have to flatten the database tables. Your Solr schema would have a single merged record type. You will have to decide whether the different record types (tables) will have common fields versus static qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name" and "employer_name". The latter has the advantage that you do not have to separately specify a table-type field, since the fields would be empty for records of other types.

-- Jack Krupansky

-----Original Message-----
From: Stefan Burkard
Sent: Wednesday, August 15, 2012 8:12 AM
To: solr-user@lucene.apache.org
Subject: How to design index for related versioned database records

Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables, and in each table the records are versioned - this means that one real record can exist multiple times in a table, each with different validFrom/validUntil dates. Therefore it is possible to query the valid version of a record for a given point in time.

The relations of the data are something like this:
Employee - LinkTable (=Employment) - Employer - LinkTable (=offered services) - Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on the services they offer at a given point in time. Therefore I have built an index of all employees and employers with their services as a subentity. So I have one index entry for every version of every employee/employer, and each version collects the offered services for the given timeframe of the employee/employer version.

Problem: the offered services of an employee/employer can change during its validity period. That means I need to take into account not only the version timespan of the employee/employer but also the version timespans of the services and the link tables.

***Question***
I think I could continue with my strategy of having an index entry of an employee/employer with its services for any given point in time. But there are many more entries than now, since every involved validFrom/validUntil period (if they overlap) produces more entries. I am not sure if this is a good strategy, or if it would be better to index the whole data structure in some other way.

Are there any recommendations for how to handle such a case?

Thanks for any help
Stephan
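As a minimal sketch of the flattened approach (the field names and types below are assumptions, not from Stefan's schema), each document is one validity slice of one entity:

  <field name="employee_name" type="string" indexed="true" stored="true"/>
  <field name="employer_name" type="string" indexed="true" stored="true"/>
  <field name="service"       type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="startDate"     type="tdate"  indexed="true" stored="true"/>
  <field name="endDate"       type="tdate"  indexed="true" stored="true"/>

With that layout, fq=startDate:[* TO NOW] AND endDate:[NOW TO *] picks the slice that is valid right now, and substituting a fixed date for NOW queries any other point in time.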
Re: scanned pdf with solr cell
You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects.

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 7:07 AM, Paul Libbrecht p...@hoplahup.net wrote:

On 15 August 2012 at 13:03, Ahmet Arslan wrote:

> Hi Paul, thanks for the explanation. So is it nothing to worry about?

It is nothing to worry about, except to remember that you can't run this step in a daemon-like process. (On Linux, I had to set up a VNC server for similar tasks.)

paul
Re: scanned pdf with solr cell
> You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects.

I started Jetty using 'java -Djava.awt.headless=true -jar start.jar' and successfully indexed two pdf files. That icon didn't appear :)

Thanks!
Re: RAMDirectoryFactory bug
On Aug 14, 2012, at 4:34 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

> Hi everyone,
>
> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)

Fair warning - RAMDir use in Solr is like a third-class citizen. You probably should be using the mmap dir anyway. See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

> It doesn't seem to be able to load files off the disk. Every time it starts up, it logs:
>
> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist. Creating new index...
>
> even if that filesystem path exists and there's a valid index there (verified by switching back to StandardDirectoryFactory).

I think it *should* work how you want, so it does sound like a bug perhaps.

> I experienced this first on our infrastructure on AWS, but I confirmed it by downloading the Solr 3.6.1 distribution fresh, indexing the exampledocs, stopping Jetty, reconfiguring for RAMDirectoryFactory, and restarting Jetty. The statement above gets logged, but otherwise the core comes up OK, but empty.
>
> Should I file a bug?

Sure.

> Michael Della Bitta
> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
> www.appinions.com
> Where Influence Isn’t a Game

- Mark Miller
lucidimagination.com
Re: RAMDirectoryFactory bug
Yes, moving to mmap was on our roadmap. I'm in the middle of moving our infrastructure from 1.4 to 3.6.1 and didn't want to make too many changes at the same time. However, this bug might push us over the edge to mmap and away from RAM. I'll file a bug regardless. Thanks!

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 9:05 AM, Mark Miller markrmil...@gmail.com wrote:

On Aug 14, 2012, at 4:34 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

> Hi everyone,
>
> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)

Fair warning - RAMDir use in Solr is like a third-class citizen. You probably should be using the mmap dir anyway. See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

> It doesn't seem to be able to load files off the disk. Every time it starts up, it logs:
>
> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist. Creating new index...
>
> even if that filesystem path exists and there's a valid index there (verified by switching back to StandardDirectoryFactory).

I think it *should* work how you want, so it does sound like a bug perhaps.

> I experienced this first on our infrastructure on AWS, but I confirmed it by downloading the Solr 3.6.1 distribution fresh, indexing the exampledocs, stopping Jetty, reconfiguring for RAMDirectoryFactory, and restarting Jetty. The statement above gets logged, but otherwise the core comes up OK, but empty.
>
> Should I file a bug?

Sure.

- Mark Miller
lucidimagination.com
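For reference, a sketch of that switch in solrconfig.xml, assuming Solr 3.6+ where MMapDirectoryFactory ships in the distribution:

  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>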
RE: Solr 4.0 - Join performance
You would index rectangles of 0 height that have a left edge x of the start time and a right edge x of the end time. You can index a variable number of these per Solr document and then query by either a point or another rectangle to find documents which intersect your query shape. It can't do a completely-within based query, just intersection for now.

I really look forward to seeing this wrapped up in some sort of RangeFieldType so that users don't have to think in spatial terms.

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
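For illustration only - this is not a concrete recipe from David, and the field name "times" is made up - the idea maps a numeric range onto a zero-height rectangle in a non-geo spatial field. A document whose span runs from 100 to 200 would be indexed with the rectangle value

  100 0 200 0

and a query for every document whose span contains the point 150 would look something like

  fq=times:"Intersects(150 0)"

while a rectangle argument such as Intersects(120 0 180 0) would find documents whose spans overlap the range 120-180.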
Re: Index not loading
On Tue, Aug 14, 2012 at 5:37 PM, Jonatan Fournier jonatan.fourn...@gmail.com wrote:

On Tue, Aug 14, 2012 at 10:25 AM, Erick Erickson erickerick...@gmail.com wrote:

> This is quite odd, it really sounds like you're not actually committing. So, some questions.
>
> 1 What happens if you search before you shut down your Tomcat? Do you see docs then? If so, somehow you're doing soft commits and never doing a hard commit.

Yeah, I just realized the behavior is the same as softCommit. Is that the default for commitWithin?

Cheers,
/jonatan

> 2 What happens if, as the last statement in your SolrJ program, you do a commit()?

When using commitWithin, if I introduce server.commit() within the data load process the data gets committed (I didn't reproduce with my 89G of data...). If I shut down my EmbeddedServer, restart it, and send a commit, like on Tomcat, all data gets wiped out too. So I guess there's state loss somewhere.

Cheers,
/jonathan

> 3 While you're indexing, what do you see in your index directory? You should see multiple segments being created, and possibly merged, so the number of files should go up and down. If you only have a single set of files, you're somehow not doing a commit.
>
> 4 Is there something really silly going on, like your restart scripts delete the index directory? Or you're using a VM that restores a blank image?
>
> 5 When you do restart, are there any files at all in your index directory?
>
> I really suspect you've got some configuration problem here
>
> Best
> Erick

On Mon, Aug 13, 2012 at 9:11 AM, Jonatan Fournier jonatan.fourn...@gmail.com wrote:

Hi,

I'm using Solr 4.0.0-ALPHA and the EmbeddedSolrServer. Within my SolrJ application, the documents are added to the server using the commitWithin parameter (in my case 60s). After 1 day my 125 million documents are all added to the server and I can see 89G of index data files.

I stop my SolrJ application and reload my Solr instance in Tomcat. From the Solr admin panel related to my core (collection1) I see this info:

Last Modified:
Num Docs: 0
Max Doc: 0
Version: 1
Segment Count: 0
Optimized: (green check)
Current: (green check)
Master: Version: 0, Gen: 1, Size: 88.14 GB

From the general Core Admin panel I see:

lastModified:
version: 1
numDocs: 0
maxDoc: 0
optimized: (red circle)
current: (green check)
hasDeletions: (red circle)

If I query my index for *:* I get 0 results. If I trigger optimize it wipes ALL my data inside the index and resets it to empty.

I've played around with my EmbeddedServer, initially using autoCommit/softCommit, and it was working fine. Now that I've switched to commitWithin on the document add query, it always does that! I'm never able to reload my index within Tomcat/Solr.

Any idea?

Cheers,
/jonathan
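A minimal sketch of the workaround discussed above - add with commitWithin, but issue an explicit hard commit before shutting the embedded server down. The names follow the SolrJ 4.0 API; the core name and field are illustrative:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Assumes an already-initialized CoreContainer; "collection1" is illustrative.
  SolrServer server = new EmbeddedSolrServer(coreContainer, "collection1");

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc-1");
  server.add(doc, 60000);  // commitWithin 60s

  // Before shutdown, force a hard commit so pending updates are durable on disk.
  server.commit();
  server.shutdown();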
Re: Switch from Sphinx to Solr - some basics please
These do require some Sphinx knowledge. I could answer them on StackOverflow because I converted Chegg from Sphinx to Solr this year.

As I said there, read about Solr cores. They are independent search configurations and indexes within one Solr server: http://wiki.apache.org/solr/CoreAdmin

For your jobs example, I would use filter queries to limit the search to a single country. Filter to country:us or country:de or country:fr and you will only get results from that country.

Solr does not use the term "rotate" for indexes. You can delete with a query, so you could delete all the jobs for one country, reindex those, then commit.

Separate cores are best when you have different kinds of data. At Chegg, we search books and college courses. Those are in different cores and have very different schemas.

wunder

On Aug 15, 2012, at 5:11 AM, nnikolay wrote:

Hi iorixxx,

thanks for the reply. Well, you don't need Sphinx knowledge to answer my questions. I have written what I want:

1. I need to have 2 separate indexes. On Stack Overflow I got the answer that I need to start 2 cores, for example. How many cores can I run in Solr? I have, for example, over 100 different indexes that should be seen as separate data. These indexes should be reindexed at different times, and their data should not be mixed with each other.

You need to understand the following situation: I have, for example, jobs from country A, jobs from country B, and so on up to 100 countries. I need a separate index for each country, because if someone searches for jobs in country A I need to query only the index for country A. How do I solve this problem? Is there a good tutorial? In the Solr wiki it is explained very badly.

2. When I get new data, for example: should I rotate the whole index again, or can I include the new rows and delete the old rows? What is your suggestion?

Thanks
Nik

--
Walter Underwood
wun...@wunderwood.org
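A sketch of the delete-then-reindex cycle Walter describes, for one country (the field name country follows his example; the URL assumes the default update handler):

  curl "http://localhost:8983/solr/update?commit=false" -H "Content-Type: text/xml" \
       --data-binary "<delete><query>country:us</query></delete>"
  # ...re-post the fresh US jobs here...
  curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
       --data-binary "<add>...</add>"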
Re: Duplicated facet counts in solr 4 beta: user error
No problem, and thanks for posting the resolution.

If you have the time and energy, anyone can edit the Wiki if you create a logon, so any clarification you'd like to provide to keep others from having this problem would be most welcome!

Best
Erick

On Tue, Aug 14, 2012 at 6:13 PM, Buttler, David buttl...@llnl.gov wrote:

Here are my steps:
1) Download apache-solr-4.0.0-BETA
2) Untar into a directory
3) cp -r example example2
4) cp -r example exampleB
5) cp -r example example2B
6) cd example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
7) cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
8) cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
9) cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
10) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22

14 results returned. This is correct.

Let's try a slightly more circuitous route by running through the solr tutorial first:
1) Download apache-solr-4.0.0-BETA
2) Untar into a directory
3) cd example; java -jar start.jar
4) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
5) kill jetty server
6) cp -r example example2
7) cp -r example exampleB
8) cp -r example example2B
9) cd example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
10) cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
11) cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
12) cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
13) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

With the same query as above, 22 results are returned. Looking at this, it is somewhat obvious what is happening: the index was copied over from the tutorial and was not cleaned up before running the cloud examples. Adding the debug=query parameter to the query URL produces the following:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>cat:"electronics"</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>cat:electronics</str>
  </arr>
</lst>

So, Erick's diagnosis is correct: pilot error. However, the straightforward path through the tutorial and on to SolrCloud makes it easy to make this mistake. Maybe a small warning in the SolrCloud page would help?

Now, running a delete operation fixes things:

cd example/exampledocs; java -Dcommit=false -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

causes the number of results to be zero. So, let's reload the data:

java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

Now the number of results for our query

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22

is back to the correct 14 results.

Dave

PS apologies for hijacking the thread earlier.
Re: Facet sort numeric values
The problem you're running into is that lexical ordering of numeric data != numeric ordering. If you have mixed alpha and numeric data, you may not care if the alpha stuff is first, i.e.

asdb456
asdf490

sorts fine. Problems happen with

9jsdf
100ukel

where 100ukel comes first. So if you have a mixed alpha and numeric situation, you have to either live with it or normalize the numeric data so that lexical ordering == numeric ordering. The most common way is to left-pad numeric data to a fixed width, i.e. rather than index asb9fg, index asb009fg. Of course you have to know the maximum number of digits for this to work...

Best
Erick

On Wed, Aug 15, 2012 at 12:33 AM, Aleksander Akerø solraleksan...@gmail.com wrote:

Oh brilliant, didn't think of it being possible to configure it that way. I had made my own untokenized type, so I guess it would be better for me to control the datatype this way.

Bonus question (hehe): what if these field values also contain alphanumeric values, e.g. Alpha, Bravo, Omega, ...? How would this affect the sorting? I guess the TrieIntField is not applicable then.

Aleksander Akerø @ Gurusoft AS
Mobil: 944 89 054

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: 14. august 2012 17:45
To: solr-user@lucene.apache.org
Subject: Re: Facet sort numeric values

: I'm having a problem with sorting facets. I am using the facet.sort=index
: parameter and it works fine for most of the values.
	...
: Example, when sorting 15, 6, 23, 7, 10, 90 it sorts like this: 10, 15,
: 23, 6, 7, 90, but what I wanted was 6, 7, 10, 15, 23, 90.

What field type are you using? If you use one of the Trie___Field types then the facet values should sort exactly as you describe.

  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

-Hoss
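As a minimal sketch of the left-padding Erick describes (the class name and width are arbitrary), run the values through something like this before indexing:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Left-pads every run of digits to a fixed width so that lexical order
  // matches numeric order, e.g. "9jsdf" -> "000009jsdf", "100ukel" -> "000100ukel".
  public class NumericPadder {
      private static final Pattern DIGITS = Pattern.compile("\\d+");

      public static String pad(String value, int width) {
          Matcher m = DIGITS.matcher(value);
          StringBuffer sb = new StringBuffer();
          while (m.find()) {
              String padded = String.format("%" + width + "s", m.group()).replace(' ', '0');
              m.appendReplacement(sb, Matcher.quoteReplacement(padded));
          }
          m.appendTail(sb);
          return sb.toString();
      }

      public static void main(String[] args) {
          System.out.println(pad("9jsdf", 6));   // 000009jsdf
          System.out.println(pad("100ukel", 6)); // 000100ukel -- now sorts after 9jsdf
      }
  }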
Re: Solr 3.5 result grouping is failing
Please attach the results of adding debugQuery=on to your query in both the success and failure cases; there's very little information to go on here. You might review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Aug 15, 2012 at 12:57 AM, chethan chethan.p...@gmail.com wrote:

Hi,

I'm trying to group (field collapse) my search results on a field called "site". The schema says it is indexed: <field name="site" type="string" stored="false" indexed="true"/>. But when I query with group.field=site&group.limit=100, I see only 1 group of results being returned, and the group value is null.

This seems to work on another Solr instance which only has a few documents indexed; it seems to fail on bigger indexes. Help is appreciated.

Thanks
Chethan

Sent this message again as it seemed to bounce the first time.
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
No, sharding into multiple cores on the same machine is still limited by the physical memory available. It's still lots of stuff on a limited box.

But try backing up and re-thinking the problem a bit. Some possibilities off the top of my head:

1 have a new field "current". When you update a doc, reindex the old doc with current=0 and put current=1 in the new doc (boolean field). Getting one and only one is really simple. (A sketch follows at the end of this message.)

2 Use external file fields (EFF) for the same purpose; that won't require you to re-index the doc. The trick here is you use the value in the EFF as a multiplier for the score (that's what function queries do). So older versions of the doc have scores of 0 and just don't show up.

3 Implement a custom collector that replaces older hits with newer hits. Actually I don't particularly like this, because it would potentially replace a higher-scoring document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach for this problem.

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi Erick,

You are so right on the memory calculations. I am happy that I now know that I was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and I want to give an option to browse your system for the latest files. So in order to remove dups (same filename) I used grouping.

Also, when you say sharding: is it okay if I do multiple cores, and does it mean that each core needs a separate Tomcat? I meant to say, can I use the same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box then it won't be great, because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory. Let's say your average file name is 16 bytes long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, I'm presuming Solr 3.x, and 16*2 bytes for the chars). So we're up to 90 bytes/string * (number of distinct file names). Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query... And grouping can also be expensive.

I don't think you really want to group in this case. I'd simply use a filter query, something like:

fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort, which doesn't make much sense -- do you really want individual results returned for _each_ file name? What it looks like to me is you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is? Perhaps there are less memory-consumptive solutions possible.
Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Editing the query... remove smb:. I don't know where it came from while I did copy/paste.

Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi,

I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses SOLR 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, JDK 1.7.0_03 x64.

Data index dir size: 400GB. Metadata of files is stored in it. I have around 15 schema fields. Total number of items: 150 million approx.

I have a scenario which I will try to explain to the best of my knowledge here. Let us consider the fields I am interested in:

Url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
mtm: modified time of the file
Jid: job ID
conv_sort: a string field where the filename is stored

I run a job where the following gets inserted. Total items: 2

Url: C:\personal\A1.txt
mtm: 08/14/2012 12:00:00
Jid: 1
Conv_sort: A1.txt
---
Url: C:\personal\B1.txt
mtm: 08/14/2012 12:01:00
Jid: 1
Conv_sort: B1.txt

In the second run only one item changes:

Url: C:\personal\A1.txt
mtm: 08/15/2012 1:00:00
Jid: 2
Conv_sort: A1.txt

When queried, I would like to return the latest A1.txt and B1.txt back to the end user. I am trying to use grouping with no luck. It keeps throwing OOM… can someone please help… as it is critical for my project. The query I am trying is: under a folder there are 1000 files, and I am putting a filtered
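A minimal sketch of Erick's option 1 above in SolrJ - the field names (url, jid, current) follow the schema described in this thread, but the exact update flow is illustrative:

  import org.apache.solr.common.SolrInputDocument;

  // Re-add the newest version flagged current=1...
  SolrInputDocument newDoc = new SolrInputDocument();
  newDoc.addField("url", "C:\\personal\\A1.txt");
  newDoc.addField("jid", 2);
  newDoc.addField("current", 1);
  server.add(newDoc);

  // ...and re-add the previous version flagged current=0.
  // At query time, fq=current:1 then returns exactly one doc per file.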
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
Hey solr-user, are you by chance indexing LineStrings? That is something I never tried with this spatial index. Depending on which iteration of LSP you are using, I figure you'd either end up indexing a vast number of points along the line, which would be slow to index and make the index quite big, or you might end up with a geohash granularity that looks like a very blocky (i.e. pixelated) approximation of the line that is much coarser and will thus cause searches near the line to match the line.

I don't have this use-case in my work so I haven't put that much thought into handling lines -- I just do points, polygons, circles, and rects.

~ David

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
Does DataImportHandler do any sanitizing?
I am pulling some fields from a MySQL database using DataImportHandler, and some of them have invalid XML in them. Does DataImportHandler do any kind of filtering/sanitizing to ensure that it will go in OK, or is it all on me?

Example bad data: orphaned ampersands (Peanut Butter & Jelly), curly quotes (we’re)

-jsd-
Re: Does DataImportHandler do any sanitizing?
Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest of Solr via XML, so it shouldn't be a problem...

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman jdruk...@gmail.com wrote:

I am pulling some fields from a MySQL database using DataImportHandler, and some of them have invalid XML in them. Does DataImportHandler do any kind of filtering/sanitizing to ensure that it will go in OK, or is it all on me?

Example bad data: orphaned ampersands (Peanut Butter & Jelly), curly quotes (we’re)

-jsd-
custom complex field - PolyField
Hi,

I have to index a tuple like ('blah', 'more blah info') in a multivalued field type. I have read about the PolyField type and it seems the best solution so far, but I can't find documentation on how to use or implement a custom field. Any help is appreciated.

--
Leonardo S Souza
solr.xml entries got deleted when powered off
Hello,

I created an index => all the schema.xml & solrconfig.xml files are created with content (I checked that they have contents in the xml files). But if I power off the system & restart again, the contents of the files are gone - they're like 0-byte files. Even the solr.xml file, which got updated when I created a new index (with a core), has 0 bytes & all the previous entries are lost too.

I'm using Solr 4.0. Does anyone have any idea about the scenarios where this might happen?

Thanks.
Re: solr.xml entries got deleted when powered off
Just guessing: disk full?

--
Regards,
Leonardo S Souza

2012/8/15 vempap phani.vemp...@emc.com

Hello,

I created an index => all the schema.xml & solrconfig.xml files are created with content (I checked that they have contents in the xml files). But if I power off the system & restart again, the contents of the files are gone - they're like 0-byte files. Even the solr.xml file, which got updated when I created a new index (with a core), has 0 bytes & all the previous entries are lost too.

I'm using Solr 4.0. Does anyone have any idea about the scenarios where this might happen?

Thanks.
Re: solr.xml entries got deleted when powered off
Nope... there is a good amount of space left on disk.
Re: solr.xml entries got deleted when powered off
It's happening when I'm not doing a clean shutdown. Are there any more scenarios where it might happen?
RE: solr.xml entries got deleted when powered off
You are not putting these files in /tmp, are you? That is sometimes wiped by different OS's on shutdown.

-----Original Message-----
From: vempap [mailto:phani.vemp...@emc.com]
Sent: Wednesday, August 15, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.xml entries got deleted when powered off

It's happening when I'm not doing a clean shutdown. Are there any more scenarios where it might happen?
RE: solr.xml entries got deleted when powered off
No, I'm not keeping them in /tmp.
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
: 2 Use external file fields (EFF) for the same purpose, that
: won't require you to re-index the doc. The trick
: here is you use the value in the EFF as a multiplier
: for the score (that's what function queries do). So older
: versions of the doc have scores of 0 and just don't
: show up.

or use it in an fq={!frange ...} to eliminate the older versions completely.

: I will back up and let you know the use case. I am tracking file
: versions. And I want to give an option to browse your system for the
: latest files. So in order to remove dups (same filename) I used
: grouping.

Based on only knowing that sentence, my starting suggestion would be to have two indexes: one where the filename is the unique key, so only the most current versions of files are listed, and one where there is no unique key (or you use whatever key you use today) that lets you do the full historical archive search, and query whichever index makes sense for each user action.

-Hoss
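For illustration, with a current-version flag of 1 in the external file field (the field name version_flag is made up), the frange filter Hoss mentions would look something like:

  fq={!frange l=1 u=1}version_flag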
Re: Atomic Multicore Operations - E.G. Move Docs
Haven't managed to find a good way to do this yet. Does anyone have any ideas on how I could implement this feature? Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball nicholas.b...@nodelay.com wrote:

That could work, but then how do you ensure commit is called on the two cores at the exact same time?

Cheers,
Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
Re: Atomic Multicore Operations - E.G. Move Docs
On 2012-7-2 at 6:37 PM, Nicholas Ball nicholas.b...@nodelay.com wrote:

> That could work, but then how do you ensure commit is called on the two cores at the exact same time?

That may need something like two-phase commit in a relational DB. Lucene has prepareCommit, but to implement 2PC many things need to be done.

> Also, any way to commit a specific update rather than all the back-logged ones?
>
> Cheers,
> Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
Re: Atomic Multicore Operations - E.G. Move Docs
Do you really need this? Distributed transactions are a difficult problem. In 2PC, every node can fail, including the coordinator. Something like leader election is needed to make sure it works; you could try ZooKeeper. But if the transaction is not very, very important (like transferring money in a bank), you can do it like this.

coordinator:

On 2012-8-16 at 7:42 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Haven't managed to find a good way to do this yet. Does anyone have any ideas on how I could implement this feature? Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball nicholas.b...@nodelay.com wrote:

That could work, but then how do you ensure commit is called on the two cores at the exact same time?

Cheers,
Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
Re: Atomic Multicore Operations - E.G. Move Docs
http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit

On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Haven't managed to find a good way to do this yet. Does anyone have any ideas on how I could implement this feature? Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball nicholas.b...@nodelay.com wrote:

That could work, but then how do you ensure commit is called on the two cores at the exact same time?

Cheers,
Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
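As a sketch of Lance's simpler commit-both-or-rollback-both approach (no true 2PC; the core names and id field are made up), the coordinating client could look like this in SolrJ:

  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class TwoCoreMove {
      // Best-effort atomic move: commit both cores only if both updates succeed.
      public static void moveDocs(List<SolrInputDocument> docs) throws Exception {
          SolrServer src = new HttpSolrServer("http://localhost:8983/solr/coreA");
          SolrServer dst = new HttpSolrServer("http://localhost:8983/solr/coreB");
          try {
              dst.add(docs);                                       // copy into the destination core
              for (SolrInputDocument d : docs) {
                  src.deleteById((String) d.getFieldValue("id"));  // remove from the source core
              }
              src.commit();   // a failure window remains between these two commits --
              dst.commit();   // true atomicity needs 2PC coordination (see link above)
          } catch (Exception e) {
              src.rollback();  // discard uncommitted updates on both cores
              dst.rollback();
              throw e;
          }
      }
  }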
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
Awesome, thanks a lot. I am already on it with option 1. We need to track deletes to flip the previous one back to being the current.

Erick Erickson erickerick...@gmail.com wrote:

No, sharding into multiple cores on the same machine is still limited by the physical memory available. It's still lots of stuff on a limited box.

But try backing up and re-thinking the problem a bit. Some possibilities off the top of my head:

1 have a new field "current". When you update a doc, reindex the old doc with current=0 and put current=1 in the new doc (boolean field). Getting one and only one is really simple.

2 Use external file fields (EFF) for the same purpose; that won't require you to re-index the doc. The trick here is you use the value in the EFF as a multiplier for the score (that's what function queries do). So older versions of the doc have scores of 0 and just don't show up.

3 Implement a custom collector that replaces older hits with newer hits. Actually I don't particularly like this, because it would potentially replace a higher-scoring document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach for this problem.

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi Erick,

You are so right on the memory calculations. I am happy that I now know that I was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and I want to give an option to browse your system for the latest files. So in order to remove dups (same filename) I used grouping.

Also, when you say sharding: is it okay if I do multiple cores, and does it mean that each core needs a separate Tomcat? I meant to say, can I use the same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box then it won't be great, because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory. Let's say your average file name is 16 bytes long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, I'm presuming Solr 3.x, and 16*2 bytes for the chars). So we're up to 90 bytes/string * (number of distinct file names). Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query... And grouping can also be expensive.

I don't think you really want to group in this case. I'd simply use a filter query, something like:

fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort, which doesn't make much sense -- do you really want individual results returned for _each_ file name? What it looks like to me is you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is? Perhaps there are less memory-consumptive solutions possible.

Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Editing the query... remove smb:. I don't know where it came from while I did copy/paste.

Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi,

I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses SOLR 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, JDK 1.7.0_03 x64.

Data index dir size: 400GB. Metadata of files is stored in it. I have around 15 schema fields. Total number of items: 150 million approx.

I have a scenario which I will try to explain to the best of my knowledge here. Let us consider the fields I am interested in:

Url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
mtm: modified time of the file
Jid: job ID
conv_sort: a string field where the filename is stored

I run a job where the following gets inserted. Total items: 2

Url: C:\personal\A1.txt
mtm: 08/14/2012 12:00:00
Jid: 1
Conv_sort: A1.txt
---
Url: C:\personal\B1.txt
mtm: 08/14/2012 12:01:00
Jid: 1
Conv_sort: B1.txt

In the second run only one item changes:

Url: C:\personal\A1.txt
mtm: 08/15/2012 1:00:00
Jid: 2
Conv_sort: A1.txt

When queried, I would like to return the latest A1.txt and B1.txt back to the end user. I am trying to use grouping with no luck.
Re: Does DataImportHandler do any sanitizing?
If you want to sanitize them during indexing, the regular expression tools can do this. You would create a regular expression that matches the bogus elements. There is a regular expression transformer in the DIH, and a regular expression CharFilter in the Lucene text analysis stack.

On Wed, Aug 15, 2012 at 2:10 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest of Solr via XML, so it shouldn't be a problem...

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman jdruk...@gmail.com wrote:

I am pulling some fields from a MySQL database using DataImportHandler, and some of them have invalid XML in them. Does DataImportHandler do any kind of filtering/sanitizing to ensure that it will go in OK, or is it all on me?

Example bad data: orphaned ampersands (Peanut Butter & Jelly), curly quotes (we’re)

-jsd-

--
Lance Norskog
goks...@gmail.com
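For illustration, both variants Lance mentions, with made-up entity/field/column names: a RegexTransformer rule in the DIH data-config.xml that normalizes curly apostrophes, and the equivalent PatternReplaceCharFilterFactory in an analyzer chain:

  <entity name="item" transformer="RegexTransformer"
          query="SELECT id, description FROM items">
    <field column="description" sourceColName="description"
           regex="’" replaceWith="'"/>
  </entity>

  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="’" replacement="'"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>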