Solr startup script in version 4.10.3

2015-01-06 Thread Dominique Bejean
Hi,

In release 4.10.3, the following lines were removed from the Solr start
script (bin/solr):

# TODO: see SOLR-3619, need to support server or example
# depending on the version of Solr
if [ -e $SOLR_TIP/server/start.jar ]; then
  DEFAULT_SERVER_DIR=$SOLR_TIP/server
else
  DEFAULT_SERVER_DIR=$SOLR_TIP/example
fi

However, the usage message still says:

  -d dir  Specify the Solr server directory; defaults to server


Either the usage message has to be fixed or the removed lines put back
into the script.

Personally, I prefer defaulting to the server directory.

My installation process for getting a clean, empty Solr instance is to
copy example into server and remove directories like example-DIH,
example-schemaless, multicore and solr/collection1.

The Solr server (or node) can then be started without the -d parameter.
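
A minimal sketch of that setup (paths assume the stock 4.10.x layout, with
SOLR_TIP pointing at the install directory):

  cp -r "$SOLR_TIP/example" "$SOLR_TIP/server"
  rm -rf "$SOLR_TIP/server/example-DIH" \
         "$SOLR_TIP/server/example-schemaless" \
         "$SOLR_TIP/server/multicore" \
         "$SOLR_TIP/server/solr/collection1"

With the old default restored, bin/solr start would then pick up server/
without any -d.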

If this makes sense, a Jira issue could be opened.

Dominique
http://www.eolya.fr/


Re: FOSDEM Open source search devroom

2015-01-06 Thread Charlie Hull

On 02/01/2015 08:37, Bram Van Dam wrote:

Hi folks,

There will be an Open source search devroom[1] at this year's FOSDEM in
Brussels, 31st of January & 1st of February.

I don't know if there will be a Lucene/Solr presence (there's no
schedule for the dev room yet), but this seems like a good place to meet
up and talk shop.

I'll be there, and I hope some of you will as well.


Sadly I won't, but my colleague (and committer) Alan Woodward will, 
talking about text search for stream processing.


C


  - Bram

[1] https://fosdem.org/2015/schedule/track/open_source_search/




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Vertical search Engine

2015-01-06 Thread klunwebale
Hello,

I want to create a vertical search engine like trovit.com.

I have installed Solr and Solarium.

What else do I need? Can you recommend a suitable crawler,
and how should I structure my data to be indexed?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Vertical search Engine

2015-01-06 Thread Furkan KAMACI
Hi,

You should estimate the size of the data you will index before you decide
on a crawler. Crawlers are out of scope for this mailing list. If you will
crawl a large amount of data, you can check the Apache Nutch user list.

Furkan KAMACI

2015-01-06 10:39 GMT+02:00 klunwebale klunweb...@gmail.com:

 hello

 i want to create a vertical search engine like trovit.com.

 I have installed solr  and solarium.

 What else to i need can you recommend a suitable crawler
 and how to structure my data to be indexed



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
 Sent from the Solr - User mailing list archive at Nabble.com.



edismax with multiple words for keyword tokenizer splitting on space

2015-01-06 Thread Sankalp Gupta
Hi,
I came across this weird behaviour in Solr, and I'm not sure why it is
intended. I have filed this on Stack Overflow. Please check:
http://stackoverflow.com/questions/27795177/edismax-with-multiple-words-for-keyword-tokenizer-splitting-on-space

Thanks
Sankalp Gupta


Re: Running Multiple Solr Instances

2015-01-06 Thread Shawn Heisey
On 1/5/2015 9:31 PM, Nishanth S wrote:
 I am running multiple Solr instances (Solr 4.10.3 on Tomcat 8). There are
 3 physical machines and I have 4 Solr instances running on each machine
 on ports 8080, 8081, 8082 and 8083. The setup is fine up to this point. Now I
 want to point each of these instances to a different index directory. The
 drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. Now if I
 define /d/1 as the Solr home, all Solr index directories are created in /d/1
 whereas the other drives remain unused. So how do I configure Solr to
 make use of all the drives so that I can get maximum storage for Solr? I
 would really appreciate any help in this regard.

You should only run one Solr instance per machine.  One instance can
handle as many indexes as you want to run.  Running multiple instances
will waste a fair amount of system resources, and will also make the
entire setup a lot more complicated than it needs to be.

If you don't plan on setting up RAID (which would probably be a lot
easier to manage), here's an idea:

Set up the solr home somewhere on the root filesystem, then create
symlinks under that which will be the instance directories, pointed to
various directories under your other mount points.  When Solr starts, it
should begin core detection at the solr home and follow those symlinks
into the other locations.  I'm not aware of any problems with using
symlinks in this way.
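
A minimal sketch of that layout (mount points and core names are
hypothetical):

  mkdir -p /opt/solr-home                  # solr home on the root filesystem
  mkdir -p /d/1/core1 /d/2/core2 /d/3/core3 /d/4/core4
  ln -s /d/1/core1 /opt/solr-home/core1    # instance dirs symlinked into home
  ln -s /d/2/core2 /opt/solr-home/core2
  ln -s /d/3/core3 /opt/solr-home/core3
  ln -s /d/4/core4 /opt/solr-home/core4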

If you're running SolrCloud, that can be a little more complicated,
because creating a new collection from scratch will create the cores
under the solr home ... but you can move them and symlink them after
they're created, then either reload the collection or restart Solr.
Just be sure that no indexing is happening when you begin the move.

Thanks,
Shawn



Re: Running Multiple Solr Instances

2015-01-06 Thread Michael Della Bitta

I would do one of either:

1. Set a different Solr home for each instance. I'd use the 
-Dsolr.solr.home=/d/2 command line switch when launching Solr to do so.


2. RAID 10 the drives. If you expect the Solr instances to get uneven 
traffic, pooling the drives will allow a given Solr instance to share 
the capacity of all of them.
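
For option 1, a rough sketch under Tomcat (the ports are from the original
post; the setenv.sh path is an assumption):

  # In each Tomcat instance's bin/setenv.sh, point at a different drive:
  export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/d/1"   # instance on 8080
  # ... and /d/2, /d/3, /d/4 for the instances on 8081, 8082 and 8083.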


On 1/5/15 23:31, Nishanth S wrote:

Hi folks,

I  am running  multiple solr instances  (Solr 4.10.3 on tomcat 8).There are
3 physical machines and  I have 4 solr instances running  on each machine
on ports  8080,8081,8082 and 8083.The set up is well up to this point.Now I
want to point each of these instance to a different  index directories.The
drives in the machines are mounted as d/1,d/2,d/3 ,d/4 etc.Now if I define
/d/1 as  the solr home all solr index directories  are created in /d/1
where as the other drives remain un used.So how do I configure solr to
  make use of all the drives so that I can  get maximum storage for solr.I
would really appreciate any help in this regard.

Thanks,
Nishanth





Re: edismax with multiple words for keyword tokenizer splitting on space

2015-01-06 Thread Jack Krupansky
You need to escape the space in your query (using backslash or quotes
around the term) - the query parser doesn't parse based on the
analyzer/tokenizer for each field.
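
For example, with a hypothetical keyword-tokenized field named "brand",
either of these keeps the space from splitting the term:

  q=brand:Samsung\ Galaxy
  q=brand:"Samsung Galaxy"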

-- Jack Krupansky

On Tue, Jan 6, 2015 at 4:05 AM, Sankalp Gupta sankalp.gu...@snapdeal.com
wrote:

 Hi
 I come across this weird behaviour in solr. I'm not sure that why this is
 desired in solr. I have filed this on stackoverflow. Please check

 http://stackoverflow.com/questions/27795177/edismax-with-multiple-words-for-keyword-tokenizer-splitting-on-space

 Thanks
 Sankalp Gupta



Re: Vertical search Engine

2015-01-06 Thread Jack Krupansky
Consider the Fusion product from LucidWorks:
http://lucidworks.com/product/fusion/

Structuring of your data should be driven by your queries and access
patterns - what are the most common queries and what are the most extreme
and complex queries that you expect to handle, both in terms of how the
queries are expressed and the results being returned.

-- Jack Krupansky

On Tue, Jan 6, 2015 at 3:39 AM, klunwebale klunweb...@gmail.com wrote:

 hello

 i want to create a vertical search engine like trovit.com.

 I have installed solr  and solarium.

 What else to i need can you recommend a suitable crawler
 and how to structure my data to be indexed



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to limit the number of result sets of the 'export' handler

2015-01-06 Thread Alexandre Rafalovitch
Export was specifically designed to get everything, which is very
expensive otherwise.

If you just want the subset, you might be better off with normal
queries and/or with deep paging (cursor).
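
A minimal cursor sketch (collection name and uniqueKey are assumptions;
cursorMark requires a sort on the uniqueKey):

  curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&sort=id+asc&cursorMark=*&wt=json"
  # Pass the returned nextCursorMark as cursorMark on the next request;
  # repeat until it stops changing.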

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 6 January 2015 at 00:30, Sandy Ding sandy.ding...@gmail.com wrote:
 Using rows=xxx doesn't seem to work.
 Is there a way to do this?


Re: Vertical search Engine

2015-01-06 Thread Ahmet Arslan
Hi,

http://manifoldcf.apache.org is another option to consider. 
It is useful for crawling protected pages.

Free resources :

http://www.manning.com/wright/ManifoldCFinAction_manuscript.pdf
https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
 
Ahmet 



On Tuesday, January 6, 2015 1:56 PM, Jack Krupansky jack.krupan...@gmail.com 
wrote:
Consider the Fusion product from LucidWorks:
http://lucidworks.com/product/fusion/

Structuring of your data should be driven by your queries and access
patterns - what are the most common queries and what are the most extreme
and complex queries that you expect to handle, both in terms of how the
queries are expressed and the results being returned.

-- Jack Krupansky


On Tue, Jan 6, 2015 at 3:39 AM, klunwebale klunweb...@gmail.com wrote:

 hello

 i want to create a vertical search engine like trovit.com.

 I have installed solr  and solarium.

 What else to i need can you recommend a suitable crawler
 and how to structure my data to be indexed



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
 Sent from the Solr - User mailing list archive at Nabble.com.



IstvanKulcsar - Wiki Solr

2015-01-06 Thread ikulcsar

Hi,

I would like to suggest some pages which use Solr and were developed by my company.

Please add these pages to this site:
http://wiki.apache.org/solr/PublicServers

http://www.odrportal.hu/kereso/
http://idea.unideb.hu/idealista/
http://www.jobmonitor.hu
http://www.profession.hu/
http://webicina.com/
http://www.cylex.hu/

Thanks for your answer.

STeve


RE: Frequent deletions

2015-01-06 Thread Amey Jadiye
Well, we are doing the same thing (in a way). We have to do frequent deletions
in bulk; at a time we are deleting around 20M+ documents. All I am doing is,
after the deletion, firing the command below on each of our Solr nodes and
waiting patiently, as it takes a long time.

curl -vvv "http://node1.solr.x.com/collection1/update?optimize=true&distrib=false" > /tmp/__solr_clener_log

After the optimize finishes, curl returns the XML below:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">10268995</int></lst>
</response>

Regards,
Amey

 Date: Wed, 31 Dec 2014 02:32:37 -0700
 From: inna.gel...@elbitsystems.com
 To: solr-user@lucene.apache.org
 Subject: Frequent deletions
 
 Hello,
 We perform frequent deletions from our index, which greatly increases the
 index size.
 How can we perform an optimization in order to reduce the size.
 Please advise,
 Thanks.
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
 Sent from the Solr - User mailing list archive at Nabble.com.
  

RE: .htaccess / password

2015-01-06 Thread Ganesh.Yadav
Craig,

1. What is the .htaccess file meant for?

2. What are the contents of this file?

3. How will you, or how will Solr, know that it needs to look for this file to
bring the needed security to this (which) area?

4. What event causes you to re-index the engine every night?



Please share



Thanks

G



-Original Message-
From: Craig Hoffman [mailto:choff...@eclimb.net]
Sent: Tuesday, January 06, 2015 12:29 PM
To: Apache Solr
Subject: .htaccess / password



Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/ will
Solr continue to function properly? One thing to note, I will have a CRON job
that runs nightly that re-indexes the engine. In a nutshell I’m looking for a
way to secure this area.



Thanks,

Craig

--

Craig Hoffman

w: http://www.craighoffmanphotography.com

FB: www.facebook.com/CraigHoffmanPhotography

TW: https://twitter.com/craiglhoffman

facet.contains

2015-01-06 Thread Will Butler
https://issues.apache.org/jira/browse/SOLR-1387 
https://issues.apache.org/jira/browse/SOLR-1387 contains a patch to support 
facet.contains and facet.contains.ignoreCase, making it possible to easily 
filter facet results without the facet.prefix limitations. I know that it is 
possible to approximate this using two fields, searching against one and 
displaying values from the other. However, this also has its limitations, and 
doesn’t work properly for multi-valued fields. Are there any other suggestions 
for being able to display facet values and counts based on a user supplied 
string?

Thanks,

Will

Re: htaccess

2015-01-06 Thread Gora Mohanty
Hi,

Your message seems quite confused (even the URL is not right for most
normal Solr setups), and it is not clear what you mean by "function
properly". Solr is a search engine, and has no idea about .htaccess files.

Are you asking whether Solr respects directives in .htaccess files? I am
pretty sure that cannot be the case.

With regards to Solr security, it is again normally not a Solr concern.
Please start from https://wiki.apache.org/solr/SolrSecurity

No offence, but it seems that your real concerns might lie elsewhere.
Please take a look at http://people.apache.org/~hossman/#xyproblem

Please do follow up on this list if your questions have not been addressed.

Regards,
Gora


On 6 January 2015 at 23:28, Craig Hoffman choff...@eclimb.net wrote:

 Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
 will Solr continue to function properly? One thing to note, I will have a
 CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
 looking for a way to secure this area.

 Thanks,
 Craig
 --
 Craig Hoffman
 w: http://www.craighoffmanphotography.com
 FB: www.facebook.com/CraigHoffmanPhotography
 TW: https://twitter.com/craiglhoffman


RE: Solr Memory Usage - How to reduce memory footprint for solr

2015-01-06 Thread Toke Eskildsen
Abhishek Sharma [abhishe...@unbxd.com] wrote:

 *Q* - I am forced to set Java Xmx as high as 3.5g for my solr app.. If i
 keep this low, my CPU hits 100% and response time for indexing increases a
 lot.. And i have hit OOM Error as well when this value is low..

[...]

   2. Index Size - 2 g
   3. num. of Search Hits per sec - 10 [*IMP* - All search queries have
   faceting..]

Faceting is often the reason for high memory usage. If you are not already 
doing so, do enable DocValues for the fields you are faceting on. If you have a 
lot of unique values in your facets (millions), you might also consider 
limiting the amount of concurrent searches.
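
Enabling DocValues is a schema change plus a full reindex. A hypothetical
schema.xml line for a faceted string field (the field name is an assumption):

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>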

Still, 3.5GB heap seems like quite a bit for a 2GB index. How many documents do 
you have?

- Toke Eskildsen


RE: PDF search functionality using Solr Schema.xml and SolrConfig.xml question

2015-01-06 Thread Ganesh.Yadav
Still looking for an answer on schema.xml and solrconfig.xml:

1. Do I need to tell Solr to extract the Title from a PDF - that is, go look
for the Title keyword, extract the entire line after the tag, collect all
such occurrences from hundreds of PDFs, and build and index the Title column
data?

2. How do I define my own schema to Solr?

3. Say I defined my fields Title, Ticket_number, Submitter, Client and so on;
how can I verify the respective data is extracted into specific columns in
Solr and indexed? Any suggestion on which Analyzer, Tokenizer and Filter will
help for this purpose?

Some more context:

1. I do not want to dump the entire 4 GB of PDF contents into one searchable
field (ATTR_CONTENT) in Solr.

2. Even if the entire PDF content is extracted into the above field by
default, I still want to extract specific searchable column data into the
respective fields.

3. Rather, I want to configure Solr to have column-wise searchable contents
such as Title, Number, and so on.

Any suggestions on performance? The PDF database is 80 GB; will it be fast
enough? Do I need to divide it across multiple cores and multiple machines?
And multiple web apps? And clustering?

I should have mentioned that my PDFs are from a ticketing system like Jira,
which was retired from production long ago; all I have is the ticketing
system's PDF database.

4. My system will be used internally by just a select few people.

5. They can wait for a 4 GB PDF to get loaded.

6. I agree many matches will be found in one large PDF, depending on the
search criteria.

7. To make searches faster, I want Solr to create more columns and
column-based indexes.

8. Solr underneath uses Tika, which extracts the contents and strips all the
rich-content formatting characters present in the PDF document.

9. I believe the resulting extraction is about 1/5th the size of the original
PDF - just a guess based on one sample extraction.



From: Jürgen Wagner (DVT) [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr

Hello,
  no matter which search platform you will use, this will pose two challenges:

- The size of the documents will render search less and less useful, as the
likelihood of matches increases with document size. So, without proper
semantic extraction (e.g., using decent NER or relationship extraction with a
commercial text mining product), I doubt you will get the precision required
to make this overly useful.

- PDFs can have their own character sets based on the characters actually used.
Such file-specific character sets are almost impossible to parse, i.e., if your
PDFs happen to use this feature of the PDF format, you won't have much luck
getting any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary documents 
and index the resulting XML or attachment formats. As the REST API provides 
filtering capabilities, you could easily create incremental feeds to avoid 
humongous indexing every time there's new information in Jira. Dumping Jira 
stuff as PDF seems to me to be the least suitable way of handling this.
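
A hedged sketch of such an incremental pull (host, credentials and JQL are
hypothetical):

  # Fetch issues updated in the last day, selected fields only
  curl -u user:password -G "https://jira.example.com/rest/api/2/search" \
    --data-urlencode "jql=updated >= -1d" \
    --data-urlencode "fields=summary,description,status,reporter"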

Best regards,
--Jürgen


On 06.01.2015 18:30, ganesh.ya...@sungard.com
wrote:

Hello Solr-users and developers,

Can you please suggest,



1.   What I should do to index PDF content information column wise?



2.   Do I need to extract the contents using one of the Analyzer, Tokenize 
and Filter combination and then add it to Index? How can test the results on 
command prompt? I do not know the selection of specific Analyzer, Tokenizer and 
Filter for this purpose



3.   How can I verify that the needed column info is extracted out of PDF 
and is indexed?



4.   So for example How to verify Ticket number is extracted in 
Ticket_number tag and is indexed?



5.   Is it ok to post 4 GB worth of PDF to be imported and indexed by Solr? 
I think I saw some posts complaining on how large size that can be posted ?



6.   What will enable Solr to search in any PDF out of many, with different 
words such as Runtime Error  and result will provide the link to the 
PDF



My PDFs are nothing but Jira ticket system.

PDF has info on

Ticket Number:

Desc:

Client:

Status:

Submitter:

And so on:





1.   I imported PDF document in Solr and it does the necessary searching 
and I can test some of it using the browse client interface provided.



2.   I have 80 GB worth of PDFs.



3.   Total number of PDFs are about 200



4.   Many PDFs are of size 4 GB



5.   What do you suggest me to import such a large PDFs? What tools can you 
suggest to extract PDF contents first in some XML format and later Post that 
XML to be indexed by Solr.?

Your early response is much appreciated.

Thanks

G





--

Mit freundlichen 

Re: How large is your solr index?

2015-01-06 Thread Erick Erickson
Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?
Even after that's done, the sub-shard needs to be physically moved to
another machine
(probably), but that too could be scripted.

May not be desirable, but I thought I'd mention it.
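
For reference, a split is a single Collections API call, roughly (collection
and shard names are placeholders):

  curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"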

Best,
Erick

On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge peter.stu...@gmail.com wrote:
 Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
 performs reasonably well on commodity hardware with lots of faceting and
 concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
 it works.

 ++1 for the automagic shard creator. We've been looking into doing this
 sort of thing internally - i.e. when a shard reaches a certain size/num
 docs, it creates 'sub-shards' to which new commits are sent and queries to
 the 'parent' shard are included. The concept works, as long as you don't
 try any non-dist stuff - it's one reason why all our fields are always
 single valued. There are also other implications like cleanup, deletes and
 security to take into account, to name a few.
 A cool side-effect of sub-sharding (for lack of a snappy term) is that the
 parent shard then stops suffering from auto-warming latency due to commits
 (we do a fair amount of committing). In theory, you could carry on
 sub-sharding until your hardware starts gasping for air.


 On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam bram.van...@intix.eu wrote:

 On 01/04/2015 02:22 AM, Jack Krupansky wrote:

 The reality doesn't seem to
 be there today. 50 to 100 million documents, yes, but beyond that takes
 some kind of heroic effort, whether a much beefier box, very careful and
 limited data modeling or limiting of query capabilities or tolerance of
 higher latency, expert tuning, etc.


 I disagree. On the scale, at least. Up until 500M Solr performs well
 (read: well enough considering the scale) in a single shard on a single box
 of commodity hardware. Without any tuning or heroic efforts. Sure, some
 queries aren't as snappy as you'd like, and sure, indexing and querying at
 the same time will be somewhat unpleasant, but it will work, and it will
 work well enough.

 Will it work for thousands of concurrent users? Of course not. Anyone who
 is after that sort of thing won't find themselves in this scenario -- they
 will throw hardware at the problem.

 There is something to be said for making sharding less painful. It would
 be nice if, for instance, Solr would automagically create a new shard once
 some magic number was reached (2B at the latest, I guess). But then that'll
 break some query features ... :-(

 The reason we're using single large instances (sometimes on beefy
 hardware) is that SolrCloud is a pain. Not just from an administrative
 point of view (though that seems to be getting better, kudos for that!),
 but mostly because some queries cannot be executed with distributed=true.
 Our users, at least, prefer a slow query over an impossible query.

 Actually, this 2B limit is a good thing. It'll help me convince
 $management to donate some of our time to Solr :-)

  - Bram



Re: PDF search functionality using Solr

2015-01-06 Thread Erick Erickson
Seconding Jürgen's comment. 4 GB docs are almost, but not quite, totally
useless to search. How many JIRAs each? That's _one_ document unless
you do some fancy dancing. Pulling the data directly using the JIRA
API sounds far superior.

If you _must_ use the JIRA-PDF-Solr option, consider the following:
use Tika on the client to parse the doc, take control of the mapping of
the metadata, and, probably, break things up into individual documents,
one Solr document per JIRA.

That'll give you a chance to deal with charset issues and the like.
Here's an example:

https://lucidworks.com/blog/indexing-with-solrj/

That one has both Tika and database connectivity but should be pretty
straight-forward to adapt, just pull the database junk out.

Best,
Erick

On Tue, Jan 6, 2015 at 9:55 AM, Jürgen Wagner (DVT)
juergen.wag...@devoteam.com wrote:
 Hello,
   no matter which search platform you will use, this will pose two
 challenges:

 - The size of the documents will render search less and less useful as the
 likelihood of matches increases with document size. So, without a proper
 semantic extraction (e.g., using decent NER or relationship extraction with
 a commercial text mining product), I doubt you will get the required
 precision to make this overly useful.

 - PDFs can have their own character sets based on the characters actually
 used. Such file-specific character sets are almost impossible to parse,
 i.e., if your PDFs happen to use this feature of the PDF format, you won't
 be lucky getting any meaningful text out of them.

 My suggestion is to use the Jira REST API to collect all necessary documents
 and index the resulting XML or attachment formats. As the REST API provides
 filtering capabilities, you could easily create incremental feeds to avoid
 humongous indexing every time there's new information in Jira. Dumping Jira
 stuff as PDF seems to me to be the least suitable way of handling this.

 Best regards,
 --Jürgen



 On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:

 Hello Solr-users and developers,
 Can you please suggest,

 1.   What I should do to index PDF content information column wise?

 2.   Do I need to extract the contents using one of the Analyzer,
 Tokenize and Filter combination and then add it to Index? How can test the
 results on command prompt? I do not know the selection of specific Analyzer,
 Tokenizer and Filter for this purpose

 3.   How can I verify that the needed column info is extracted out of
 PDF and is indexed?

 4.   So for example How to verify Ticket number is extracted in
 Ticket_number tag and is indexed?

 5.   Is it ok to post 4 GB worth of PDF to be imported and indexed by
 Solr? I think I saw some posts complaining on how large size that can be
 posted ?

 6.   What will enable Solr to search in any PDF out of many, with
 different words such as Runtime Error  and result will provide the
 link to the PDF

 My PDFs are nothing but Jira ticket system.
 PDF has info on
 Ticket Number:
 Desc:
 Client:
 Status:
 Submitter:
 And so on:


 1.   I imported PDF document in Solr and it does the necessary searching
 and I can test some of it using the browse client interface provided.

 2.   I have 80 GB worth of PDFs.

 3.   Total number of PDFs are about 200

 4.   Many PDFs are of size 4 GB

 5.   What do you suggest me to import such a large PDFs? What tools can
 you suggest to extract PDF contents first in some XML format and later Post
 that XML to be indexed by Solr.?


 Your early response is much appreciated.



 Thanks

 G




 --

 Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
 уважением
 i.A. Jürgen Wagner
 Head of Competence Center Intelligence
  Senior Cloud Consultant

 Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
 Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
 E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de

 
 Managing Board: Jürgen Hatzipantelis (CEO)
 Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
 Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Solr Memory Usage - How to reduce memory footprint for solr

2015-01-06 Thread Abhishek Sharma
*Q* - I am forced to set the Java Xmx as high as 3.5g for my Solr app. If I
keep it low, my CPU hits 100% and the response time for indexing increases a
lot, and I have hit OOM errors as well when this value is low.

Is this too high? If so, how can I reduce it?

*Machine Details* 4 G RAM, SSD

*Solr App Details* (Standalone solr app, no shards)

   1. num. of Solr Cores = 5
   2. Index Size - 2 g
   3. num. of Search Hits per sec - 10 [*IMP* - All search queries have
   faceting..]
   4. num. of times Re-Indexing per hour per core - 10 (it may happen at
   the same time at a moment for all the 5 cores)
   5. Query Result Cache, Document cache and Filter Cache are all default
   size - 4 kb.

*top* stats -

  VIRTRESSHR S %CPU %MEM
6446600 3.478g  18308 S 11.3 94.6

*iotop* stats

 DISK READ  DISK WRITE  SWAPIN IO
0-1200 K/s0-100 K/s  0  0-5%


SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
I am new to Solr and was able to configure and run the samples, as well as
index data using DIH (from a database).

Just wondering if there are open source frameworks to query and
display/visualize.

Regards


Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
We've compared several projects before starting - AngularJS was among them.
It is great for stuff where you can find ready-made components, but writing
custom components was easier in other frameworks (you need to take this
statement with a grain of salt: it was specific to our situation). But that
was one year ago...

On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop vishal@gmail.com wrote:

 Thanks Roman... I will check it... Maybe it's off topic but how about
 Angular...
 On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Hi Vishal, Alexandre,
 
  Here is another one, using Backbone, just released v1.0.16
 
  https://github.com/adsabs/bumblebee
 
  you can see it in action: http://ui.adslabs.org/
 
  While it primarily serves our own needs, I tried to architect it to be
  extendible (within reasonable limits of code, man power)
 
  Roman
 
  On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch 
 arafa...@gmail.com
  wrote:
 
   That's very general question. So, the following are three random ideas
   just to get you started to think of options.
  
   *) spring.io (Spring Data Solr) + Vaadin
   *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
   too)
   *) http://projectblacklight.org/
  
   Regards,
  Alex.
   
   Sign up for my Solr resources newsletter at http://www.solr-start.com/
  
  
   On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com
 wrote:
I am new to SOLR and was able to configure, run samples as well as
 able
   to
index data using DIH (from database).
   
Just wondering if there are open source framework to query and
display/visualize.
   
Regards
  
 



Re: solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Michael Della Bitta

The downsides that come to mind:

1. Every write gets amplified by the number of nodes in the cloud. 1000 
write requests end up creating 1000*N HTTP calls as the leader forwards 
those writes individually to all of the followers in the cloud. Contrast 
that with classical replication where only changed index segments get 
replicated asynchronously.


2. Slightly more complicated infrastructure in terms of having to run a 
zookeeper cluster.


#1 is a trade off against being possibly more available to writes in the 
case of a single down node. In the cloud case, you're still open for 
business. In the classical replication case, you're no longer available 
for writes if the downed node is the master.


My two cents.

On 1/6/15 16:30, Will Milspec wrote:

Hi all,

We have a smallish index that performs well for searches and are
considering using solrcloud --but just for high availability/redundancy,
i.e. without any sharding.

The indexes would be replicated, but not distributed.

I know that there are no stupid questions..Only stupid people...but here
goes:

-is solrcloud w/o sharding done?( I.e. it's just not done!! )
-any downside (i.e. aside from the lack of horizontal scalability )

will





Re: SOLR - any open source framework

2015-01-06 Thread Alexandre Rafalovitch
That's very general question. So, the following are three random ideas
just to get you started to think of options.

*) spring.io (Spring Data Solr) + Vaadin
*)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too)
*) http://projectblacklight.org/

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
 I am new to SOLR and was able to configure, run samples as well as able to
 index data using DIH (from database).

 Just wondering if there are open source framework to query and
 display/visualize.

 Regards


Re: SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
Thanks a lot... We are in the process of analyzing what to use with SOLR...
On Jan 6, 2015 5:30 PM, Roman Chyla roman.ch...@gmail.com wrote:

 We've compared several projects before starting - AngularJS was on them,
  it is great for stuff where you could find components (already prepared)
 but writing custom components was easier in other framworks (you need to
 take this statement with grain of salt: it was specific to our situation),
 but that was one year ago...

 On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop vishal@gmail.com
 wrote:

  Thanks Roman... I will check it... Maybe it's off topic but how about
  Angular...
  On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote:
 
   Hi Vishal, Alexandre,
  
   Here is another one, using Backbone, just released v1.0.16
  
   https://github.com/adsabs/bumblebee
  
   you can see it in action: http://ui.adslabs.org/
  
   While it primarily serves our own needs, I tried to architect it to be
   extendible (within reasonable limits of code, man power)
  
   Roman
  
   On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch 
  arafa...@gmail.com
   wrote:
  
That's very general question. So, the following are three random
 ideas
just to get you started to think of options.
   
*) spring.io (Spring Data Solr) + Vaadin
*)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI
 builder
too)
*) http://projectblacklight.org/
   
Regards,
   Alex.

Sign up for my Solr resources newsletter at
 http://www.solr-start.com/
   
   
On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com
  wrote:
 I am new to SOLR and was able to configure, run samples as well as
  able
to
 index data using DIH (from database).

 Just wondering if there are open source framework to query and
 display/visualize.

 Regards
   
  
 



Re: Vertical search Engine

2015-01-06 Thread Dominique Bejean
Hi,

You can have a look at www.crawl-anywhere.com,
a web crawler on top of Solr, used for the following vertical search engines:

http://www.hurisearch.org/
http://www.searchamnesty.org/

Regards

Dominique


2015-01-06 15:22 GMT+01:00 Ahmet Arslan iori...@yahoo.com.invalid:

 Hi,

 http://manifoldcf.apache.org is another option to consider.
 It is useful for crawling protected pages.

 Free resources :

 http://www.manning.com/wright/ManifoldCFinAction_manuscript.pdf
 https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/

 Ahmet



 On Tuesday, January 6, 2015 1:56 PM, Jack Krupansky 
 jack.krupan...@gmail.com wrote:
 Consider the Fusion product from LucidWorks:
 http://lucidworks.com/product/fusion/

 Structuring of your data should be driven by your queries and access
 patterns - what are the most common queries and what are the most extreme
 and complex queries that you expect to handle, both in terms of how the
 queries are expressed and the results being returned.

 -- Jack Krupansky


 On Tue, Jan 6, 2015 at 3:39 AM, klunwebale klunweb...@gmail.com wrote:

  hello
 
  i want to create a vertical search engine like trovit.com.
 
  I have installed solr  and solarium.
 
  What else to i need can you recommend a suitable crawler
  and how to structure my data to be indexed
 
 
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: cloudsolrserver

2015-01-06 Thread Anshum Gupta
To get started, the ref guide should be helpful.

https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

You just need to pass the Zk host string to the constructor and then use
the server.

Also, what do you mean by *connect to CloudSolrServer*? You mean connect
using, right?


On Tue, Jan 6, 2015 at 2:58 PM, tharpa 7kavsn...@sneakemail.com wrote:

 We are switching from a direct HTTP connection to use cloudsolrserver.  I
 have looked and failed for an example of code for connecting to
 cloudsolrserver.  Are there any tutorials or code examples?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Anshum Gupta
http://about.me/anshumgupta


Re: cloudsolrserver

2015-01-06 Thread tharpa
Thanks Anshum.

If you say that "connect using CloudSolrServer" is more correct than
"connect to CloudSolrServer", I believe you.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724p4177728.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
Thanks Roman... I will check it... Maybe it's off topic but how about
Angular...
On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Vishal, Alexandre,

 Here is another one, using Backbone, just released v1.0.16

 https://github.com/adsabs/bumblebee

 you can see it in action: http://ui.adslabs.org/

 While it primarily serves our own needs, I tried to architect it to be
 extendible (within reasonable limits of code, man power)

 Roman

 On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com
 wrote:

  That's very general question. So, the following are three random ideas
  just to get you started to think of options.
 
  *) spring.io (Spring Data Solr) + Vaadin
  *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
  too)
  *) http://projectblacklight.org/
 
  Regards,
 Alex.
  
  Sign up for my Solr resources newsletter at http://www.solr-start.com/
 
 
  On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
   I am new to SOLR and was able to configure, run samples as well as able
  to
   index data using DIH (from database).
  
   Just wondering if there are open source framework to query and
   display/visualize.
  
   Regards
 



Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
Hi Vishal, Alexandre,

Here is another one, using Backbone, just released v1.0.16

https://github.com/adsabs/bumblebee

you can see it in action: http://ui.adslabs.org/

While it primarily serves our own needs, I tried to architect it to be
extendible (within reasonable limits of code, man power)

Roman

On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 That's very general question. So, the following are three random ideas
 just to get you started to think of options.

 *) spring.io (Spring Data Solr) + Vaadin
 *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
 too)
 *) http://projectblacklight.org/

 Regards,
Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
  I am new to SOLR and was able to configure, run samples as well as able
 to
  index data using DIH (from database).
 
  Just wondering if there are open source framework to query and
  display/visualize.
 
  Regards



Re: .htaccess / password

2015-01-06 Thread Craig Hoffman
Thanks Otis. Do you think a .htaccess / .passwd file in the Solr admin dir
would interfere with its operation?
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


 On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic otis.gospodne...@gmail.com 
 wrote:
 
 Hi Craig,
 
 If you want to protect Solr, put it behind something like Apache / Nginx /
 HAProxy and put .htaccess at that level, in front of Solr.
 Or try something like
 http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/
 
 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/
 
 
 On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote:
 
 Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
 will Solr continue to function properly? One thing to note, I will have a
 CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
 looking for a way to secure this area.
 
 Thanks,
 Craig
 --
 Craig Hoffman
 w: http://www.craighoffmanphotography.com
 FB: www.facebook.com/CraigHoffmanPhotography
 TW: https://twitter.com/craiglhoffman


Re: .htaccess / password

2015-01-06 Thread Michael Della Bitta
The Jetty servlet container that Solr uses doesn't understand those 
files. It would not use them to determine access, and would likely make 
them accessible to web requests in plain text.


On 1/6/15 16:01, Craig Hoffman wrote:

Thanks Otis. Do think a .htaccess / .passwd file in the Solr admin dir would 
interfere with its operation?
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi Craig,

If you want to protect Solr, put it behind something like Apache / Nginx /
HAProxy and put .htaccess at that level, in front of Solr.
Or try something like
http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote:


Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
will Solr continue to function properly? One thing to note, I will have a
CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
looking for a way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


Re: solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Chris Hostetter

: #1 is a trade off against being possibly more available to writes in the case
: of a single down node. In the cloud case, you're still open for business. In
: the classical replication case, you're no longer available for writes if the
: downed node is the master.

or to put it another way: classic replication lets you use N nodes for 
high availability reads, but you have a single point of failure for 
writes.

solr cloud gives you high availability for reads and writes -- including 
NRT support -- at the expense of more network overhead when writes happen.

:  -is solrcloud w/o sharding done?( I.e. it's just not done!! )
:  -any downside (i.e. aside from the lack of horizontal scalability )

it is certainly done -- specifically it is a matter of creating a 
collection with numShards=1 and replicationFactor=N.



-Hoss
http://www.lucidworks.com/


Re: SOLR - any open source framework

2015-01-06 Thread Erick Erickson
There's also the VelocityResponseWriter that comes with Solr. It takes
some effort to modify, but not a lot. It's useful for very fast iterations.

Best,
Erick

On Tue, Jan 6, 2015 at 1:58 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 That's very general question. So, the following are three random ideas
 just to get you started to think of options.

 *) spring.io (Spring Data Solr) + Vaadin
 *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too)
 *) http://projectblacklight.org/

 Regards,
Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
 I am new to SOLR and was able to configure, run samples as well as able to
 index data using DIH (from database).

 Just wondering if there are open source framework to query and
 display/visualize.

 Regards


Re: .htaccess / password

2015-01-06 Thread Otis Gospodnetic
Hi Craig,

If you want to protect Solr, put it behind something like Apache / Nginx /
HAProxy and put .htaccess at that level, in front of Solr.
Or try something like
http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/
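
A minimal sketch of the reverse-proxy approach, assuming Apache httpd in
front (paths and realm are hypothetical):

  htpasswd -c /etc/apache2/solr.htpasswd solradmin
  # Then in the vhost config, proxy and protect /solr:
  #   <Location /solr>
  #     ProxyPass http://localhost:8983/solr
  #     ProxyPassReverse http://localhost:8983/solr
  #     AuthType Basic
  #     AuthName "Solr"
  #     AuthUserFile /etc/apache2/solr.htpasswd
  #     Require valid-user
  #   </Location>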

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote:

 Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
 will Solr continue to function properly? One thing to note, I will have a
 CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
 looking for a way to secure this area.

 Thanks,
 Craig
 --
 Craig Hoffman
 w: http://www.craighoffmanphotography.com
 FB: www.facebook.com/CraigHoffmanPhotography
 TW: https://twitter.com/craiglhoffman


solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Will Milspec
Hi all,

We have a smallish index that performs well for searches and are
considering using solrcloud --but just for high availability/redundancy,
i.e. without any sharding.

The indexes would be replicated, but not distributed.

I know that there are no stupid questions..Only stupid people...but here
goes:

-is solrcloud w/o sharding done?( I.e. it's just not done!! )
-any downside (i.e. aside from the lack of horizontal scalability )

will


Re: SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
Great... Thanks for the inputs... I explored the Velocity response writer;
some posts suggest it is good for prototyping but not for production...
On Jan 6, 2015 4:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 That's very general question. So, the following are three random ideas
 just to get you started to think of options.

 *) spring.io (Spring Data Solr) + Vaadin
 *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
 too)
 *) http://projectblacklight.org/

 Regards,
Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
  I am new to SOLR and was able to configure, run samples as well as able
 to
  index data using DIH (from database).
 
  Just wondering if there are open source framework to query and
  display/visualize.
 
  Regards



cloudsolrserver

2015-01-06 Thread tharpa
We are switching from a direct HTTP connection to using CloudSolrServer. I
have looked and failed to find an example of code for connecting with
CloudSolrServer. Are there any tutorials or code examples?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to limit the number of result sets of the 'export' handler

2015-01-06 Thread Sandy Ding
Thanks Alexandre.
I actually need the whole result set, but it is large (perhaps 10m-100m docs)
and I find select is slow.
How does export differ from select, other than that select will make
distributed requests and do the merge?
Will select with 'distrib=false' have performance comparable to export?


2015-01-06 20:55 GMT+08:00 Alexandre Rafalovitch arafa...@gmail.com:

 Export was specifically designed to get everything which is very
 expensive otherwise.

 If you just want the subset, you might be better off with normal
 queries and/or with deep paging (cursor).

 Regards,
Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 6 January 2015 at 00:30, Sandy Ding sandy.ding...@gmail.com wrote:
  Using rows=xxx doesn't seem to work.
  Is there a way to do this?



PDF search functionality using Solr

2015-01-06 Thread Ganesh.Yadav
Hello Solr users and developers,

Can you please suggest:

1. What should I do to index PDF content information column-wise?

2. Do I need to extract the contents using one of the Analyzer, Tokenizer and
Filter combinations and then add it to the index? How can I test the results
on the command prompt? I do not know which specific Analyzer, Tokenizer and
Filter to select for this purpose.

3. How can I verify that the needed column info is extracted out of the PDF
and indexed?

4. For example, how do I verify that the ticket number is extracted into a
Ticket_number tag and indexed?

5. Is it OK to post 4 GB worth of PDFs to be imported and indexed by Solr?
I think I saw some posts complaining about how large a post can be.

6. What will enable Solr to search in any PDF out of many, with different
words such as "Runtime Error", and return a link to the matching PDF?

My PDFs are nothing but a Jira ticket system.
Each PDF has info on:
Ticket Number:
Desc:
Client:
Status:
Submitter:
And so on:

1. I imported a PDF document into Solr; it does the necessary searching, and
I can test some of it using the browse client interface provided.

2. I have 80 GB worth of PDFs.

3. The total number of PDFs is about 200.

4. Many PDFs are 4 GB in size.

5. How do you suggest I import such large PDFs? What tools can you suggest to
extract the PDF contents first into some XML format and later post that XML
to be indexed by Solr?

Your early response is much appreciated.

Thanks

G



htaccess

2015-01-06 Thread Craig Hoffman
Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/ will
Solr continue to function properly? One thing to note, I will have a CRON job
that runs nightly that re-indexes the engine. In a nutshell I’m looking for a
way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Hi Charles,

See http://search-lucene.com/?q=solr+hdfs and
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
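
From the ref guide linked above, a standalone 4.x Solr with its index on HDFS
boils down to system properties like these (host, port and paths are
placeholders):

  java -Dsolr.directoryFactory=HdfsDirectoryFactory \
       -Dsolr.lock.type=hdfs \
       -Dsolr.data.dir=hdfs://namenode:8020/solr \
       -Dsolr.updatelog=hdfs://namenode:8020/solr \
       -jar start.jar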

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr
wrote:

 I am considering using *Solr* to extend *Hortonworks Data Platform*
 capabilities to search.

 - I found tutorials to index documents into a Solr instance from *HDFS*,
 but I guess this solution would require a Solr cluster distinct to the
 Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
 cluster instead? - *With the index stored in HDFS?*

 - Where would the processing take place (could it be handed down to
 Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
 integrate with *Yarn*?

 - What about *SolrCloud*: what does it bring regarding Hadoop based
 use-cases? Does it stand for a Solr-only cluster?

 - Well, if that could lead to something working with a roles-based
 authorization-compliant *Banana*, it would be Christmass again!

 Thanks a lot for any help!

 Charles


Re: Running Multiple Solr Instances

2015-01-06 Thread Nishanth S
Thanks a lot, guys. As a beginner, these are very helpful for me.

Thanks,
Nishanth

On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 I would do one of either:

 1. Set a different Solr home for each instance. I'd use the
 -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so.

 2. RAID 10 the drives. If you expect the Solr instances to get uneven
 traffic, pooling the drives will allow a given Solr instance to share the
 capacity of all of them.


 On 1/5/15 23:31, Nishanth S wrote:

 Hi folks,

 I  am running  multiple solr instances  (Solr 4.10.3 on tomcat 8).There
 are
 3 physical machines and  I have 4 solr instances running  on each machine
 on ports  8080,8081,8082 and 8083.The set up is well up to this point.Now
 I
 want to point each of these instance to a different  index directories.The
 drives in the machines are mounted as d/1,d/2,d/3 ,d/4 etc.Now if I define
 /d/1 as  the solr home all solr index directories  are created in /d/1
 where as the other drives remain un used.So how do I configure solr to
   make use of all the drives so that I can  get maximum storage for solr.I
 would really appreciate any help in this regard.

 Thanks,
 Nishanth





Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Charles VALLEE
I am considering using Solr to extend Hortonworks Data Platform
capabilities to search.

- I found tutorials to index documents into a Solr instance from HDFS, but
I guess this solution would require a Solr cluster distinct from the Hadoop
cluster. Is it possible to have Solr integrated into the Hadoop cluster
instead - with the index stored in HDFS?
- Where would the processing take place (could it be handed down to
Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
integrate with Yarn?
- What about SolrCloud: what does it bring regarding Hadoop-based
use-cases? Does it stand for a Solr-only cluster?
- Well, if that could lead to something working with a roles-based,
authorization-compliant Banana, it would be Christmas again!

Thanks a lot for any help!

Charles




Re: How large is your solr index?

2015-01-06 Thread Peter Sturge
Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
performs reasonably well on commodity hardware with lots of faceting and
concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
it works.

++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/number
of docs, it creates 'sub-shards' to which new commits are sent and which
are included in queries to the 'parent' shard. The concept works, as long
as you don't try any non-distributed stuff - it's one reason why all our
fields are always single valued. There are also other implications like
cleanup, deletes and security to take into account, to name a few.
A cool side-effect of sub-sharding (for lack of a snappy term) is that the
parent shard then stops suffering from auto-warming latency due to commits
(we do a fair amount of committing). In theory, you could carry on
sub-sharding until your hardware starts gasping for air.
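
For reference, SolrCloud already exposes a manual flavour of this through the
Collections API: the SPLITSHARD action (available since Solr 4.3) splits a
shard into two sub-shards, and the parent keeps serving queries until the
sub-shards are active. A minimal sketch; host, collection and shard names
are illustrative:

# Split shard1 of "mycollection" into two sub-shards. The parent shard
# keeps serving until the sub-shards are ready, then goes inactive.
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"

# Inspect the outcome in the cluster state (Solr 4.8+):
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"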


On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam bram.van...@intix.eu wrote:

 On 01/04/2015 02:22 AM, Jack Krupansky wrote:

 The reality doesn't seem to
 be there today. 50 to 100 million documents, yes, but beyond that takes
 some kind of heroic effort, whether a much beefier box, very careful and
 limited data modeling or limiting of query capabilities or tolerance of
 higher latency, expert tuning, etc.


 I disagree. On the scale, at least. Up until 500M Solr performs well
 (read: well enough considering the scale) in a single shard on a single box
 of commodity hardware. Without any tuning or heroic efforts. Sure, some
 queries aren't as snappy as you'd like, and sure, indexing and querying at
 the same time will be somewhat unpleasant, but it will work, and it will
 work well enough.

 Will it work for thousands of concurrent users? Of course not. Anyone who
 is after that sort of thing won't find themselves in this scenario -- they
 will throw hardware at the problem.

 There is something to be said for making sharding less painful. It would
 be nice if, for instance, Solr would automagically create a new shard once
 some magic number was reached (2B at the latest, I guess). But then that'll
 break some query features ... :-(

 The reason we're using single large instances (sometimes on beefy
 hardware) is that SolrCloud is a pain. Not just from an administrative
 point of view (though that seems to be getting better, kudos for that!),
 but mostly because some queries cannot be executed with distributed=true.
 Our users, at least, prefer a slow query over an impossible query.

 Actually, this 2B limit is a good thing. It'll help me convince
 $management to donate some of our time to Solr :-)

  - Bram



.htaccess / password

2015-01-06 Thread Craig Hoffman
Quick question: If I put a .htaccess file in front of
www.mydomain.com:8983/solr/#/, will Solr continue to function properly? One
thing to note: I will have a CRON job that runs nightly that re-indexes the
engine. In a nutshell, I’m looking for a way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman
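
For what it's worth: .htaccess files are read by Apache httpd, not by Solr's
embedded Jetty, so the usual way to get this kind of protection is an httpd
reverse proxy with basic auth in front of the Solr port. A minimal sketch,
assuming Apache 2.4 with Debian-style paths; all names are illustrative. A
nightly CRON job on the same host can keep talking to localhost:8983
directly, bypassing the proxy - just make sure port 8983 itself is
firewalled from the outside.

# Create a basic-auth user for the proxy.
sudo htpasswd -c /etc/apache2/solr.htpasswd solradmin

# Proxy /solr to the local Solr port, requiring a login.
sudo tee /etc/apache2/conf-enabled/solr-auth.conf > /dev/null <<'EOF'
<Location "/solr">
    ProxyPass        http://localhost:8983/solr
    ProxyPassReverse http://localhost:8983/solr
    AuthType Basic
    AuthName "Solr"
    AuthUserFile /etc/apache2/solr.htpasswd
    Require valid-user
</Location>
EOF

sudo a2enmod proxy proxy_http
sudo service apache2 reload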



Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Oh, and https://issues.apache.org/jira/browse/SOLR-6743
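
In short, the cwiki page linked below comes down to pointing Solr's directory
factory at HDFS. A minimal sketch for standalone Solr 4.x; the namenode
host, port and paths are illustrative, and SolrCloud setups would use
-Dsolr.hdfs.home instead of per-core directories:

# Run standalone Solr 4.x with its index and transaction log on HDFS.
cd example
java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.data.dir=hdfs://namenode:8020/solr/data \
     -Dsolr.updatelog=hdfs://namenode:8020/solr/ulog \
     -jar start.jar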

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi Charles,

 See http://search-lucene.com/?q=solr+hdfs and
 https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/


 On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr
 wrote:

 I am considering using *Solr* to extend *Hortonworks Data Platform*
 capabilities with search.

 - I found tutorials for indexing documents into a Solr instance from *HDFS*,
 but I guess this solution would require a Solr cluster distinct from the
 Hadoop cluster. Is it possible to have Solr integrated into the Hadoop
 cluster instead - *with the index stored in HDFS?*

 - Where would the processing take place (could it be handed down to
 Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
 integrate with *Yarn*?

 - What about *SolrCloud*: what does it bring to Hadoop-based
 use cases? Does it imply a Solr-only cluster?

 - Well, if that could lead to something working with a roles-based,
 authorization-compliant *Banana*, it would be Christmas again!

 Thanks a lot for any help!

 Charles








RE: Running Multiple Solr Instances

2015-01-06 Thread Ganesh.Yadav
Nishanth,

1.   I understand you are implementing clustering for the web apps, i.e.
running the same application on multiple instances on one or more machines.

2.   If each of your web apps points to a different index directory, how
will it switch to the next web app with a different index if the search
term is not found in the first index directory?

3.   Or will the web app collect the results sequentially from all the
index directories and present the combined collection to the user?
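
For what it's worth, Solr itself can merge results across separate indexes
with the shards parameter on a distributed query, rather than the web app
querying each index in turn. A minimal sketch, with hosts and core names
illustrative:

# Distributed query across four Solr instances; each shard searches its
# own index and Solr merges the results into one response.
curl "http://host1:8080/solr/core1/select?q=error&shards=host1:8080/solr/core1,host1:8081/solr/core2,host1:8082/solr/core3,host1:8083/solr/core4"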



Please share your thoughts.

Thanks,
G







-Original Message-
From: Nishanth S [mailto:nishanth.2...@gmail.com]
Sent: Tuesday, January 06, 2015 12:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Running Multiple Solr Instances



Thanks a lot, guys. As a beginner, these are very helpful for me.



Thanks,

Nishanth



On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:



 I would do one of the following:

 1. Set a different Solr home for each instance. I'd use the
 -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so
 (a sketch follows below this quote).

 2. RAID 10 the drives. If you expect the Solr instances to get uneven
 traffic, pooling the drives will allow a given Solr instance to share
 the capacity of all of them.

 On 1/5/15 23:31, Nishanth S wrote:

 Hi folks,

 I am running multiple Solr instances (Solr 4.10.3 on Tomcat 8). There are
 3 physical machines, and I have 4 Solr instances running on each machine
 on ports 8080, 8081, 8082 and 8083. The setup is fine up to this point.
 Now I want to point each of these instances to a different index
 directory. The drives in the machines are mounted as /d/1, /d/2, /d/3,
 /d/4, etc. Now if I define /d/1 as the Solr home, all Solr index
 directories are created in /d/1 while the other drives remain unused. So
 how do I configure Solr to make use of all the drives so that I get
 maximum storage for Solr? I would really appreciate any help in this
 regard.

 Thanks,
 Nishanth
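
A minimal sketch of the first suggestion above on Tomcat: one CATALINA_BASE
per Solr instance, each with its own Solr home on a different drive. The
layout, ports and paths are illustrative; each instance's conf/server.xml
carries its own connector port.

# Start four Tomcat-hosted Solr instances, one per mounted drive.
# /opt/tomcat-solr-N are separate CATALINA_BASE layouts (conf/, logs/,
# temp/, webapps/ with solr.war); ports 8080-8083 are configured in each
# instance's conf/server.xml.
for i in 1 2 3 4; do
  export CATALINA_BASE="/opt/tomcat-solr-$i"
  export JAVA_OPTS="-Dsolr.solr.home=/d/$i/solr"   # index dirs land under /d/$i
  /opt/tomcat/bin/startup.sh                       # shared CATALINA_HOME binaries
done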








Re: IstvanKulcsar - Wiki Solr

2015-01-06 Thread Shawn Heisey
On 1/6/2015 7:28 AM, ikulc...@precognox.com wrote:
 I would like to suggest pages which use Solr and were developed by my company.
 
 Please put this page this site:
 http://wiki.apache.org/solr/PublicServers
 
 http://www.odrportal.hu/kereso/
 http://idea.unideb.hu/idealista/
 http://www.jobmonitor.hu
 http://www.profession.hu/
 http://webicina.com/
 http://www.cylex.hu/

Create a user on the Solr wiki and let us know what your username is.
We will get your username added to the group that allows you to edit the
wiki.  Is IstvanKulcsar (first thing in the subject) your username on
the wiki?

Thanks,
Shawn



Re: PDF search functionality using Solr

2015-01-06 Thread Jürgen Wagner (DVT)
Hello,
  no matter which search platform you will use, this will pose two
challenges:

- The size of the documents will render search less and less useful as
the likelihood of matches increases with document size. So, without a
proper semantic extraction (e.g., using decent NER or relationship
extraction with a commercial text mining product), I doubt you will get
the required precision to make this overly useful.

- PDFs can have their own character sets based on the characters
actually used. Such file-specific character sets are almost impossible
to parse, i.e., if your PDFs happen to use this feature of the PDF
format, you won't have much luck getting any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary
documents and index the resulting XML or attachment formats. As the REST
API provides filtering capabilities, you could easily create incremental
feeds to avoid humongous indexing every time there's new information in
Jira. Dumping Jira stuff as PDF seems to me to be the least suitable way
of handling this.
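
A minimal sketch of such an incremental pull, assuming Jira's standard
/rest/api/2/search REST endpoint; host, credentials, the JQL window and the
field list are all illustrative:

# Pull issues updated in the last day as JSON, to feed an indexer
# instead of dumping PDFs.
curl -s -G -u indexer:secret "https://jira.example.com/rest/api/2/search" \
     --data-urlencode "jql=updated >= -1d ORDER BY updated ASC" \
     --data-urlencode "fields=summary,description,status,reporter" \
     --data-urlencode "maxResults=100" > jira-batch.json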

Best regards,
--Jürgen


On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
 Hello Solr-users and developers,
 Can you please suggest,

 1.   What should I do to index PDF content information column-wise?

 2.   Do I need to extract the contents using an Analyzer, Tokenizer and 
 Filter combination and then add it to the index? How can I test the 
 results on the command prompt? I do not know which specific Analyzer, 
 Tokenizer and Filter to select for this purpose.

 3.   How can I verify that the needed column info is extracted out of the 
 PDF and is indexed?

 4.   So, for example, how do I verify that the ticket number is extracted 
 into the Ticket_number field and is indexed?

 5.   Is it OK to post 4 GB worth of PDF to be imported and indexed by 
 Solr? I think I saw some posts complaining about how large a post can be.

 6.   What will enable Solr to search across many PDFs for words 
 such as Runtime Error, with the result providing a link to the 
 matching PDF?

 My PDFs are nothing but a Jira ticket system.
 PDF has info on
 Ticket Number:
 Desc:
 Client:
 Status:
 Submitter:
 And so on:


 1.   I imported a PDF document in Solr and it does the necessary searching, 
 and I can test some of it using the browse client interface provided.

 2.   I have 80 GB worth of PDFs.

 3.   The total number of PDFs is about 200.

 4.   Many PDFs are around 4 GB in size.

 5.   How do you suggest I import such large PDFs? What tools can you 
 suggest to extract the PDF contents into some XML format first and later 
 post that XML to be indexed by Solr?







 Your early response is much appreciated.



 Thanks

 G




-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center Intelligence
 Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: http://www.devoteam.de/


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




RE: PDF search functionality using Solr Schema.xml and SolrConfig.xml question

2015-01-06 Thread Ganesh.Yadav
Thanks, Jürgen, for your quick reply.

Still looking for an answer on Schema.xml and SolrConfig.xml.


1.   Do I need to tell Solr, in order to extract the Title from a PDF, to go
look for the word Title, extract the entire line after the tag, and collect
all such occurrences from hundreds of PDFs to build the Title column data
and index it?

2.   How do I define my own schema in Solr?

3.   Say I defined my fields Title, Ticket_number, Submitter, Client and so
on. How can I verify the respective data is extracted into specific columns
in Solr and indexed? Any suggestion on which Analyzer, Tokenizer and Filter
combination will help for this purpose? (A sketch follows at the end of this
message.)


1.   I do not want to dump the entire 4 GB of PDF contents into one
searchable field (ATTR_CONTENT) in Solr.

2.   Even if the entire PDF content is extracted into the above field by
default, I still want to extract specific searchable column data into its
respective fields.

3.   Rather, I want to configure Solr to have column-wise searchable
contents such as Title, number, and so on.

Any suggestions on performance? The PDF database is 80 GB; will it be fast
enough? Do I need to divide it across multiple cores, multiple machines,
multiple web apps? And clustering?


I should have mentioned my PDFs are from a ticketing system like Jira, which
was retired from production long ago; all I have is the ticketing system's
PDF database.


4.   My system will be used internally by just a select few people.

5.   They can wait for a 4 GB PDF to load.

6.   I agree that many matches will be found in one large PDF, depending
on the search criteria.

7.   To make searches faster, I want Solr to create more columns and
column-based indexes.

8.   Solr underneath uses Tika, which extracts the contents and strips all
the rich-content formatting characters present in the PDF document.

9.   I believe the resulting extraction size is about 1/5th of the original
PDF - just a rough guess based on one sample extraction.
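
A minimal sketch of the kind of per-field extraction asked about above,
assuming the stock ExtractingRequestHandler at /update/extract and
illustrative field names that would have to exist in schema.xml. Note that
Tika only exposes document metadata (title, author, ...) as separate fields;
values buried in the PDF body text, such as a ticket number, have to be
supplied as literals or parsed out by custom code:

# Index one PDF, mapping Tika's "title" metadata to a title field,
# capturing the body text in attr_content, and setting the ticket
# number as a literal field value.
curl "http://localhost:8983/solr/update/extract?literal.id=ticket-1234&literal.ticket_number=1234&fmap.title=title&fmap.content=attr_content&commit=true" \
     -F "myfile=@ticket-1234.pdf"

# Verify that the mapped fields were populated:
curl "http://localhost:8983/solr/select?q=ticket_number:1234&fl=id,title,ticket_number&wt=json"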




From: Jürgen Wagner (DVT) [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr

Hello,
  no matter which search platform you will use, this will pose two challenges:

- The size of the documents will render search less and less useful as the 
likelihood of matches increases with document size. So, without a proper 
semantic extraction (e.g., using decent NER or relationship extraction with a 
commercial text mining product), I doubt you will get the required precision to 
make this overly useful.

- PDFs can have their own character sets based on the characters actually used. 
Such file-specific character sets are almost impossible to parse, i.e., if your 
PDFs happen to use this feature of the PDF format, you won't have much luck getting 
any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary documents 
and index the resulting XML or attachment formats. As the REST API provides 
filtering capabilities, you could easily create incremental feeds to avoid 
humongous indexing every time there's new information in Jira. Dumping Jira 
stuff as PDF seems to me to be the least suitable way of handling this.

Best regards,
--Jürgen


On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:

Hello Solr-users and developers,

Can you please suggest,

1.   What should I do to index PDF content information column-wise?

2.   Do I need to extract the contents using an Analyzer, Tokenizer and
Filter combination and then add it to the index? How can I test the results
on the command prompt? I do not know which specific Analyzer, Tokenizer and
Filter to select for this purpose.

3.   How can I verify that the needed column info is extracted out of the
PDF and is indexed?

4.   So, for example, how do I verify that the ticket number is extracted
into the Ticket_number field and is indexed?

5.   Is it OK to post 4 GB worth of PDF to be imported and indexed by Solr?
I think I saw some posts complaining about how large a post can be.

6.   What will enable Solr to search across many PDFs for words such as
Runtime Error, with the result providing a link to the matching PDF?

My PDFs are nothing but a Jira ticket system.

PDF has info on

Ticket Number:
Desc:
Client:
Status:
Submitter:
And so on:

1.   I imported a PDF document in Solr and it does the necessary searching,
and I can test some of it using the browse client interface provided.

2.   I have 80 GB worth of PDFs.

3.   The total number of PDFs is about 200.

4.   Many PDFs are around 4 GB in size.

5.   How do you suggest I import such large PDFs? What tools can you
suggest to extract the PDF contents into some XML format first and later
post that XML to be indexed by Solr?

Your early response is much appreciated.







Thanks