RE: Korean script conversion

2015-04-14 Thread Eyal Naamati
Trying again since I don't have an answer yet.
Thanks!

Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com
www.exlibrisgroup.com

From: Eyal Naamati
Sent: Sunday, March 29, 2015 7:52 AM
To: solr-user@lucene.apache.org
Subject: Korean script conversion

Hi,

We are starting to index records in Korean. Korean text can be written in two 
scripts: Han characters (Chinese) and Hangul characters (Korean).
We are looking for some solr filter or another built in solr component that 
converts between Han and Hangul characters (transliteration).
I know there is the ICUTransformFilterFactory that can convert between
Japanese or Chinese scripts, for example:
<filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/> for
Japanese script conversions.
So far I couldn't find anything ready-made for Korean scripts, but perhaps
someone knows of one?
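
In the meantime, a small ICU4J check can at least tell me whether ICU itself
ships any Han/Hangul transform for the filter to use (a sketch;
ICUTransformFilterFactory just wraps ICU's Transliterator):

import java.util.Enumeration;
import com.ibm.icu.text.Transliterator;

// Sketch: list the ICU transliterator IDs mentioning Han or Hangul, to see
// whether a Han<->Hangul transform exists before configuring
// ICUTransformFilterFactory in the schema.
public class ListIcuTransforms {
    public static void main(String[] args) {
        Enumeration<String> ids = Transliterator.getAvailableIDs();
        while (ids.hasMoreElements()) {
            String id = ids.nextElement();
            if (id.contains("Hangul") || id.contains("Han")) {
                System.out.println(id);
            }
        }
    }
}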

Thanks!
Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com
www.exlibrisgroup.com



Re: Indexing PDF and MS Office files

2015-04-14 Thread Shyam R
Vijay,

You could try Excel files saved in different formats to rule out whether the
issue is with the Tika version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes  wrote:

> Perhaps the PDF is protected and the content cannot be extracted?
>
> I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
> not support some or all Office 2013 document formats.
>
>
>
>
>
> On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>
>> Try doing a manual extraction request directly to Solr (not via SolrJ) and
>> use the extractOnly option to see if the content is actually extracted.
>>
>> See:
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>
>> Also, some PDF files actually have the content as a bitmap image, so no
>> text is extracted.
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
>> vijaya.bhoomire...@whishworks.com> wrote:
>>
>>  Hi,
>>>
>>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>>> Please let me know what is going wrong with the indexing process.
>>>
>>> I am using Solr 4.10.2 with the default example server configuration
>>> that comes with the Solr distribution.
>>>
>>> PDF Files - Indexing as such works fine, and when I query using *:* in the
>>> Solr Query console, the metadata information is displayed properly. However,
>>> the PDF content field is empty. This happens for every PDF file I have
>>> tried, including some proprietary files, PDF eBooks, etc. Whatever the
>>> PDF file, the content is not displayed.
>>>
>>> MS Office files - For some office files, everything works perfectly and the
>>> extracted content is visible in the query console. However, for others, I
>>> see the error message below during the indexing process.
>>>
>>> *Exception in thread "main"
>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>>> from
>>> org.apache.tika.parser.microsoft.OfficeParser*
>>>
>>>
>>> I am using SolrJ to index the documents and below is the code snippet
>>> related to indexing. Please let me know where the issue is occurring.
>>>
>>> static String solrServerURL = "http://localhost:8983/solr";
>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>> static ContentStreamUpdateRequest indexingReq =
>>>     new ContentStreamUpdateRequest("/update/extract");
>>>
>>> indexingReq.addFile(file, fileType);
>>> indexingReq.setParam("literal.id", literalId);
>>> indexingReq.setParam("uprefix", "attr_");
>>> indexingReq.setParam("fmap.content", "content");
>>> indexingReq.setParam("literal.fileurl", fileURL);
>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>> solrServer.request(indexingReq);
>>>
>>> Thanks & Regards
>>> Vijay
>>>
>>> --
>>> The contents of this e-mail are confidential and for the exclusive use of
>>> the intended recipient. If you receive this e-mail in error please delete
>>> it from your system immediately and notify us either by e-mail or
>>> telephone. You should not copy, forward or otherwise disclose the content
>>> of the e-mail. The views expressed in this communication may not
>>> necessarily be the view held by WHISHWORKS.
>>>
>>>
>


-- 
Ph: 9845704792


Re: Indexing PDF and MS Office files

2015-04-14 Thread Terry Rhodes

Perhaps the PDF is protected and the content cannot be extracted?

I have an unverified suspicion that the Tika shipped with Solr 4.10.2
may not support some or all Office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:


Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
Please let me know what is going wrong with the indexing process.

I am using Solr 4.10.2 with the default example server configuration
that comes with the Solr distribution.

PDF Files - Indexing as such works fine, and when I query using *:* in the
Solr Query console, the metadata information is displayed properly. However,
the PDF content field is empty. This happens for every PDF file I have
tried, including some proprietary files, PDF eBooks, etc. Whatever the
PDF file, the content is not displayed.

MS Office files - For some office files, everything works perfectly and the
extracted content is visible in the query console. However, for others, I
see the error message below during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq =
    new ContentStreamUpdateRequest("/update/extract");

indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards
Vijay

--
The contents of this e-mail are confidential and for the exclusive use of
the intended recipient. If you receive this e-mail in error please delete
it from your system immediately and notify us either by e-mail or
telephone. You should not copy, forward or otherwise disclose the content
of the e-mail. The views expressed in this communication may not
necessarily be the view held by WHISHWORKS.





Re: Indexing PDF and MS Office files

2015-04-14 Thread Jack Krupansky
Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.
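
For example, something like this plain-HTTP sketch (no SolrJ involved; the
"collection1" core name and the content type are assumptions) dumps whatever
Solr Cell extracts:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: POST a raw file to /update/extract with extractOnly=true; Solr
// returns the extracted text and metadata instead of indexing the document.
public class ExtractOnlyCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
            "http://localhost:8983/solr/collection1/update/extract?extractOnly=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/pdf");
        OutputStream out = conn.getOutputStream();
        Files.copy(Paths.get(args[0]), out);
        out.close();
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            System.out.write(buf, 0, n);  // dump Solr's response (XML by default)
        }
        in.close();
    }
}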


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
> Please let me know what is going wrong with the indexing process.
>
> I am using Solr 4.10.2 with the default example server configuration
> that comes with the Solr distribution.
>
> PDF Files - Indexing as such works fine, and when I query using *:* in the
> Solr Query console, the metadata information is displayed properly. However,
> the PDF content field is empty. This happens for every PDF file I have
> tried, including some proprietary files, PDF eBooks, etc. Whatever the
> PDF file, the content is not displayed.
>
> MS Office files - For some office files, everything works perfectly and the
> extracted content is visible in the query console. However, for others, I
> see the error message below during the indexing process.
>
> *Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser*
>
>
> I am using SolrJ to index the documents and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
>
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq =
>     new ContentStreamUpdateRequest("/update/extract");
>
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("literal.id", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
>
> Thanks & Regards
> Vijay
>
> --
> The contents of this e-mail are confidential and for the exclusive use of
> the intended recipient. If you receive this e-mail in error please delete
> it from your system immediately and notify us either by e-mail or
> telephone. You should not copy, forward or otherwise disclose the content
> of the e-mail. The views expressed in this communication may not
> necessarily be the view held by WHISHWORKS.
>


Re: [ANNOUNCE] Apache Solr 5.1.0 released

2015-04-14 Thread Anshum Gupta
Hi Joe,

This should help you:
http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.upgrading_from_solr_5.0

On Tue, Apr 14, 2015 at 12:47 PM, Joseph Obernberger <
j...@lovehorsepower.com> wrote:

> Great news!
> Any tips on how to do an upgrade from 5.0.0 to 5.1.0?
> Thank you!
>
> -Joe
>
>
> On 4/14/2015 2:39 PM, Timothy Potter wrote:
>
>> I apologize - Yonik prepared these nice release notes for 5.1 and I
>> neglected to include them:
>>
>> Solr 5.1 Release Highlights:
>>
>>   * The new Facet Module, including the JSON Facet API.
>> This module is currently marked as experimental to allow for
>> further API feedback and improvements.
>>
>>   * A new JSON request API.
>> This feature is currently marked as experimental to allow for
>> further API feedback and improvements.
>>
>>   * The ability to upload and download Solr configurations via SolrJ
>> (CloudSolrClient).
>>
>>   * First-class support for Real-Time Get in SolrJ.
>>
>>   * Spatial 2D heat-map faceting.
>>
>>   * EnumField now has docValues support.
>>
>>   * API to dynamically add Jars to Solr's classpath for plugins.
>>
>>   * Ability to enable/disable individual stats in the StatsComponent.
>>
>>   * lucene/solr query syntax to give any query clause a constant score.
>>
>>   * Schema API enhancements to remove or replace fields, dynamic
>> fields, field types and copy fields.
>>
>>   * When posting XML or JSON to Solr with curl, there is no need to
>> specify the content type.
>>
>>   * A list of update processors to be used for an update can be
>> specified dynamically for any given request.
>>
>>   * StatsComponent now supports Percentiles.
>>
>>   * facet.contains option to limit which constraints are returned.
>>
>>   * Streaming Aggregation for SolrCloud.
>>
>>   * The admin UI now visualizes Lucene segment information.
>>
>>   * Parameter substitution / macro expansion across entire request
>>
>>
>> On Tue, Apr 14, 2015 at 11:42 AM, Timothy Potter 
>> wrote:
>>
>>> 14 April 2015 - The Lucene PMC is pleased to announce the release of
>>> Apache Solr 5.1.0.
>>>
>>> Solr 5.1.0 is available for immediate download at:
>>> http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0
>>>
>>> Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations
>>> / other changes from over 60 unique contributors.
>>>
>>> For detailed information about what is included in 5.1.0 release,
>>> please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html
>>>
>>> Enjoy!
>>>
>>
>


-- 
Anshum Gupta


JSON Facet & Analytics API in Solr 5.1

2015-04-14 Thread Yonik Seeley
Folks, there's a new JSON Facet API in the just released Solr 5.1
(actually, a new facet module under the covers too).

It's marked as experimental so we have time to change the API based on
your feedback.  So let us know what you like, what you would change,
what's missing, or any other ideas you may have!

I've just started the documentation for the reference guide (on our
confluence wiki), so for now the best doc is on my blog:

http://yonik.com/json-facet-api/
http://yonik.com/solr-facet-functions/
http://yonik.com/solr-subfacets/
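
As a quick taste, a facet request is just a JSON parameter on a normal query,
so from SolrJ you can pass it straight through (a sketch - the techproducts
core and the cat field are only examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: send a JSON Facet API request by adding the json.facet parameter
// to an ordinary query; the facets come back under "facets" in the response.
public class JsonFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/techproducts");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);  // facets only, no documents
        q.add("json.facet", "{categories:{type:terms, field:cat, limit:5}}");
        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getResponse().get("facets"));
        client.close();
    }
}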

I'll also be hanging out more on the #solr-dev IRC channel on freenode
if you want to hit me up there about any development ideas.

-Yonik


Re: proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Ian Rose
Hi Hrishikesh,

Thanks for the pointers - I had not looked at SOLR-5474 previously.
Interesting approach...  I think we will stick with trying to keep zk watches
open from all clients to all collections for now, but if that starts to become
a bottleneck it's good to know the route that SolrJ has chosen...

cheers,
Ian



On Tue, Apr 14, 2015 at 3:56 PM, Hrishikesh Gadre 
wrote:

> Hi Ian,
>
> As per my understanding, Solrj does not use Zookeeper watches but instead
> caches the information (along with a TTL). You can find more information
> here,
>
> https://issues.apache.org/jira/browse/SOLR-5473
> https://issues.apache.org/jira/browse/SOLR-5474
>
> Regards
> Hrishikesh
>
>
> On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose  wrote:
>
> > Hi all -
> >
> > I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
> > client is written in Go, for which I am not aware of an existing client, so
> > we wrote our own.  One tricky bit for this was the routing logic; if a
> > document has routing prefix X and belongs to collection Y, we need to know
> > which solr node to connect to.  Previously we accomplished this by watching
> > the clusterstate.json file in zookeeper - at startup and whenever it
> > changes, the client parses the file contents to build a routing table.
> >
> > However in 5.0 newly created collections do not show up in
> > clusterstate.json but instead have their own state.json document.  Are
> > there any recommendations for how to handle this from the client?  The
> > obvious answer is to watch every collection's state.json document, but we
> > run a lot of collections (~1000 currently, and growing) so I'm concerned
> > about keeping that many watches open at the same time (should I be?).  How
> > does the SolrJ client handle this?
> >
> > Thanks!
> > - Ian
> >
>


Re: proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Hrishikesh Gadre
Hi Ian,

As per my understanding, Solrj does not use Zookeeper watches but instead
caches the information (along with a TTL). You can find more information
here,

https://issues.apache.org/jira/browse/SOLR-5473
https://issues.apache.org/jira/browse/SOLR-5474
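
Roughly, the idea is to read a collection's state.json on demand and cache it
with a TTL instead of keeping one watch per collection (SolrJ also refreshes
when a request fails with a stale-state error). A Java sketch for
illustration - the znode path and TTL here are assumptions:

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the SOLR-5474-style approach: fetch a collection's state.json
// from ZooKeeper on demand and cache it with a TTL.
public class CollectionStateCache {
    private static final long TTL_MS = 60_000;  // assumed TTL, tune as needed
    private final ZooKeeper zk;
    private final Map<String, Entry> cache = new ConcurrentHashMap<String, Entry>();

    private static final class Entry {
        final String json;
        final long fetchedAtMs;
        Entry(String json, long fetchedAtMs) {
            this.json = json;
            this.fetchedAtMs = fetchedAtMs;
        }
    }

    public CollectionStateCache(ZooKeeper zk) { this.zk = zk; }

    public String getState(String collection) throws Exception {
        Entry e = cache.get(collection);
        if (e == null || System.currentTimeMillis() - e.fetchedAtMs > TTL_MS) {
            byte[] data = zk.getData(
                "/collections/" + collection + "/state.json", false, null);
            e = new Entry(new String(data, StandardCharsets.UTF_8),
                          System.currentTimeMillis());
            cache.put(collection, e);
        }
        return e.json;
    }
}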

Regards
Hrishikesh


On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose  wrote:

> Hi all -
>
> I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
> client is written in Go, for which I am not aware of an existing client, so
> we wrote our own.  One tricky bit for this was the routing logic; if a
> document has routing prefix X and belongs to collection Y, we need to know
> which solr node to connect to.  Previously we accomplished this by watching
> the clusterstate.json file in zookeeper - at startup and whenever it changes,
> the client parses the file contents to build a routing table.
>
> However in 5.0 newly created collections do not show up in clusterstate.json
> but instead have their own state.json document.  Are there any
> recommendations for how to handle this from the client?  The obvious answer
> is to watch every collection's state.json document, but we run a lot of
> collections (~1000 currently, and growing) so I'm concerned about keeping
> that many watches open at the same time (should I be?).  How does the SolrJ
> client handle this?
>
> Thanks!
> - Ian
>


Re: [ANNOUNCE] Apache Solr 5.1.0 released

2015-04-14 Thread Joseph Obernberger

Great news!
Any tips on how to do an upgrade from 5.0.0 to 5.1.0?
Thank you!

-Joe

On 4/14/2015 2:39 PM, Timothy Potter wrote:

I apologize - Yonik prepared these nice release notes for 5.1 and I
neglected to include them:

Solr 5.1 Release Highlights:

  * The new Facet Module, including the JSON Facet API.
This module is currently marked as experimental to allow for
further API feedback and improvements.

  * A new JSON request API.
This feature is currently marked as experimental to allow for
further API feedback and improvements.

  * The ability to upload and download Solr configurations via SolrJ
(CloudSolrClient).

  * First-class support for Real-Time Get in SolrJ.

  * Spatial 2D heat-map faceting.

  * EnumField now has docValues support.

  * API to dynamically add Jars to Solr's classpath for plugins.

  * Ability to enable/disable individual stats in the StatsComponent.

  * lucene/solr query syntax to give any query clause a constant score.

  * Schema API enhancements to remove or replace fields, dynamic
fields, field types and copy fields.

  * When posting XML or JSON to Solr with curl, there is no need to
specify the content type.

  * A list of update processors to be used for an update can be
specified dynamically for any given request.

  * StatsComponent now supports Percentiles.

  * facet.contains option to limit which constraints are returned.

  * Streaming Aggregation for SolrCloud.

  * The admin UI now visualizes Lucene segment information.

  * Parameter substitution / macro expansion across entire request


On Tue, Apr 14, 2015 at 11:42 AM, Timothy Potter  wrote:

14 April 2015 - The Lucene PMC is pleased to announce the release of
Apache Solr 5.1.0.

Solr 5.1.0 is available for immediate download at:
http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0

Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations
/ other changes from over 60 unique contributors.

For detailed information about what is included in 5.1.0 release,
please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html

Enjoy!




Re: Disable or limit the size of Lucene field cache

2015-04-14 Thread pras.venkatesh
Thank you. This really helps.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Disable-or-limit-the-size-of-Lucene-field-cache-tp4198798p4199646.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sort by a copy field error

2015-04-14 Thread Andrea Gazzarini
Hi Pedro,
Please post the request that produces that error.

Andrea
On 14 Apr 2015 19:33, "Pedro Figueiredo" 
wrote:

> Hello,
>
>
>
> I have a pretty basic question:  how can I sort by a copyfield?
>
>
>
> My schema conf is:
>
>
>
>  stored="true" omitNorms="true" termVectors="true"/>
>
> 
>
> 
>
>
>
> And when I try to sort by "name_sort" the following error is raised:
>
> "error": {
>   "msg": "sort param field can't be found: name_sort",
>   "code": 400
> }
>
>
>
> Thanks in advanced,
>
>
>
> Pedro Figueiredo
>
>
>
>


Re: sort by a copy field error

2015-04-14 Thread Shawn Heisey
On 4/14/2015 11:32 AM, Pedro Figueiredo wrote:
> And when I try to sort by "name_sort" the following error is raised: 
>
> "error": {
>
> "msg": "sort param field can't be found: name_sort",
>
> "code": 400
>
>   }

What was the exact sort parameter you sent to Solr?

Did you reload the core or restart Solr and then reindex after you
changed your schema?  A reindex will be required.

http://wiki.apache.org/solr/HowToReindex
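
For reference, once the core is reloaded and the data reindexed, the request
should boil down to sort=name_sort asc, e.g. from SolrJ (a sketch, assuming a
single default core):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: sort on the copyField target; the field must be indexed, and the
// documents must be reindexed after any schema change for this to work.
SolrQuery query = new SolrQuery("*:*");
query.setSort("name_sort", SolrQuery.ORDER.asc);
QueryResponse rsp = new HttpSolrServer("http://localhost:8983/solr").query(query);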

Thanks,
Shawn



Re: [ANNOUNCE] Apache Solr 5.1.0 released

2015-04-14 Thread Timothy Potter
I apologize - Yonik prepared these nice release notes for 5.1 and I
neglected to include them:

Solr 5.1 Release Highlights:

 * The new Facet Module, including the JSON Facet API.
   This module is currently marked as experimental to allow for
further API feedback and improvements.

 * A new JSON request API.
   This feature is currently marked as experimental to allow for
further API feedback and improvements.

 * The ability to upload and download Solr configurations via SolrJ
(CloudSolrClient).

 * First-class support for Real-Time Get in SolrJ.

 * Spatial 2D heat-map faceting.

 * EnumField now has docValues support.

 * API to dynamically add Jars to Solr's classpath for plugins.

 * Ability to enable/disable individual stats in the StatsComponent.

 * lucene/solr query syntax to give any query clause a constant score.

 * Schema API enhancements to remove or replace fields, dynamic
fields, field types and copy fields.

 * When posting XML or JSON to Solr with curl, there is no need to
specify the content type.

 * A list of update processors to be used for an update can be
specified dynamically for any given request.

 * StatsComponent now supports Percentiles.

 * facet.contains option to limit which constraints are returned.

 * Streaming Aggregation for SolrCloud.

 * The admin UI now visualizes Lucene segment information.

 * Parameter substitution / macro expansion across entire request


On Tue, Apr 14, 2015 at 11:42 AM, Timothy Potter  wrote:
> 14 April 2015 - The Lucene PMC is pleased to announce the release of
> Apache Solr 5.1.0.
>
> Solr 5.1.0 is available for immediate download at:
> http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0
>
> Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations
> / other changes from over 60 unique contributors.
>
> For detailed information about what is included in 5.1.0 release,
> please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html
>
> Enjoy!


Errors during Indexing in SOLR 4.6

2015-04-14 Thread Gopal Agarwal
I think you should look into:
https://issues.apache.org/jira/browse/LUCENE-5923

If you have any more details, get in touch.


[ANNOUNCE] Apache Solr 5.1.0 released

2015-04-14 Thread Timothy Potter
14 April 2015 - The Lucene PMC is pleased to announce the release of
Apache Solr 5.1.0.

Solr 5.1.0 is available for immediate download at:
http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0

Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations
/ other changes from over 60 unique contributors.

For detailed information about what is included in 5.1.0 release,
please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html

Enjoy!


sort by a copy field error

2015-04-14 Thread Pedro Figueiredo
Hello,

 

I have a pretty basic question:  how can I sort by a copyfield?

 

My schema conf is:

[schema snippet stripped by the mail archive: the field definition, the
name_sort field definition, and the copyField into name_sort]

And when I try to sort by "name_sort" the following error is raised:

"error": {
  "msg": "sort param field can't be found: name_sort",
  "code": 400
}

 

Thanks in advance,

 

Pedro Figueiredo

 



Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

It seems something like https://issues.apache.org/jira/browse/TIKA-1251.
I see you're using Solr 4.10.2, which uses Tika 1.5, and that issue seems
to be fixed in Tika 1.6.


I agree with Erick: you should try with another version of Tika.

Best,
Andrea

On 04/14/2015 06:44 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Andrea,

Yes, I am using the stock schema.xml that comes with the example server of
Solr 4.10.2, hence I am not sure why the PDF content is not getting extracted
and put into the content field in the index.

Please find the log information for the Parsing error below.


org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more

ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.ha

Re: Indexing PDF and MS Office files

2015-04-14 Thread Erick Erickson
Looks like this is just a file that Tika can't handle, based on this line:

bq: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

You might be able to get some joy from parsing the file from Java directly,
to see whether a more recent Tika would fix it. Here's some sample code:

http://lucidworks.com/blog/indexing-with-solrj/
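
And for a quick standalone check of the failing file, something along these
lines (a sketch; put a newer tika-app jar on the classpath and point it at
the document):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Sketch: run Tika's auto-detect parsing outside Solr to see whether a given
// file parses at all and how much text comes out.
public class TikaCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);  // no write limit
        Metadata metadata = new Metadata();
        InputStream in = Files.newInputStream(Paths.get(args[0]));
        try {
            parser.parse(in, handler, metadata);
        } finally {
            in.close();
        }
        System.out.println("Parsed by: " + metadata.get("X-Parsed-By"));
        System.out.println("Extracted " + handler.toString().length() + " characters");
    }
}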

Best,
Erick

On Tue, Apr 14, 2015 at 9:44 AM, Vijaya Narayana Reddy Bhoomi Reddy
 wrote:
> Andrea,
>
> Yes, I am using the stock schema.xml that comes with the example server of
> Solr 4.10.2, hence I am not sure why the PDF content is not getting extracted
> and put into the content field in the index.
>
> Please find the log information for the Parsing error below.
>
>
> org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@138b0c5
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> ... 32 more
> Caused by: java.lang.IllegalArgumentException: This paragraph is not the
> first one in the table
> at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
> at
> org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
> at
> org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
> at
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
> at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
> at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
> at org.apache.

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Andrea,

Yes, I am using the stock schema.xml that comes with the example server of
Solr 4.10.2, hence I am not sure why the PDF content is not getting extracted
and put into the content field in the index.

Please find the log information for the Parsing error below.


org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more

ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyR

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

Hi,
solrconfig.xml (especially if you didn't touch it) should be good. What 
about the schema? Are you using the one that comes with the download 
bundle, too?


I don't see the stack trace... did you forget to paste it?

Best,
Andrea

On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

Here are the solrconfig.xml and the error log from the Solr logs for your 
reference. As mentioned earlier, I didn't make any changes to 
solrconfig.xml; I am using the out-of-the-box file that came 
with the default installation.


Please let me know your thoughts on why these issues are occurring.

Thanks & Regards
Vijay




*Vijay Bhoomireddy*, Big Data Architect

1000 Great West Road, Brentford, London, TW8 9DW

*T:+44 20 3475 7980*
*M:**+44 7481 298 360*
*W: *www.whishworks.com






On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy 
> wrote:


Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx,
.ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the
following issues. Please let me know what is going
wrong with the indexing process.

I am using Solr 4.10.2 with the default example server
configuration that comes with the Solr distribution.

PDF Files - Indexing as such works fine, and when I query using
*:* in the Solr Query console, the metadata information is displayed
properly. However, the PDF content field is empty. This happens
for every PDF file I have tried, including some proprietary files
and PDF eBooks. Whatever the PDF file, the content is not displayed.

MS Office files - For some office files, everything works perfectly
and the extracted content is visible in the query console.
However, for others, I see the error message below during the
indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser*

I am using SolrJ to index the documents and below is the code
snippet related to indexing. Please let me know where the issue is
occurring.

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq =
    new ContentStreamUpdateRequest("/update/extract");

indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards
Vijay




The contents of this e-mail are confidential and for the exclusive use 
of the intended recipient. If you receive this e-mail in error please 
delete it from your system immediately and notify us either by e-mail 
or telephone. You should not copy, forward or otherwise disclose the 
content of the e-mail. The views expressed in this communication may 
not necessarily be the view held by WHISHWORKS. 




Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

Here are the solrconfig.xml and the error log from the Solr logs for your
reference. As mentioned earlier, I didn't make any changes to
solrconfig.xml; I am using the out-of-the-box file that came
with the default installation.

Please let me know your thoughts on why these issues are occurring.

Thanks & Regards
Vijay


*Vijay Bhoomireddy*, Big Data Architect

1000 Great West Road, Brentford, London, TW8 9DW
*T:  +44 20 3475 7980*
*M: **+44 7481 298 360*
*W: *www.whishworks.com



  


On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
> Please let me know what is going wrong with the indexing process.
>
> I am using Solr 4.10.2 with the default example server configuration
> that comes with the Solr distribution.
>
> PDF Files - Indexing as such works fine, and when I query using *:* in the
> Solr Query console, the metadata information is displayed properly. However,
> the PDF content field is empty. This happens for every PDF file I have
> tried, including some proprietary files, PDF eBooks, etc. Whatever the
> PDF file, the content is not displayed.
>
> MS Office files - For some office files, everything works perfectly and the
> extracted content is visible in the query console. However, for others, I
> see the error message below during the indexing process.
>
> *Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser*
>
>
> I am using SolrJ to index the documents and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
>
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq =
>     new ContentStreamUpdateRequest("/update/extract");
>
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("literal.id", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
>
> Thanks & Regards
> Vijay
>
>
>

-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.





  

  
[Attachment: solrconfig.xml from the stock Solr 4.10.2 example server. The
mail archive stripped all of the XML markup, leaving only scattered element
values (luceneMatchVersion 4.10.2, ${solr.data.dir:}, the HDFS directory
factory properties, ${solr.lock.type:native}, autoCommit maxTime 15000,
cache sizes, and the /select, /query, /export and /browse handler defaults),
so the unreadable remainder is omitted here.]

proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Ian Rose
Hi all -

I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
client is written in Go, for which I am not aware of an existing client, so we
wrote our own.  One tricky bit for this was the routing logic; if a document
has routing prefix X and belongs to collection Y, we need to know which solr
node to connect to.  Previously we accomplished this by watching the
clusterstate.json file in zookeeper - at startup and whenever it changes,
the client parses the file contents to build a routing table.

However in 5.0 newly created collections do not show up in clusterstate.json
but instead have their own state.json document.  Are there any
recommendations for how to handle this from the client?  The obvious answer
is to watch every collection's state.json document, but we run a lot of
collections (~1000 currently, and growing) so I'm concerned about keeping
that many watches open at the same time (should I be?).  How does the SolrJ
client handle this?

Thanks!
- Ian


Re: Problem related to filter on Zero value for DateField

2015-04-14 Thread Jack Krupansky
What does your main query look like? Normally we don't speak of "searching"
with the fq parameter - it filters the results, but the actual searching is
done via the main query with the q parameter.

-- Jack Krupansky

On Tue, Apr 14, 2015 at 4:17 AM, Ali Nazemian  wrote:

> Dears,
> Hi,
> I have a strange problem with Solr 4.10.x. My problem is that when I search
> on the Solr zero date, "0002-11-30T00:00:00Z", the results become invalid if
> more than one filter is applied. For example, consider this scenario:
> When I search with fq=p_date:"0002-11-30T00:00:00Z", Solr
> returns three different documents, which is right for my collection. All
> three documents have the same document status value of "7". If I
> search with fq=document_status:7, the same three documents are returned,
> which is also a correct response. But when I search with
> fq=document_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns
> nothing (0 documents), while I have no such problem with date values other
> than the Solr zero date ("0002-11-30T00:00:00Z"). Please let me know whether
> this is a bug in Solr or I did something wrong?
> Best regards.
>
> --
> A.Nazemian
>


Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

Hi Vijay,
Please paste an extract of your schema, where the "content" field (the
field where the PDF text should be) and its type are declared.

For the other issue, please paste the whole stacktrace because

org.apache.tika.parser.microsoft.OfficeParser*

says nothing. The complete stacktrace (or at least another three / four 
lines) should contain some other detail.


Best,
Andrea

On 04/14/2015 04:57 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
Please let me know what is going wrong with the indexing process.

I am using Solr 4.10.2 with the default example server configuration
that comes with the Solr distribution.

PDF Files - Indexing as such works fine, and when I query using *:* in the
Solr Query console, the metadata information is displayed properly. However,
the PDF content field is empty. This happens for every PDF file I have
tried, including some proprietary files, PDF eBooks, etc. Whatever the
PDF file, the content is not displayed.

MS Office files - For some office files, everything works perfectly and the
extracted content is visible in the query console. However, for others, I
see the error message below during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq =
    new ContentStreamUpdateRequest("/update/extract");

indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards
Vijay





Re: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-14 Thread elisabeth benoit
Thanks for your answer!

I didn't realize this was not supposed to be done (a conjunction of
DirectSolrSpellChecker and FileBasedSpellChecker). I got the idea from the
mailing list while searching for a way to give the DirectSolrSpellChecker a
list of words to ignore.

Well well well, I'll try removing the check and see what happens. I'm not a
Java programmer, but if I can find a simple solution I'll let you know.
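
One idea I might try (an untested sketch): compare the StringDistance classes
instead of the instances, since from the error message both checkers use
LuceneLevenshteinDistance but hold two different objects:

// untested sketch: LuceneLevenshteinDistance does not override equals(), so
// each spellchecker's own instance fails the identity-based check; comparing
// classes would treat same-type distance measures as equivalent
else if (!stringDistance.getClass().equals(checker.getStringDistance().getClass())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance.");
}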

Thanks again,
Elisabeth

2015-04-14 16:29 GMT+02:00 Dyer, James :

> Elisabeth,
>
> Currently ConjunctionSolrSpellChecker only supports adding
> WordBreakSolrSpellchecker to IndexBased- FileBased- or
> DirectSolrSpellChecker.  In the future, it would be great if it could
> handle other Spell Checker combinations.  For instance, if you had a
> (e)dismax query that searches multiple fields, to have a separate
> spellchecker for each of them.
>
> But CSSC is not hardened for this more general usage, as hinted in the API
> doc.  The check done to ensure all spellcheckers use the same
> stringdistance object, I believe, is a safeguard against using this class
> for functionality it is not able to correctly support.  It looks to me that
> SOLR-6271 was opened to fix the bug in that it is comparing references on
> the stringdistance.  This is not a problem with WBSSC because this one does
> not support string distance at all.
>
> What you're hoping for, however, is that the requirement for the string
> distances be the same to be removed entirely.  You could try modifying the
> code by removing the check.  However beware that you might not get the
> results you desire!  But should this happen, please, go ahead and fix it
> for your use case and then donate the code.  This is something I've
> personally wanted for a long time.
>
> James Dyer
> Ingram Content Group
>
>
> -Original Message-
> From: elisabeth benoit [mailto:elisaelisael...@gmail.com]
> Sent: Tuesday, April 14, 2015 7:37 AM
> To: solr-user@lucene.apache.org
> Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr
> 4.10.1
>
> Hello,
>
> I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and
> FileBasedSpellchecker in same request.
>
> I've applied the change from 135.patch (cf. SOLR-6271). I tried running
> the command "patch -p1 -i 135.patch --dry-run" but it didn't work, maybe
> because the patch was a fix for Solr 4.9, so I just replaced this line in
> ConjunctionSolrSpellChecker:
>
> else if (!stringDistance.equals(checker.getStringDistance())) {
>   throw new IllegalArgumentException(
>       "All checkers need to use the same StringDistance.");
> }
>
> by
>
> else if (!stringDistance.equals(checker.getStringDistance())) {
>   throw new IllegalArgumentException(
>       "All checkers need to use the same StringDistance!!! 1:" +
>       checker.getStringDistance() + " 2: " + stringDistance);
> }
>
> as it was done in the patch
>
> but still, when I send a spellcheck request, I get the error
>
> msg": "All checkers need to use the same StringDistance!!!
> 1:org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db32:
> org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08"
>
> From the error message I gather both spellcheckers use the same distance
> measure, LuceneLevenshteinDistance, but they're not the same instance of
> LuceneLevenshteinDistance.
>
> Is the condition all right? What should be done to fix this properly?
>
> Thanks,
> Elisabeth
>


Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server configuration
that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.* in the
Solr Query console, metadata information is displayed properly. However,
the PDF content field is empty. This is happening for all PDF files I have
tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect and the
extracted content is visible in the query console. However, for others, I
see the below error message during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

static String solrServerURL = "
http://localhost:8983/solr";;
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new

ContentStreamUpdateRequest("/update/extract");

indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards
Vijay



RE: Securing solr index

2015-04-14 Thread Davis, Daniel (NIH/NLM) [C]
That's a good point - if he's talking about securing the Solr filesystem, he
can use standard mechanisms.

You can also go beyond user/group/other permissions if your filesystem
supports it: you can use POSIX ACLs on many local Linux filesystems.
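
For example, a minimal sketch, assuming the index lives under
/var/solr/data, Solr runs as a hypothetical service user "solr", and
"reporting" is an extra account that should get read-only access:

chown -R solr:solr /var/solr/data            # index owned by the service user
chmod -R u=rwX,g=,o= /var/solr/data          # no group/other access
setfacl -R -m u:reporting:rX /var/solr/data  # POSIX ACL: read-only for one extra user
getfacl /var/solr/data                       # verify the resulting ACL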

-Original Message-
From: Per Steffensen [mailto:st...@designware.dk] 
Sent: Tuesday, April 14, 2015 8:04 AM
To: solr-user@lucene.apache.org
Subject: Re: Securing solr index

Hi

I might misunderstand you, but if you are talking about securing the actual 
files/folders of the index, I do not think this is a Solr/Lucene concern. Use 
standard mechanisms of your OS. E.g. on linux/unix use chown, chgrp, chmod, 
sudo, apparmor etc - e.g. allowing only root to write the folders/files and 
sudo the user running Solr/Lucene to operate as root in this area. Even admins 
should not (normally) operate as root
- that way they cannot write the files either. No one knows the root-password - 
except maybe for the super-super-admin, or you split the root-password in two 
and two admins know a part each, so that they have to both agree in order to 
operate as root. Be creative yourself.

Regards, Per Steffensen

On 13/04/15 12:13, Suresh Vanasekaran wrote:
> Hi,
>
> We have the Solr index maintained on a central server, and multiple users
> might be able to access the index data.
>
> May I know the best practices for securing the Solr index folder, where
> ideally only the application user should be able to access it? Even an
> admin user should not be able to copy the data and use it in another
> schema.
>
> Thanks
>



RE: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-14 Thread Dyer, James
Elisabeth,

Currently ConjunctionSolrSpellChecker only supports adding a
WordBreakSolrSpellChecker to an IndexBased-, FileBased-, or
DirectSolrSpellChecker.  In the future, it would be great if it could
handle other spellchecker combinations - for instance, if you had an
(e)dismax query that searches multiple fields, a separate spellchecker for
each of them.

But CSSC is not hardened for this more general usage, as hinted in the API
doc.  The check that ensures all spellcheckers use the same StringDistance
object is, I believe, a safeguard against using this class for
functionality it cannot correctly support.  It looks to me like SOLR-6271
was opened to fix the bug that the check compares references on the
StringDistance.  This is not a problem with WBSSC because that checker does
not support string distances at all.

What you're hoping for, however, is for the requirement that the string
distances be the same to be removed entirely.  You could try modifying the
code by removing the check; however, beware that you might not get the
results you desire!  Should it work out, please go ahead and fix it for
your use case and then donate the code.  This is something I've personally
wanted for a long time.
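
For reference, one way to relax the check without dropping it entirely
would be to compare the concrete StringDistance classes instead of the
instances. An untested sketch against the 4.10.x
ConjunctionSolrSpellChecker source, not a committed fix:

// Accept distances of the same concrete class. LuceneLevenshteinDistance
// in 4.10.x apparently does not override equals() (hence the error in the
// quoted message below), so two instances never compare equal.
StringDistance other = checker.getStringDistance();
if (stringDistance == null) {
  stringDistance = other;
} else if (other != null
    && !stringDistance.getClass().equals(other.getClass())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance class.");
}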

James Dyer
Ingram Content Group


-Original Message-
From: elisabeth benoit [mailto:elisaelisael...@gmail.com] 
Sent: Tuesday, April 14, 2015 7:37 AM
To: solr-user@lucene.apache.org
Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

Hello,

I am using Solr 4.10.1 and trying to use a DirectSolrSpellChecker and a
FileBasedSpellChecker in the same request.

I've applied the change from patch 135.patch (cf. SOLR-6271). I tried
running the command "patch -p1 -i 135.patch --dry-run" but it didn't work,
maybe because the patch was a fix for Solr 4.9, so I just replaced this
block in ConjunctionSolrSpellChecker

else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance.");
}


by

else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance!!! 1:" +
          checker.getStringDistance() + " 2: " + stringDistance);
}

as it was done in the patch

but still, when I send a spellcheck request, I get the error

msg": "All checkers need to use the same StringDistance!!!
1:org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db32:
org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08"

From the error message I gather that both spellcheckers use the same
distance measure, LuceneLevenshteinDistance, but they are not the same
instance of LuceneLevenshteinDistance.

Is the condition all right? What should be done to fix this properly?

Thanks,
Elisabeth


Re: Java.net.socketexception: broken pipe Solr 4.10.2

2015-04-14 Thread jaime spicciati
We ran into this during our indexing process running on 4.10.3. After
increasing ZooKeeper timeouts, client timeouts, and socket timeouts and
implementing retry logic in our loading process, the thing that finally
worked was to change the hard-commit timing. We were performing a hard
commit every 5 minutes, and after a couple of hours of loading data some of
the shards would start going down because they would time out with
ZooKeeper and/or close connections. Changing the timeouts just moved the
problem later in the ingest process.

Through a combination of decreasing the hard-commit interval to 15 seconds
and migrating to the G1 garbage collector, we are able to prevent ingest
failures. For us, the periodic stop-the-world garbage collections were
causing connections to be closed and other nasty things, such as ZooKeeper
timeouts that would cause recovery to kick in. (Soft commits are turned off
until the full ingest/baseline completes.) I believe that until a hard
commit is issued Solr keeps the uncommitted data in memory, which explains
why we were experiencing such nasty garbage collections.
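
For what it's worth, a minimal solrconfig.xml sketch of that commit policy
(the 15-second hard commit is the value described above; openSearcher=false
is an assumption that fits a bulk-ingest setup):

<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit every 15 seconds -->
  <openSearcher>false</openSearcher> <!-- flush segments without opening a new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime>              <!-- soft commits disabled during the ingest -->
</autoSoftCommit>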

The other change we made which may have helped is that we ensured the
socket timeouts were in sync between the Jetty instance running Solr and
the SolrJ client loading the data. During some of our batch updates Solr
would take a couple of minutes to respond, and I believe in some instances
the server-side socket would be closed (the maxIdleTime setting in Jetty).

Hope this helps,
Jaime Spicciati

Thanks
Jaime


On Tue, Apr 14, 2015 at 9:26 AM, vsilgalis  wrote:

> Right now index size is about 10GB on each shard (yes I could use more
> RAM),
> but I'm looking more for a step-up than a step-down approach.  I will try
> adding more RAM to these machines as my next step.
>
> 1. Zookeeper is external to these boxes in a three node cluster with more
> than enough RAM to keep everything off disk.
>
> 2. os disk cache, when I add more RAM I will just add it as RAM for the
> machine and not to the Java Heap unless that is something you recommend.
>
> 3. java heap looks good so far, GC is minimal as far as i can tell but I
> can
> look into this some more.
>
> 4. we do have 2 cores per machine, but the second core is a joke (10MB)
>
> note: zkClientTimeout is set to 30 for safety's sake.
>
> java settings:
>
> -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled
> -XX:+CMSParallelRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:CMSTriggerPermRatio=80 -XX:CMSInitiatingOccupancyFraction=50
> -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1
> -XX:PretenureSizeThreshold=64m -XX:+CMSScavengeBeforeRemark
> -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC -XX:MaxTenuringThreshold=8 -XX:TargetSurvivorRatio=90
> -XX:SurvivorRatio=4 -XX:NewRatio=3 -XX:-UseSuperWord -Xmx5588m -Xms1596m
>
>
>


Re: Java.net.socketexception: broken pipe Solr 4.10.2

2015-04-14 Thread vsilgalis
Right now index size is about 10GB on each shard (yes I could use more RAM),
but I'm looking more for a step-up than a step-down approach.  I will try
adding more RAM to these machines as my next step.

1. Zookeeper is external to these boxes in a three node cluster with more
than enough RAM to keep everything off disk.

2. os disk cache, when I add more RAM I will just add it as RAM for the
machine and not to the Java Heap unless that is something you recommend.

3. java heap looks good so far, GC is minimal as far as i can tell but I can
look into this some more.

4. we do have 2 cores per machine, but the second core is a joke (10MB)

note: zkClientTimeout is set to 30 for safety's sake.

java settings:
-XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled
-XX:+CMSParallelRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=6000
-XX:CMSTriggerPermRatio=80 -XX:CMSInitiatingOccupancyFraction=50
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1
-XX:PretenureSizeThreshold=64m -XX:+CMSScavengeBeforeRemark
-XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC
-XX:+UseParNewGC -XX:MaxTenuringThreshold=8 -XX:TargetSurvivorRatio=90
-XX:SurvivorRatio=4 -XX:NewRatio=3 -XX:-UseSuperWord -Xmx5588m -Xms1596m





using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-14 Thread elisabeth benoit
Hello,

I am using Solr 4.10.1 and trying to use a DirectSolrSpellChecker and a
FileBasedSpellChecker in the same request.

I've applied the change from patch 135.patch (cf. SOLR-6271). I tried
running the command "patch -p1 -i 135.patch --dry-run" but it didn't work,
maybe because the patch was a fix for Solr 4.9, so I just replaced this
block in ConjunctionSolrSpellChecker

else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance.");
}


by

else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance!!! 1:" +
          checker.getStringDistance() + " 2: " + stringDistance);
}

as it was done in the patch

but still, when I send a spellcheck request, I get the error

msg": "All checkers need to use the same StringDistance!!!
1:org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db32:
org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08"

From the error message I gather that both spellcheckers use the same
distance measure, LuceneLevenshteinDistance, but they are not the same
instance of LuceneLevenshteinDistance.

Is the condition all right? What should be done to fix this properly?

Thanks,
Elisabeth


Re: Securing solr index

2015-04-14 Thread Per Steffensen

Hi

I might misunderstand you, but if you are talking about securing the 
actual files/folders of the index, I do not think this is a Solr/Lucene 
concern. Use standard mechanisms of your OS. E.g. on linux/unix use 
chown, chgrp, chmod, sudo, apparmor etc - e.g. allowing only root to 
write the folders/files and sudo the user running Solr/Lucene to operate 
as root in this area. Even admins should not (normally) operate as root 
- that way they cannot write the files either. No one knows the 
root-password - except maybe for the super-super-admin, or you split the 
root-password in two and two admins know a part each, so that they have 
to both agree in order to operate as root. Be creative yourself.
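
A minimal sketch of that scheme, assuming the index lives under
/var/solr/data and Solr is started through a hypothetical script at
/opt/solr/bin/solr:

chown -R root:root /var/solr/data   # only root owns the index files
chmod -R 700 /var/solr/data         # no access for group/other
# /etc/sudoers entry letting the (hypothetical) solr account start Solr as root:
#   solr ALL=(root) NOPASSWD: /opt/solr/bin/solr start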


Regards, Per Steffensen

On 13/04/15 12:13, Suresh Vanasekaran wrote:

Hi,

We have the Solr index maintained on a central server, and multiple users
might be able to access the index data.

May I know the best practices for securing the Solr index folder, where
ideally only the application user should be able to access it? Even an
admin user should not be able to copy the data and use it in another
schema.

Thanks








facet on external field

2015-04-14 Thread jainam vora
Hi,

I am using an external file field for the price field since it changes
frequently. Is it possible to generate facets from an external field? If
so, how?

I understand that faceting requires indexed values, and external field
values are not actually indexed.


-- 
Thanks & Regards,
Jainam Vora


Errors during Indexing in SOLR 4.6

2015-04-14 Thread abhi Abhishek
Hi All,
 we recently migrated from Solr 3.6 to Solr 4; while indexing in Solr 4
we are getting the exception below.

Apr 1, 2015 9:22:57 AM org.apache.solr.common.SolrException log

SEVERE: null:org.apache.solr.common.SolrException: Exception writing
document id 932684555 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
Caused by: java.lang.IllegalArgumentException: first position increment
must be > 0 (got 0) for field 'DataEnglish'
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:131)



This works perfectly fine in Solr 3.6. Can someone help in debugging this?
Any fixes/solutions?


Thanks in Advance.


Best Regards,

Abhishek


Re: Java.net.socketexception: broken pipe Solr 4.10.2

2015-04-14 Thread Shawn Heisey
On 4/13/2015 10:11 PM, vsilgalis wrote:
> just a couple of notes:
> this a 2 shard setup with 2 nodes per shard.
> 
> Currently these are on VMs with 8 cores and 8GB of ram each (java max heap
> is ~5588mb but we usually never even get that high) backed by a NFS file
> store which we store the indexes on (netapp SAN with nfs exports on SAS
> disk).

Broken pipe errors usually indicate that the client gave up waiting for
the server and disconnected the TCP connection before the server
completed processing and sent a response.  This is frequently because of
configured timeouts on the client.  If reasonable timeouts are being
exceeded, it's usually a performance problem.
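
If the client is SolrJ, those timeouts can be raised explicitly while the
underlying performance problem is investigated. A minimal sketch (the URL
and the values are assumptions, not recommendations):

// Raise the SolrJ client timeouts so that slow-but-successful requests
// are not abandoned mid-flight, which surfaces server-side as a broken pipe.
HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
server.setConnectionTimeout(10000); // ms allowed to establish the TCP connection
server.setSoTimeout(120000);        // ms to wait on the socket for a response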

You haven't indicated how much disk space is occupied by the index data
on each of these servers.  There are also several other things that
would be helpful to know.

Please read this wiki page, then come back with any questions you might
have, and I may also ask a question or two:

http://wiki.apache.org/solr/SolrPerformanceProblems

My immediate suspects are an OS disk cache that is too small, and/or
problems with garbage collection pauses.  These are two of the issues
discussed on that wiki page.

Thanks,
Shawn



Problem related to filter on Zero value for DateField

2015-04-14 Thread Ali Nazemian
Dears,
Hi,
I have a strange problem with Solr 4.10.x: when I search on the Solr zero
date, "0002-11-30T00:00:00Z", the results become invalid if more than one
filter is applied. For example, consider this scenario:
when I search for documents with fq=p_date:"0002-11-30T00:00:00Z", Solr
returns three different documents, which is correct for my collection. All
three documents have the same document status value of "7". If I search
with fq=document_status:7, the same three documents are returned, which is
also a correct response. But when I search with
fq=document_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns
nothing (0 documents)! I have no such problem with date values other than
the Solr zero date ("0002-11-30T00:00:00Z"). Please let me know: is this a
bug in Solr, or did I do something wrong?
Best regards.

-- 
A.Nazemian