Re: Re: Interpreting Solr indexing times

2021-01-13 Thread Alessandro Benedetti
I agree, documents may be gigantic or very small, with heavy text analysis
or simple strings, so it's not possible to give a general evaluation here.
But you could make use of the nightly benchmarks to get an idea of
Lucene indexing speed (Lucene being the engine inside Apache Solr):

http://home.apache.org/~mikemccand/lucenebench/indexing.html

I'm not sure we have something similar for Apache Solr officially.
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceData exists,
but it may be a bit outdated.

Cheers



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io


Re: Interpreting Solr indexing times

2021-01-10 Thread xiefengchang
it's hard to answer your question without your solrconfig.xml and
managed-schema (or schema.xml); it would be good to have some log snippets as well.


At 2021-01-07 21:28:00, "ufuk yılmaz"  wrote:
>Hello all,
>
>I have been looking at our SolrCloud indexing performance statistics and 
>trying to make sense of the numbers. We are using a custom Flume sink and 
>sending updates to Solr (8.4) using SolrJ.
>
>I know this depends on a lot of things, but can you tell me if these
>statistics are horribly bad (which would mean something is obviously going wrong),
>or something to be expected from a Solr cluster under the right circumstances?
>
>We are sending documents in batches of 1000.
>
>{
>  "UPDATE./update.distrib.requestTimes": {
>"count": 7579,
>"meanRate": 0.044953336300254124,
>"1minRate": 0.2855655259375961,
>"5minRate": 0.29214637836736357,
>"15minRate": 0.29510868125823914,
>"min_ms": 5.854106,
>"max_ms": 56854.784017,
>"mean_ms": 3100.877968690649,
>"median_ms": 1084.258683,
>"stddev_ms": 4643.097311691323,
>"p75_ms": 2407.196867,
>"p95_ms": 15509.748909,
>"p99_ms": 16206.134345,
>"p999_ms": 16206.134345
>  },
>  "UPDATE./update.local.totalTime": 0,
>  "UPDATE./update.requestTimes": {
>"count": 7579,
>"meanRate": 0.044953336230621366,
>"1minRate": 0.2855655259375961,
>"5minRate": 0.29214637836736357,
>"15minRate": 0.29510868125823914,
>"min_ms": 5.857796,
>"max_ms": 56854.792298,
>"mean_ms": 3100.885675292589,
>"median_ms": 1084.264825,
>"stddev_ms": 4643.097457508117,
>"p75_ms": 2407.201642,
>"p95_ms": 15509.755934,
>"p99_ms": 16206.141754,
>"p999_ms": 16206.141754
>  },
>  "UPDATE./update.requests": 7580,
>  "UPDATE./update.totalTime": 33520426747162,
>  "UPDATE.update.totalTime": 0,
>  "UPDATE.updateHandler.adds": 854,
>  "UPDATE.updateHandler.autoCommitMaxTime": "15000ms",
>  "UPDATE.updateHandler.autoCommits": 2428,
>  "UPDATE.updateHandler.softAutoCommitMaxTime":"1ms",
>  "UPDATE.updateHandler.softAutoCommits":3380,
>  "UPDATE.updateHandler.commits": {
>"count": 5777,
>"meanRate": 0.034265134931240636,
>"1minRate": 0.13653886429826526,
>"5minRate": 0.12997330621941325,
>"15minRate": 0.12634106125326003
>  },
>  "UPDATE.updateHandler.cumulativeAdds": {
>"count": 2578492,
>"meanRate": 15.293816240408821,
>"1minRate": 90.7054223213904,
>"5minRate": 99.48315440730897,
>"15minRate": 101.77967003607128
>  },
>}
>
>
>


Interpreting Solr indexing times

2021-01-07 Thread ufuk yılmaz
Hello all,

I have been looking at our SolrCloud indexing performance statistics and trying 
to make sense of the numbers. We are using a custom Flume sink and sending 
updates to Solr (8.4) using SolrJ.

I know this depends on a lot of things, but can you tell me if these
statistics are horribly bad (which would mean something is obviously going wrong),
or something to be expected from a Solr cluster under the right circumstances?

We are sending documents in batches of 1000.

{
  "UPDATE./update.distrib.requestTimes": {
"count": 7579,
"meanRate": 0.044953336300254124,
"1minRate": 0.2855655259375961,
"5minRate": 0.29214637836736357,
"15minRate": 0.29510868125823914,
"min_ms": 5.854106,
"max_ms": 56854.784017,
"mean_ms": 3100.877968690649,
"median_ms": 1084.258683,
"stddev_ms": 4643.097311691323,
"p75_ms": 2407.196867,
"p95_ms": 15509.748909,
"p99_ms": 16206.134345,
"p999_ms": 16206.134345
  },
  "UPDATE./update.local.totalTime": 0,
  "UPDATE./update.requestTimes": {
"count": 7579,
"meanRate": 0.044953336230621366,
"1minRate": 0.2855655259375961,
"5minRate": 0.29214637836736357,
"15minRate": 0.29510868125823914,
"min_ms": 5.857796,
"max_ms": 56854.792298,
"mean_ms": 3100.885675292589,
"median_ms": 1084.264825,
"stddev_ms": 4643.097457508117,
"p75_ms": 2407.201642,
"p95_ms": 15509.755934,
"p99_ms": 16206.141754,
"p999_ms": 16206.141754
  },
  "UPDATE./update.requests": 7580,
  "UPDATE./update.totalTime": 33520426747162,
  "UPDATE.update.totalTime": 0,
  "UPDATE.updateHandler.adds": 854,
  "UPDATE.updateHandler.autoCommitMaxTime": "15000ms",
  "UPDATE.updateHandler.autoCommits": 2428,
  "UPDATE.updateHandler.softAutoCommitMaxTime":"1ms",
  "UPDATE.updateHandler.softAutoCommits":3380,
  "UPDATE.updateHandler.commits": {
"count": 5777,
"meanRate": 0.034265134931240636,
"1minRate": 0.13653886429826526,
"5minRate": 0.12997330621941325,
"15minRate": 0.12634106125326003
  },
  "UPDATE.updateHandler.cumulativeAdds": {
"count": 2578492,
"meanRate": 15.293816240408821,
"1minRate": 90.7054223213904,
"5minRate": 99.48315440730897,
"15minRate": 101.77967003607128
  },
}
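
(For scale: the ~1084 ms median per 1000-doc batch works out to roughly 900
docs/s per request, while the ~3100 ms mean is closer to 320 docs/s; the p95
above 15 s suggests a heavy tail of slow batches is where most of the time goes.)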




Re: SOLR indexing takes longer time

2020-08-18 Thread Walter Underwood
Instead of writing code, I’d fire up SQL Workbench/J, load the same JDBC driver
that is being used in Solr, and run the query.

https://www.sql-workbench.eu 

If that takes 3.5 hours, you have isolated the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 18, 2020, at 6:50 AM, David Hastings  
> wrote:
> 
> Another thing to mention is to make sure the indexer you build doesn't send
> commits until it's actually done.  Made that mistake with some early in-house
> indexers.
> 
> On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:
> 
>> 1. You could write some code to pull the items out of Mongo and dump
>> them to disk - if this is still slow, then it's Mongo that's the problem.
>> 2. Write a standalone indexer to replace DIH, it's single threaded and
>> deprecated anyway.
>> 3. Minor point - consider whether you need to index everything every
>> time or just the deltas.
>> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
>> old version you're running.
>> 
>> HTH
>> 
>> Charlie
>> 
>> On 17/08/2020 19:22, Abhijit Pawar wrote:
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just a single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the Mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrading to a newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 
>> --
>> Charlie Hull
>> OpenSource Connections, previously Flax
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.o19s.com
>> 
>> 



Re: SOLR indexing takes longer time

2020-08-18 Thread David Hastings
Another thing to mention is to make sure the indexer you build doesn't send
commits until it's actually done.  Made that mistake with some early in-house
indexers.

On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:

> 1. You could write some code to pull the items out of Mongo and dump
> them to disk - if this is still slow, then it's Mongo that's the problem.
> 2. Write a standalone indexer to replace DIH, it's single threaded and
> deprecated anyway.
> 3. Minor point - consider whether you need to index everything every
> time or just the deltas.
> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
> old version you're running.
>
> HTH
>
> Charlie
>
> On 17/08/2020 19:22, Abhijit Pawar wrote:
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just a single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the Mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrading to a newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: SOLR indexing takes longer time

2020-08-18 Thread Charlie Hull
1. You could write some code to pull the items out of Mongo and dump 
them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH, it's single threaded and 
deprecated anyway.
3. Minor point - consider whether you need to index everything every 
time or just the deltas.
4. Upgrade Solr anyway, not for speed reasons but because that's a very 
old version you're running.


HTH

Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:

Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just a single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the Mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrading to a newer version help?

Appreciate your help!

Regards,
Abhijit



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: SOLR indexing takes longer time

2020-08-17 Thread Aroop Ganguly
Adding on to what others have said, indexing speed in general is largely
affected by the parallelism and isolation you can give to each node.
Is there a reason why you cannot have more than 1 shard?
If you have a 5-node cluster, why not have 5 shards? maxShardsPerNode=1 with
replicationFactor=1 is OK. You should see dramatic gains.
Solr’s power and speed come from using it as a distributed system. By sharding
more you will be using the benefit of that distributed capability.
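
For example, something along these lines when creating the collection (host and
names are illustrative):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=5&replicationFactor=1&maxShardsPerNode=1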

HTH

Regards
Aroop

> On Aug 17, 2020, at 11:22 AM, Abhijit Pawar  wrote:
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just a single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the Mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrading to a newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit



Re: SOLR indexing takes longer time

2020-08-17 Thread Shawn Heisey

On 8/17/2020 12:22 PM, Abhijit Pawar wrote:

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just a single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the Mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrading to a newer version help?


There's not enough information here to provide a diagnosis.

Are you running Solr in cloud mode (with zookeeper)?

3.5 hours for 200,000 documents sounds like slowness with the data 
source, not a problem with Solr, but it's too soon to rule anything out.


Would you be able to write a program that pulls data from your Mongo 
database but doesn't send it to Solr?  Ideally it would be a Java 
program using the same JDBC driver you're using with DIH.
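
A minimal sketch of such a timing harness (the JDBC URL, driver, and query are
placeholders; use whatever DIH is configured with):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SourceSpeedTest {
    public static void main(String[] args) throws Exception {
        // Same JDBC URL and query that DIH uses; these values are placeholders.
        String url = "jdbc:mongodb://localhost:27017/mydb";
        String query = "SELECT field1, field2 FROM products";
        long start = System.nanoTime();
        long rows = 0;
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                rows++;   // read and discard; we only measure source throughput
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("Read %d rows in %.1f s (%.0f rows/s)%n", rows, secs, rows / secs);
    }
}

If this alone takes hours, the bottleneck is the source, not Solr.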


Thanks,
Shawn



Re: SOLR indexing takes longer time

2020-08-17 Thread Walter Underwood
I’m seeing multiple red flags for performance here. The top ones are “DIH”,
“MongoDB”, and “SQL on MongoDB”. MongoDB is not a relational database.

Our multi-threaded extractor using the Mongo API was still three times slower
than the same approach on MySQL.

Check the CPU usage on the Solr hosts while you are indexing. If it is under 
50%, the bottleneck is MongoDB and single-threaded indexing.

For another check, run that same query in a regular database client and time it.
The Solr indexing will never be faster than that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 17, 2020, at 11:58 AM, Abhijit Pawar  wrote:
> 
> Sure Divye,
> 
> *Here's the config.*
> 
> *conf/solr-config.xml:*
> 
> <requestHandler name="/dataimport"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
>   </lst>
> </requestHandler>
> 
> 
> *schema.xml:*
> has of all the field definitions
> 
> *conf/dataimport/data-source-config.xml*
> 
> <dataConfig>
>   <dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver"
>       url="mongodb://<<ADDRESS>>:27017/<<DB>>"/>
>   <document>
>     <entity name="<<...>>" dataSource="mongod"
>         transformer="<<...>>,TemplateTransformer"
>         onError="continue"
>         pk="uuid"
>         query="SELECT field1,field2,field3,.. FROM products"
>         deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE
> orgidStr = '${dataimporter.request.orgid}' AND idStr = '${dataimporter.delta.idStr}'"
>         deltaQuery="SELECT idStr FROM products WHERE orgidStr =
> '${dataimporter.request.orgid}' AND updatedAt > '${dataimporter.last_index_time}'">
> 
> .
> .
> . 4-5 more nested entities...
> 
> On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
> wrote:
> 
>> Can you share the DIH configuration you are using for the same?
>> 
>> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>> 
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just a single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the Mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrading to a newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 



Re: SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Sure Divye,

*Here's the config.*

*conf/solr-config.xml:*

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
  </lst>
</requestHandler>

*schema.xml:*
has of all the field definitions

*conf/dataimport/data-source-config.xml*

<dataConfig>
  <dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver"
      url="mongodb://<<ADDRESS>>:27017/<<DB>>"/>
  <document>
    <entity name="<<...>>" dataSource="mongod"
        transformer="<<...>>,TemplateTransformer"
        onError="continue"
        pk="uuid"
        query="SELECT field1,field2,field3,.. FROM products"
        deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE orgidStr = '${dataimporter.request.orgid}' AND idStr = '${dataimporter.delta.idStr}'"
        deltaQuery="SELECT idStr FROM products WHERE orgidStr = '${dataimporter.request.orgid}' AND updatedAt > '${dataimporter.last_index_time}'">
.
.
. 4-5 more nested entities...

On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
wrote:

> Can you share the DIH configuration you are using for the same?
>
> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just a single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the Mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrading to a newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>


Re: SOLR indexing takes longer time

2020-08-17 Thread Jörn Franke
The DIH is single-threaded and deprecated. Your best bet is to have a 
script/program extract the data from MongoDB and write it to Solr in batches 
using multiple threads. You will see significantly higher performance for your 
data.
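
One way to get batching and multiple connections without hand-rolling threads is
SolrJ's ConcurrentUpdateSolrClient; a minimal sketch with a recent SolrJ (URL,
collection, and field names are placeholders, and the loop stands in for a
MongoDB cursor):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        try (ConcurrentUpdateSolrClient solr =
                 new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/products")
                     .withQueueSize(10000)   // documents buffered client-side
                     .withThreadCount(4)     // parallel update streams to Solr
                     .build()) {
            for (int i = 0; i < 200_000; i++) {   // replace with a loop over a MongoDB cursor
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("name_s", "product " + i);
                solr.add(doc);   // the client batches these internally
            }
            solr.commit();   // one commit at the end, as advised elsewhere in this thread
        }
    }
}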

> Am 17.08.2020 um 20:23 schrieb Abhijit Pawar :
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit


Re: SOLR indexing takes longer time

2020-08-17 Thread Divye Handa
Can you share the dih configuration you are using for same?

On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:

> Hello,
>
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
>
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
>
> Appreciate your help!
>
> Regards,
> Abhijit
>


SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit


Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-23 Thread Charlie Hull
If users can upload any PDF, including broken or huge ones, and some 
cause a Tika error, you should decouple Tika from Solr and run it as a 
separate process to extract text before indexing with Solr. Otherwise 
some of what is uploaded *will* break Solr.

https://lucidworks.com/post/indexing-with-solrj/ has some good hints.
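
A minimal sketch of that decoupling (assumes tika-core/tika-parsers and SolrJ on
the classpath; the directory, URL, and field names are illustrative):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class ExtractThenIndex {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File[] files = new File("/data/pdfs").listFiles();
        if (files == null) return;   // directory missing or unreadable
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            for (File f : files) {
                String text;
                try {
                    text = tika.parseToString(f);   // extraction runs outside Solr's JVM
                } catch (Exception e) {             // e.g. ZeroByteFileException: log and skip
                    System.err.println("Skipping " + f + ": " + e);
                    continue;
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f.getAbsolutePath());
                doc.addField("content_txt", text);
                solr.add(doc);
            }
            solr.commit();
        }
    }
}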

Cheers

Charlie

On 11/06/2019 15:27, neilb wrote:

Hi, while going through the Solr logs, I found data import errors for certain
documents. Here are the details about the error.

Exception while processing: file document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to read content Processing Document # 7866
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must
have > 0 bytes
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)


How do I know which document (document name with path) is #7866? And how do I
ignore ZeroByteFileException, given that the document network share is not in my
control? Users can upload PDFs of any size to it.

Thanks!






--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-22 Thread ravi kumar amaravadi
Hi,
I am also facing the same issue. Does anyone have an update/solution on how to fix
this issue as part of DIH?

Thanks.

Regards,
Ravi kumar





Re: Solr indexing performance

2019-12-05 Thread Shawn Heisey

On 12/5/2019 10:42 PM, Paras Lehana wrote:

Can ulimit settings impact this? Review once.


If the OS limits prevent Solr from opening a file or starting a thread, 
it is far more likely that the indexing would fail.  It's not likely 
that such problems would make indexing slow.


Thanks,
Shawn


Re: Solr indexing performance

2019-12-05 Thread Paras Lehana
Can ulimit settings impact this? Review once.

On Thu, 5 Dec 2019 at 23:31, Shawn Heisey  wrote:

> On 12/5/2019 10:28 AM, Rahul Goswami wrote:
> > We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
> > parallel threads with 5000 docs per batch. This is a test setup and all
> > documents are indexed on the same node. We are seeing connection timeout
> > issues after some time into indexing. I am yet to analyze GC pauses
> > and other possibilities, but as a guideline just wanted to know what
> > indexing rate might be "too high" for Solr so as to consider throttling?
> > The documents are mostly metadata with about 25 odd fields, so not very
> > heavy.
> > Would be nice to know a baseline performance expectation for better
> > application design considerations.
>
> It's not really possible to give you a number here.  It depends on a lot
> of things, and every install is going to be different.
>
> On a setup that I once dealt with, where there was only a single thread
> doing the indexing, indexing on each core happened at about 1000 docs
> per second.  I've heard people mention rates beyond 50000 docs per
> second.  I've also heard people talk about rates of indexing far lower
> than what I was seeing.
>
> When you say "connection timeout" issues ... that could mean a couple of
> different things.  It could mean that the connection never gets
> established because it times out while trying, or it could mean that the
> connection gets established, and then times out after that.  Which are
> you seeing?  Usually dealing with that involves changing timeout
> settings on the client application.  Figuring out what's causing the
> delays that lead to the timeouts might be harder.  GC pauses are a
> primary candidate.
>
> There are typically two bottlenecks possible when indexing.  One is that
> the source system cannot supply the documents fast enough.  The other is
> that the Solr server is sitting mostly idle while the indexing program
> waits for an opportunity to send more documents.  The first is not
> something we can help you with.  The second is dealt with by making the
> indexing application multi-threaded or multi-process, or adding more
> threads/processes.
>
> Thanks,
> Shawn
>


-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: Solr indexing performance

2019-12-05 Thread Shawn Heisey

On 12/5/2019 10:28 AM, Rahul Goswami wrote:

We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
parallel threads with 5000 docs per batch. This is a test setup and all
documents are indexed on the same node. We are seeing connection timeout
issues after some time into indexing. I am yet to analyze GC pauses
and other possibilities, but as a guideline just wanted to know what
indexing rate might be "too high" for Solr so as to consider throttling?
The documents are mostly metadata with about 25 odd fields, so not very
heavy.
Would be nice to know a baseline performance expectation for better
application design considerations.


It's not really possible to give you a number here.  It depends on a lot 
of things, and every install is going to be different.


On a setup that I once dealt with, where there was only a single thread 
doing the indexing, indexing on each core happened at about 1000 docs 
per second.  I've heard people mention rates beyond 50000 docs per 
second.  I've also heard people talk about rates of indexing far lower 
than what I was seeing.


When you say "connection timeout" issues ... that could mean a couple of 
different things.  It could mean that the connection never gets 
established because it times out while trying, or it could mean that the 
connection gets established, and then times out after that.  Which are 
you seeing?  Usually dealing with that involves changing timeout 
settings on the client application.  Figuring out what's causing the 
delays that lead to the timeouts might be harder.  GC pauses are a 
primary candidate.


There are typically two bottlenecks possible when indexing.  One is that 
the source system cannot supply the documents fast enough.  The other is 
that the Solr server is sitting mostly idle while the indexing program 
waits for an opportunity to send more documents.  The first is not 
something we can help you with.  The second is dealt with by making the 
indexing application multi-threaded or multi-process, or adding more 
threads/processes.


Thanks,
Shawn


Re: Solr indexing performance

2019-12-05 Thread Vincenzo D'Amore
Hi, are the clients reusing their SolrClient?
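
The usual pattern is to build one client at startup and share it; SolrJ clients
are thread-safe. A sketch, with the URL illustrative:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SolrHolder {
    // Created once at startup and shared by all threads.
    public static final SolrClient SOLR =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
}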

Ciao,
Vincenzo

--
mobile: 3498513251
skype: free.dev

> On 5 Dec 2019, at 18:28, Rahul Goswami  wrote:
> 
> Hello,
> 
> We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
> parallel threads with 5000 docs per batch. This is a test setup and all
> documents are indexed on the same node. We are seeing connection timeout
> issues after some time into indexing. I am yet to analyze GC pauses
> and other possibilities, but as a guideline just wanted to know what
> indexing rate might be "too high" for Solr so as to consider throttling?
> The documents are mostly metadata with about 25 odd fields, so not very
> heavy.
> Would be nice to know a baseline performance expectation for better
> application design considerations.
> 
> Thanks,
> Rahul


Solr indexing performance

2019-12-05 Thread Rahul Goswami
Hello,

We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
parallel threads with 5000 docs per batch. This is a test setup and all
documents are indexed on the same node. We are seeing connection timeout
issues after some time into indexing. I am yet to analyze GC pauses
and other possibilities, but as a guideline just wanted to know what
indexing rate might be "too high" for Solr so as to consider throttling?
The documents are mostly metadata with about 25 odd fields, so not very
heavy.
Would be nice to know a baseline performance expectation for better
application design considerations.

Thanks,
Rahul


Re: Solr indexing for unstructured data

2019-08-22 Thread Alexandre Rafalovitch
In the Admin UI, there is a schema browsing screen:
https://lucene.apache.org/solr/guide/8_1/schema-browser-screen.html
That shows you all the fields you have, their configuration and their
(tokenized) indexed content.
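(That screen is backed by the Luke request handler, so the same information is
also available over HTTP, e.g.
http://localhost:8983/solr/mycollection/admin/luke?show=schema
with the host and collection name illustrative.)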

This seems to be a good midpoint between indexing and querying. So, I
would check whether the field you expect (and the fields you did not
expect) are there. If they are, focus on querying. If they are not,
focus on indexing.

This is generic advice, because the question is not really clear.
Specifically:
1) "PDF parsed as text" "and I index that file" - what does that file
look like (content type)
2) "I index with bin/post" "I am able to retrieve results"  vs "I use
bin/post above" "it does not return fields in query". I can't tell the
difference between those two sequences, if you are indexing the same
file with the same command, you should get the same results.

Hope that helps.

Regards,
   Alex.

On Thu, 22 Aug 2019 at 09:44, amrit pattnaik  wrote:
>
> Hi,
> I am a newbie in Solr. I have a scenario wherein PDF documents with
> unstructured data have been parsed as text and kept in a separate directory.
> 
> Now once I build a collection and do indexing using "bin/post -c collection
> name document name", the document gets indexed and I am able to retrieve
> the result. But it is in schemaless mode, so I add fields to the managed-schema
> of the collection.
> 
> If I use the bin/post command mentioned above, it does not return the added
> schema fields in the query result. So I tried indexing using a curl command
> wherein I explicitly mention the field names and values in the document sent for
> indexing. The required fields show up in the query result, but if I do a
> keyword-based search, the documents added through the curl command don't show up.
>
> Would appreciate pointers/ help as I have been stuck on this issue for long.
>
> Regards,
> Amrit
>
> --
> With Regards,
>
> Amrit Pattnaik


Solr indexing for unstructured data

2019-08-22 Thread amrit pattnaik
Hi,
I am a newbie in Solr. I have a scenario wherein PDF documents with
unstructured data have been parsed as text and kept in a separate directory.

Now once I build a collection and do indexing using "bin/post -c collection
name document name", the document gets indexed and I am able to retrieve
the result. But it is in schemaless mode, so I add fields to the managed-schema
of the collection.

If I use the bin/post command mentioned above, it does not return the added
schema fields in the query result. So I tried indexing using a curl command
wherein I explicitly mention the field names and values in the document sent for
indexing. The required fields show up in the query result, but if I do a
keyword-based search, the documents added through the curl command don't show up.

Would appreciate pointers/ help as I have been stuck on this issue for long.

Regards,
Amrit

-- 
With Regards,

Amrit Pattnaik


Solr indexing with Tika DIH - ZeroByteFileException

2019-06-11 Thread neilb
Hi, while going through the Solr logs, I found data import errors for certain
documents. Here are the details about the error.

Exception while processing: file document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to read content Processing Document # 7866
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must
have > 0 bytes
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)


How do I know which document (document name with path) is #7866? And how do I
ignore ZeroByteFileException, given that the document network share is not in my
control? Users can upload PDFs of any size to it.

Thanks!





Re: Solr indexing with Tika DIH local vs network share

2019-04-04 Thread neilb
Thank you Erick, this is very helpful!





Re: Solr indexing with Tika DIH local vs network share

2019-03-29 Thread Erick Erickson
So just try adding the autoCommit and autoSoftCommit settings. All of the 
example configs have these entries and you can copy/paste/change.
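
For instance, something along these lines inside the <updateHandler> section of
solrconfig.xml (the intervals are illustrative):

<autoCommit>
  <maxTime>15000</maxTime>          <!-- hard commit every 15 s; flushes to disk -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>         <!-- soft commit every 5 min; makes new docs searchable -->
</autoSoftCommit>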

> On Mar 29, 2019, at 10:35 AM, neilb  wrote:
> 
> Hi Erick, I am using solrconfig.xml from the samples only, and it has very few
> entries. I have attached my config files for review along with this reply.
> 
> Thanks
> solrconfig.xml
> tika-data-config.xml
> managed-schema



Re: Solr indexing with Tika DIH local vs network share

2019-03-29 Thread neilb
Hi Erick, I am using solrconfig.xml from the samples only, and it has very few
entries. I have attached my config files for review along with this reply.

Thanks
solrconfig.xml
tika-data-config.xml
managed-schema


Re: Solr indexing with Tika DIH local vs network share

2019-03-29 Thread Erick Erickson
I suspect that your autocommit settings in solrconfig.xml 
are something like:

hard commit: has openSearcher set to “false”
soft commit: has the interval set to -1 (never)

That means that until an external commit is executed, you won’t see any 
documents. Try setting your soft commit to something like, say, 5 minutes (or 
even one minute). That would reduce the interval before docs become searchable.

I think DIH issues a commit at the end of the run, so that would be why you 
didn’t see anything for so long if I’m right.

Here’s more than you want to know about all this: 
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

I _still_ recommend you move the Tika processing off of Solr. 4G of memory is 
easily exceeded with the right (well, wrong) PDF document. And since Tika is 
running inside Solr, that’ll mean Solr has an OOM and at that point you really 
don’t know the state of Solr and must restart. Running Tika in a different 
process will insulate Solr from this kind of thing.

Best,
Erick


> On Mar 29, 2019, at 8:36 AM, neilb  wrote:
> 
> Hi Erick, thanks a lot for your suggestions. I will look into it. But to
> answer my own query: I was a little impatient, checking the indexing status
> every minute. What I found is that after a few hours the status started updating
> with the document count, and the indexing process finished in around 5 hrs.
> Do you see anything wrong with the current setup of Solr and Tika DIH? All I am
> looking for is PDF full-text search results, integrated into a web app
> dashboard using AJAX queries. Also, this particular article was helpful
> to get Solr running as a Windows service with a 4G memory configuration under
> the localsystem account.
> 
> Thanks again!
> 
> 
> 



Re: Solr indexing with Tika DIH local vs network share

2019-03-29 Thread neilb
Hi Erick, thanks a lot for your suggestions. I will look into it. But to
answer my own query: I was a little impatient, checking the indexing status
every minute. What I found is that after a few hours the status started updating
with the document count, and the indexing process finished in around 5 hrs.
Do you see anything wrong with the current setup of Solr and Tika DIH? All I am
looking for is PDF full-text search results, integrated into a web app
dashboard using AJAX queries. Also, this particular article was helpful
to get Solr running as a Windows service with a 4G memory configuration under
the localsystem account.

Thanks again!





Re: Solr indexing with Tika DIH local vs network share

2019-03-26 Thread Erick Erickson
Not quite an answer to your specific question, but… There
are a number of reasons why it’s better to run your Tika
process outside of Solr and DIH. Here’s the long form:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Ignore the RDBMS parts. It’s somewhat old, but should be
adaptable easily.

Best,
Erick

> On Mar 26, 2019, at 8:27 AM, neilb  wrote:
> 
> Hi, I am trying to set up Solr for our project, which can return full-text
> searches on PDF documents. I am able to run the sample Tika DIH example
> locally on my windows server machine. It can index all PDF documents
> recursively in "baseDir" of config xml. Presently "baseDir" points to local
> folder on the same machine and has around 10K pdf files. This whole setup
> works as expected.
> 
> Next step is to import PDF documents located on network share. I created
> another core, with very similar configuration files except this time,
> baseDir points to network share ("\\myserver\pdfshare"). I have no success
> in indexing these documents on newly created core. I have tried mapping this
> network share to local drive and updated config accordingly but still no
> success. 
> I managed to copy all pdf file from network share to local folder where
> example core with sample Tika DIH points and I am able to index all pdf
> files. 
> 
> So I am not sure why the Tika config with the network path is not able to index the
> files. Looking into the log I can see the following entries, but they don't explain
> much. Can someone help resolve the issue?
> 
> 2019-03-26 13:58:37.250 DEBUG (Scheduler-1147580192) [   ]
> o.e.j.i.FillInterest onFail
> FillInterest@419eacc8{AC.ReadCB@1ad637ed{HttpConnection@1ad637ed::SocketChannelEndPoint@6190d407{/10.206.11.68:51486<->/10.205.53.163:8983,OPEN,fill=FI,flush=-,to=120010/12}{io=1/1,kio=1,kro=1}->HttpConnection@1ad637ed[p=HttpParser{s=START,0
> of
> -1},g=HttpGenerator@7d81e85c{s=START}]=>HttpChannelOverHttp@10e588cc{r=2,c=false,a=IDLE,uri=null,age=0}}}
> java.util.concurrent.TimeoutException: Idle timeout expired: 120010/120000
> ms
>   at 
> org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
> [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
>   at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
> [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
>   at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> [?:1.8.0_201]
>   at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_201]
>   at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown
> Source) [?:1.8.0_201]
>   at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
> Source) [?:1.8.0_201]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> [?:1.8.0_201]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> [?:1.8.0_201]
>   at java.lang.Thread.run(Unknown Source) [?:1.8.0_201]
> 
> 
> Is it possible that Solr is not able to access the network share? Is there
> any way that I can run Solr.cmd under a different user (who has access to the
> network share) in a Windows environment?
> Please let me know if you wish to know any more details about the issue.
> 
> 
> Thanks in advance
> 
> 
> 
> 



Solr indexing with Tika DIH local vs network share

2019-03-26 Thread neilb
Hi, I am trying to set up Solr for our project, which can return full-text
searches on PDF documents. I am able to run the sample Tika DIH example
locally on my windows server machine. It can index all PDF documents
recursively in "baseDir" of config xml. Presently "baseDir" points to local
folder on the same machine and has around 10K pdf files. This whole setup
works as expected.

Next step is to import PDF documents located on network share. I created
another core, with very similar configuration files except this time,
baseDir points to network share ("\\myserver\pdfshare"). I have no success
in indexing these documents on newly created core. I have tried mapping this
network share to local drive and updated config accordingly but still no
success. 
I managed to copy all pdf file from network share to local folder where
example core with sample Tika DIH points and I am able to index all pdf
files. 

So I am not sure why the Tika config with the network path is not able to index the
files. Looking into the log I can see the following entries, but they don't explain
much. Can someone help resolve the issue?

2019-03-26 13:58:37.250 DEBUG (Scheduler-1147580192) [   ]
o.e.j.i.FillInterest onFail
FillInterest@419eacc8{AC.ReadCB@1ad637ed{HttpConnection@1ad637ed::SocketChannelEndPoint@6190d407{/10.206.11.68:51486<->/10.205.53.163:8983,OPEN,fill=FI,flush=-,to=120010/12}{io=1/1,kio=1,kro=1}->HttpConnection@1ad637ed[p=HttpParser{s=START,0
of
-1},g=HttpGenerator@7d81e85c{s=START}]=>HttpChannelOverHttp@10e588cc{r=2,c=false,a=IDLE,uri=null,age=0}}}
java.util.concurrent.TimeoutException: Idle timeout expired: 120010/120000
ms
at 
org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
[jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
[jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
[?:1.8.0_201]
at java.util.concurrent.FutureTask.run(Unknown Source) [?:1.8.0_201]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown
Source) [?:1.8.0_201]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
Source) [?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[?:1.8.0_201]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_201]


Is it possible that Solr is not able to access the network share? Is there
any way that I can run Solr.cmd under a different user (who has access to the
network share) in a Windows environment?
Please let me know if you wish to know any more details about the issue.


Thanks in advance






Re: Docker and Solr Indexing

2019-02-12 Thread solrnoobie
Oh OK, then that must not be the culprit.

I got this logs from our application server but I'm not sure if this is
useful:

Caused by: org.apache.solr.client.solrj.SolrServerException:
org.apache.http.ParseException: Invalid content type: 
at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:497)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at
org.springframework.data.solr.core.SolrTemplate$7.doInSolr(SolrTemplate.java:223)
at
org.springframework.data.solr.core.SolrTemplate$7.doInSolr(SolrTemplate.java:220)
at
org.springframework.data.solr.core.SolrTemplate.execute(SolrTemplate.java:132)
... 12 more
Caused by: org.apache.http.ParseException: Invalid content type: 
at org.apache.http.entity.ContentType.parse(ContentType.java:233)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:496)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:483)





Re: Docker and Solr Indexing

2019-02-12 Thread Shawn Heisey

On 2/12/2019 6:56 AM, solrnoobie wrote:

I know this is too late of a reply but I found this on our solr.log

java.nio.file.NoSuchFileException:


USUALLY, this is a harmless annoyance, not an indication of an actual 
problem.


Some people have indicated that it causes problems when using the backup 
facility.  That's not a part of Solr that I have spent any time with, so 
I cannot confirm.


See this issue:

https://issues.apache.org/jira/browse/SOLR-9120

There was a fix committed for that issue, first available in version 7.2.0.

Thanks,
Shawn


Re: Docker and Solr Indexing

2019-02-12 Thread solrnoobie
I know this is too late of a reply but I found this on our solr.log

java.nio.file.NoSuchFileException:
/opt/solr/server/solr/primaryCollectionPERF_shard1_replica9/data/index/segments_78
at
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
at
java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at
java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
at
java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.base/java.nio.file.Files.readAttributes(Files.java:1763)
at java.base/java.nio.file.Files.size(Files.java:2380)
at
org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
at
org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128)
at
org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
at
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
at
org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody(LukeRequestHandler.java:138)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.base/java.lang.Thread.run(Thread.java:834)







Solr indexing raises error while posting PDF

2019-01-23 Thread sonam mittal
I am using Solr 6.6.4 on Ubuntu 16. I have created a
collection in Solr using the configuration files of the Solr example
*techproducts*. I am trying to post a PDF to Solr but it is raising some
errors. I have also installed Apache Tika through Maven, but it is still
showing the following error.

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/ifarm_tech/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file Types.pdf (application/pdf) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #500 (Server Error)
for url: 
http://localhost:8983/solr/ifarm_tech/update/extract?resource.name=%2Fhome%2Fubuntu%2Fpdf_cancer%2FTypes.pdf&literal.id=%2Fhome%2Fubuntu%2Fpdf_cancer%2FTypes.pdf
SimplePostTool: WARNING: Response: 


Error 500 Server Error

HTTP ERROR 500
Problem accessing /solr/ifarm_tech/update/extract. Reason:
    Server Error

Caused by: java.lang.NoClassDefFoundError: Could not initialize
class org.apache.pdfbox.pdmodel.PDDocument
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:149)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)




SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response c

Re: Making Solr Indexing Errors Visible

2018-09-30 Thread Jason Gerlowski
Hi

Also worth mentioning that bin/post only handles certain file
extensions, and AFAIR it doesn't mention specifically when it skips
over a file because of the extension. You mentioned you're trying to
index Word docs and pdf's.  Are there any other formats in the
directory that might be messing up your counts?

I also second Shawn's suggestion that you post the "bin/post" output
and a directory listing.  Additionally, if you're able to clean up the
output a bit, you might be able to diff the two lists of files and see
if the ones missing have anything particular in common.

Good luck,

Jason
On Thu, Sep 27, 2018 at 9:58 AM Shawn Heisey  wrote:
>
> On 9/26/2018 2:39 PM, Terry Steichen wrote:
> > Let me try to clarify a bit - I'm just using bin/post to index the files
> > in a directory.  That indexing process produces a lengthy screen display
> > of files that were indexed.  (I realize this isn't production-quality,
> > but I'm not ready for production just yet, so that should be OK.)
>
> I see a previous message on the list from you indicating solr 6.6.0.
> FYI there are five bugfix releases after 6.6.0 -- the latest 6.x release
> is 6.6.5.  I don't see any fixes related to the post tool, but maybe one
> of the problems that did get fixed might help your server behave better.
>
> Switching my source checkout to the 6.6.0 tag and checking that version...
>
> Each time a file is sent, you should get a log line starting with
> "POSTing file".
>
> The error detection in SimplePostTool has a bunch of parts.  It seems
> that *most* errors will abort the tool entirely, skipping any files that
> have not yet been processed, and logging a message with "FATAL" included.
>
> Can you show us a directory listing and all the output that you get from
> bin/post when processing that directory?
>
> Thanks,
> Shawn
>


Re: Making Solr Indexing Errors Visible

2018-09-27 Thread Shawn Heisey

On 9/26/2018 2:39 PM, Terry Steichen wrote:

Let me try to clarify a bit - I'm just using bin/post to index the files
in a directory.  That indexing process produces a lengthy screen display
of files that were indexed.  (I realize this isn't production-quality,
but I'm not ready for production just yet, so that should be OK.)


I see a previous message on the list from you indicating solr 6.6.0.  
FYI there are five bugfix releases after 6.6.0 -- the latest 6.x release 
is 6.6.5.  I don't see any fixes related to the post tool, but maybe one 
of the problems that did get fixed might help your server behave better.


Switching my source checkout to the 6.6.0 tag and checking that version...

Each time a file is sent, you should get a log line starting with 
"POSTing file".


The error detection in SimplePostTool has a bunch of parts.  It seems 
that *most* errors will abort the tool entirely, skipping any files that 
have not yet been processed, and logging a message with "FATAL" included.


Can you show us a directory listing and all the output that you get from 
bin/post when processing that directory?


Thanks,
Shawn



Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Shawn Heisey

On 9/26/2018 2:39 PM, Terry Steichen wrote:

To the best of my knowledge, I'm not using SolrJ at all.  Just
Solr-out-of-the-box.  In this case, if I understand you below, it
"should indicate an error status"


I think you'd know if you were using SolrJ directly.  You'd have written 
the indexing program, or whoever DID write it would likely indicate that 
they used SolrJ to talk to Solr.  I was surprised to learn that 
SimplePostTool does NOT use SolrJ ... it uses the HTTP capability built 
into Java.



Let me try to clarify a bit - I'm just using bin/post to index the files
in a directory.  That indexing process produces a lengthy screen display
of files that were indexed.  (I realize this isn't production-quality,
but I'm not ready for production just yet, so that should be OK.)


If you check your index, are you missing files that bin/post said were 
indexed?  Have you looked in that kind of detail?


The post tool should indicate that an error occurred, and if there was 
any text in the response about the error, it should be displayed.  I was 
looking at the 7.4 code branch.  I didn't see anything about which Solr 
version you're running.


I have not spent any real time using bin/post.  It was part of a class 
that I attended as part of Lucene Revolution in 2010, but I do not 
recall what the output was.  It was all pre-designed and tested so it 
was known to work before I received it.  No errors occurred when I ran 
the script included with the class materials.



But no errors are shown (even though there have to be because the totals
indexed is less than the directory totals).

Are you saying I can't use post (to verify correct indexing), but that I
have to write custom software to accomplish that?


If you want errors detected programmatically, you'll need to write the 
indexing program.  The simple post tool won't report errors to anything 
that calls it, it will just log them.



And that there's no solr variable I can define that will do a kind of
"verbose" to show that?


If Solr returned errors during the indexing, then they will show up in 
the solr.log file, or possibly one of the rotated versions of that 
logfile.  You can also see them in the admin UI Logging tab if Solr 
hasn't been restarted, but the logfile is generally a better way to find 
them.  If you're not seeing errors there, then maybe something went 
wrong with bin/post.


I notice in a later message you indicate that you're indexing PDF and 
DOC files.  When those kinds of files are sent with bin/post, they will 
normally end up in the Extracting Request Handler, also known as SolrCell.


It is highly recommended that the Extracting Request Handler never be 
used in production.  That software embeds Tika inside Solr.  Tika is 
known to explode spectacularly when it gets a file it doesn't know how 
to handle.  PDF files in particular seem to trigger this behavior, but 
other formats can cause it as well.  If Tika is running inside Solr when 
that happens, Solr will also explode, and then you no longer have a 
search engine on that machine.  A better option is to include Tika in an 
indexing program that you write, so if it explodes, Solr stays running.


Thanks,
Shawn
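
As a rough sketch of that approach (assuming the tika-core and tika-parsers
jars are on the classpath; the class name and file-path argument are just
illustrative), extraction can run in its own process so a Tika crash cannot
take Solr down with it:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractOutsideSolr {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no output limit
    Metadata metadata = new Metadata();
    try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
      parser.parse(stream, handler, metadata);
    }
    // handler.toString() holds the extracted plain text; send it to Solr in a
    // normal update request instead of posting the raw file to /update/extract.
    System.out.println(handler.toString());
  }
}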



Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Terry Steichen
Alex,

Please look at my embedded responses to your questions.

Terry


On 09/26/2018 04:57 PM, Alexandre Rafalovitch wrote:
> The challenge here is to figure out exactly what you are doing,
> because the original description could have been 10 different things.
>
> So:
> 1) You are using bin/post command (we just found this out)
No, I said that at the outset.  And repeated it.
> 2) You are indexing a bunch of files (what format? all same or different?)
I also said I was indexing a mixture of pdf and doc files
> 3) You are indexing them into a Schema supposedly ready for those
> files (which one?)
I'm using the managed-schema, the data-driven approach
> 4) You think some of them are not in in Solr (how do you know that?
> how do you know that some are? why do you not know _which_ of the
> files are not indexed?)
I thought I made it very clear (twice) that I find that the list of
indexed files is 10% fewer than those in the directory holding the files
being indexed.  And I said that I don't know which are not getting
indexed because I am not getting error messages.
> 5) You are asking whether the error message should have told you if
> there is a problem with indexing (normally yes, but maybe there are
> some edge cases).
That's my question - why am I not getting error messages.  That's the
whole point of my query to the list.
>
> I've put the questions in brackets. I would focus on looking at
> questions in 4) first as they roughly bisect the problem. But other
> things are important too.
>
> I hope this helps,
> Alex.
>
>
> On 26 September 2018 at 16:39, Terry Steichen  wrote:
>> Shawn,
>>
>> To the best of my knowledge, I'm not using SolrJ at all.  Just
>> Solr-out-of-the-box.  In this case, if I understand you below, it
>> "should indicate an error status"
>>
>> But it doesn't.
>>
>> Let me try to clarify a bit - I'm just using bin/post to index the files
>> in a directory.  That indexing process produces a lengthy screen display
>> of files that were indexed.  (I realize this isn't production-quality,
>> but I'm not ready for production just yet, so that should be OK.)
>>
>> But no errors are shown (even though there have to be because the totals
>> indexed is less than the directory totals).
>>
>> Are you saying I can't use post (to verify correct indexing), but that I
>> have to write custom software to accomplish that?
>>
>> And that there's no solr variable I can define that will do a kind of
>> "verbose" to show that?
>>
>> And that such errors will not show up in any of solr's log files?
>>
>> Hard to believe (but what is, is, I guess).
>>
>> Terry
>>
>> On 09/26/2018 03:49 PM, Shawn Heisey wrote:
>>> On 9/26/2018 1:23 PM, Terry Steichen wrote:
 I'm pretty sure this was covered earlier.  But I can't find references
 to it.  The question is how to make indexing errors clear and obvious.
>>> If there's an indexing error and you're NOT using the concurrent
>>> client in SolrJ, the response that Solr returns should indicate an
>>> error status.  ConcurrentUpdateSolrClient gets those errors and
>>> swallows them so the calling program never knows they occurred.
>>>
 (I find that there are maybe 10% more files in a directory than end up
 in the index.  I presume they were indexing errors, but I have no idea
 which ones or what might have caused the error.)  As I recall, Solr's
 post tool doesn't give any errors when indexing.  I (vaguely) recall
 that there's a way (through the logs?) to overcome this and show the
 errors.  Or maybe it's that you have to do the indexing outside of Solr?
>>> The simple post tool is not really meant for production use.  It is a
>>> simple tool for interactive testing.
>>>
>>> I don't see anything in SimplePostTool for changing the program's exit
>>> status when an error is encountered during program operation.  If an
>>> error is encountered during the upload, a message would be logged to
>>> stderr, but you wouldn't be able to rely on the program's exit status
>>> to indicate an error.  To get that, you will need to write the
>>> indexing software.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>



Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Alexandre Rafalovitch
The challenge here is to figure out exactly what you are doing,
because the original description could have been 10 different things.

So:
1) You are using bin/post command (we just found this out)
2) You are indexing a bunch of files (what format? all same or different?)
3) You are indexing them into a Schema supposedly ready for those
files (which one?)
4) You think some of them are not in in Solr (how do you know that?
how do you know that some are? why do you not know _which_ of the
files are not indexed?)
5) You are asking whether the error message should have told you if
there is a problem with indexing (normally yes, but maybe there are
some edge cases).

I've put the questions in brackets. I would focus on looking at
questions in 4) first as they roughly bisect the problem. But other
things are important too.

I hope this helps,
Alex.


On 26 September 2018 at 16:39, Terry Steichen  wrote:
> Shawn,
>
> To the best of my knowledge, I'm not using SolrJ at all.  Just
> Solr-out-of-the-box.  In this case, if I understand you below, it
> "should indicate an error status"
>
> But it doesn't.
>
> Let me try to clarify a bit - I'm just using bin/post to index the files
> in a directory.  That indexing process produces a lengthy screen display
> of files that were indexed.  (I realize this isn't production-quality,
> but I'm not ready for production just yet, so that should be OK.)
>
> But no errors are shown (even though there have to be because the totals
> indexed is less than the directory totals).
>
> Are you saying I can't use post (to verify correct indexing), but that I
> have to write custom software to accomplish that?
>
> And that there's no solr variable I can define that will do a kind of
> "verbose" to show that?
>
> And that such errors will not show up in any of solr's log files?
>
> Hard to believe (but what is, is, I guess).
>
> Terry
>
> On 09/26/2018 03:49 PM, Shawn Heisey wrote:
>> On 9/26/2018 1:23 PM, Terry Steichen wrote:
>>> I'm pretty sure this was covered earlier.  But I can't find references
>>> to it.  The question is how to make indexing errors clear and obvious.
>>
>> If there's an indexing error and you're NOT using the concurrent
>> client in SolrJ, the response that Solr returns should indicate an
>> error status.  ConcurrentUpdateSolrClient gets those errors and
>> swallows them so the calling program never knows they occurred.
>>
>>> (I find that there are maybe 10% more files in a directory than end up
>>> in the index.  I presume they were indexing errors, but I have no idea
>>> which ones or what might have caused the error.)  As I recall, Solr's
>>> post tool doesn't give any errors when indexing.  I (vaguely) recall
>>> that there's a way (through the logs?) to overcome this and show the
>>> errors.  Or maybe it's that you have to do the indexing outside of Solr?
>>
>> The simple post tool is not really meant for production use.  It is a
>> simple tool for interactive testing.
>>
>> I don't see anything in SimplePostTool for changing the program's exit
>> status when an error is encountered during program operation.  If an
>> error is encountered during the upload, a message would be logged to
>> stderr, but you wouldn't be able to rely on the program's exit status
>> to indicate an error.  To get that, you will need to write the
>> indexing software.
>>
>> Thanks,
>> Shawn
>>
>>
>


Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Terry Steichen
Shawn,

To the best of my knowledge, I'm not using SolrJ at all.  Just
Solr-out-of-the-box.  In this case, if I understand you below, it
"should indicate an error status" 

But it doesn't.

Let me try to clarify a bit - I'm just using bin/post to index the files
in a directory.  That indexing process produces a lengthy screen display
of files that were indexed.  (I realize this isn't production-quality,
but I'm not ready for production just yet, so that should be OK.)

But no errors are shown (even though there have to be because the totals
indexed is less than the directory totals).

Are you saying I can't use post (to verify correct indexing), but that I
have to write custom software to accomplish that? 

And that there's no solr variable I can define that will do a kind of
"verbose" to show that?

And that such errors will not show up in any of solr's log files?

Hard to believe (but what is, is, I guess).

Terry

On 09/26/2018 03:49 PM, Shawn Heisey wrote:
> On 9/26/2018 1:23 PM, Terry Steichen wrote:
>> I'm pretty sure this was covered earlier.  But I can't find references
>> to it.  The question is how to make indexing errors clear and obvious.
>
> If there's an indexing error and you're NOT using the concurrent
> client in SolrJ, the response that Solr returns should indicate an
> error status.  ConcurrentUpdateSolrClient gets those errors and
> swallows them so the calling program never knows they occurred.
>
>> (I find that there are maybe 10% more files in a directory than end up
>> in the index.  I presume they were indexing errors, but I have no idea
>> which ones or what might have caused the error.)  As I recall, Solr's
>> post tool doesn't give any errors when indexing.  I (vaguely) recall
>> that there's a way (through the logs?) to overcome this and show the
>> errors.  Or maybe it's that you have to do the indexing outside of Solr?
>
> The simple post tool is not really meant for production use.  It is a
> simple tool for interactive testing.
>
> I don't see anything in SimplePostTool for changing the program's exit
> status when an error is encountered during program operation.  If an
> error is encountered during the upload, a message would be logged to
> stderr, but you wouldn't be able to rely on the program's exit status
> to indicate an error.  To get that, you will need to write the
> indexing software.
>
> Thanks,
> Shawn
>
>



Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Shawn Heisey

On 9/26/2018 1:23 PM, Terry Steichen wrote:

I'm pretty sure this was covered earlier.  But I can't find references
to it.  The question is how to make indexing errors clear and obvious.


If there's an indexing error and you're NOT using the concurrent client 
in SolrJ, the response that Solr returns should indicate an error 
status.  ConcurrentUpdateSolrClient gets those errors and swallows them 
so the calling program never knows they occurred.



(I find that there are maybe 10% more files in a directory than end up
in the index.  I presume they were indexing errors, but I have no idea
which ones or what might have caused the error.)  As I recall, Solr's
post tool doesn't give any errors when indexing.  I (vaguely) recall
that there's a way (through the logs?) to overcome this and show the
errors.  Or maybe it's that you have to do the indexing outside of Solr?


The simple post tool is not really meant for production use.  It is a 
simple tool for interactive testing.


I don't see anything in SimplePostTool for changing the program's exit 
status when an error is encountered during program operation.  If an 
error is encountered during the upload, a message would be logged to 
stderr, but you wouldn't be able to rely on the program's exit status to 
indicate an error.  To get that, you will need to write the indexing 
software.


Thanks,
Shawn
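
A minimal SolrJ sketch of what such an indexing program could look like (the
URL, collection name and field names below are placeholders, not taken from
this thread):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithErrorChecking {
  public static void main(String[] args) {
    try (SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("title_s", "example document");
      client.add(doc);   // throws on a Solr-side error instead of only logging it
      client.commit();
    } catch (Exception e) {
      // Unlike SimplePostTool, the program can react: record which document
      // failed, retry it, or exit nonzero so a calling script notices.
      System.err.println("Indexing failed: " + e.getMessage());
      System.exit(1);
    }
  }
}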



Making Solr Indexing Errors Visible

2018-09-26 Thread Terry Steichen
I'm pretty sure this was covered earlier.  But I can't find references
to it.  The question is how to make indexing errors clear and obvious. 
(I find that there are maybe 10% more files in a directory than end up
in the index.  I presume they were indexing errors, but I have no idea
which ones or what might have caused the error.)  As I recall, Solr's
post tool doesn't give any errors when indexing.  I (vaguely) recall
that there's a way (through the logs?) to overcome this and show the
errors.  Or maybe it's that you have to do the indexing outside of Solr?

Terry Steichen


Re: Docker and Solr Indexing

2018-09-12 Thread Shawn Heisey

On 9/12/2018 7:43 AM, Dominique Bejean wrote:

Are you aware of issues with Java applications in Docker when the Java version
is not 10?
https://blog.docker.com/2018/04/improved-docker-container-integration-with-java-10/


Solr explicitly sets heap size when it starts, so Java is *NOT* 
determining the heap size automatically.


As for CPUs, if the container isn't sized appropriately, then I guess 
you might have an issue there.


The latest version of Solr should start and run just fine in Java 10.  
Some earlier versions of 7.x have problems starting in Java 10, but 
should *run* fine after the script is fixed to detect the version 
correctly.  Solr 6.x is not qualified for Java 9, and therefore not 
qualified for Java 10.


Thanks,
Shawn



Re: Docker and Solr Indexing

2018-09-12 Thread Dominique Bejean
Hi,

Are you aware of issues with Java applications in Docker when the Java version
is not 10?
https://blog.docker.com/2018/04/improved-docker-container-integration-with-java-10/

Regards.

Dominique


On Wed, 12 Sep 2018 at 05:42, Shawn Heisey  wrote:

> On 9/11/2018 9:20 PM, solrnoobie wrote:
> > So what we did is we upgraded the instances to 16 gigs and we rarely
> > encounter this now.
> >
> > So what we did was to increase the batch size to 500 instead of 50 and it
> > worked for our test data. But when we tried 1000 batch size, the invalid
> > content type error returned. Can you guys shed some light on why this is
> > happening? I don't think that a thousand per batch is too much (although we
> > have documents with many fields and child documents) so I am not really sure
> > what's causing this aside from a Docker container restart.
>
> At no point in this thread have you shared the actual error messages.
> Without those and the exact version of Solr, it's difficult to help
> you.  Saying that you got a "content type error" doesn't mean anything.
> We need to see the actual error, complete with all stacktrace data.  The
> best information will be found in the logfile -- solr.log.
>
> Solr (as packaged by this project) is not designed to restart itself
> automatically.  If the JVM encounters an OutOfMemoryError exception and
> the platform is NOT Windows, then Solr is designed to kill itself ...
> but it will NOT automatically restart without outside intervention or a
> change to its startup scripts.  This is done because program operation
> is completely unpredictable when OOME hits, so the best course of action
> is to self-terminate and let the admin fix the problem that caused the OOME.
>
> The publicly available Solr docker container is NOT an official product
> of this project.  It is third-party, so problems specific to the docker
> container may need to be handled by the project that created it.  If the
> docker container is set up to automatically restart Solr when it dies, I
> would consider that to be a bug. About the only reason that Solr will
> ever die is the OOME self-termination that I already described ... and
> since the OOME is likely to occur again after restart, it's usually
> better for the software to stay offline until the admin fixes the problem.
>
> Thanks,
> Shawn
>
>


Re: Docker and Solr Indexing

2018-09-11 Thread Shawn Heisey

On 9/11/2018 9:20 PM, solrnoobie wrote:

So what we did is we upgraded the instances to 16 gigs and we rarely
encounter this now.

So what we did was to increase the batch size to 500 instead of 50 and it
worked for our test data. But when we tried 1000 batch size, the invalid
content type error returned. Can you guys shed some light on why this is
happening? I don't think that a thousand per batch is too much (although we
have documents with many fields and child documents) so I am not really sure
what's causing this aside from a Docker container restart.


At no point in this thread have you shared the actual error messages.  
Without those and the exact version of Solr, it's difficult to help 
you.  Saying that you got a "content type error" doesn't mean anything.  
We need to see the actual error, complete with all stacktrace data.  The 
best information will be found in the logfile -- solr.log.


Solr (as packaged by this project) is not designed to restart itself 
automatically.  If the JVM encounters an OutOfMemoryError exception and 
the platform is NOT Windows, then Solr is designed to kill itself ... 
but it will NOT automatically restart without outside intervention or a 
change to its startup scripts.  This is done because program operation 
is completely unpredictable when OOME hits, so the best course of action 
is to self-terminate and let the admin fix the problem that caused the OOME.


The publicly available Solr docker container is NOT an official product 
of this project.  It is third-party, so problems specific to the docker 
container may need to be handled by the project that created it.  If the 
docker container is set up to automatically restart Solr when it dies, I 
would consider that to be a bug. About the only reason that Solr will 
ever die is the OOME self-termination that I already described ... and 
since the OOME is likely to occur again after restart, it's usually 
better for the software to stay offline until the admin fixes the problem.


Thanks,
Shawn



Re: Docker and Solr Indexing

2018-09-11 Thread solrnoobie
Thank you all for the kind and timely reply.

So what we did is we upgraded the instances to 16 gigs and we rarely
encounter this now.

So what we did was to increase the batch size to 500 instead of 50 and it
worked for our test data. But when we tried 1000 batch size, the invalid
content type error returned. Can you guys shed some light on why this is
happening? I don't think that a thousand per batch is too much (although we
have documents with many fields and child documents) so I am not really sure
what's causing this aside from a Docker container restart.

Thanks!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Docker and Solr Indexing

2018-09-11 Thread Jan Høydahl
You have not shed any light on what the reason for the container restart was, 
and there is too little information about your setup and Solr usage to guess 
what goes on. Whether 4Gb is sufficient or not depends on how much data and 
queries you plan for each shard to handle, how much heap you give to Solr out 
of those 4G and many other factors.

Jan

> On 11 Sep 2018 at 08:05, solrnoobie  wrote:
> 
> So we have a dockerized aws environment with the solr docker container having
> only 4 gigs for max ram.
> 
> Our problem is whenever we index, the container containing the leader shard
> will restart after around 2 or less minutes of index time (batch is 50 docs
> per batch with 3 threads in our app thread pool). Because of the container
> restart, indexing will fail because solrJ will throw an invalid content type
> exception because of the quick container restart.
> 
> What could possibly cause the issues above?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Docker and Solr Indexing

2018-09-11 Thread Walter Underwood
4 Gb is very small for Solr.

Solr is not designed for Dockerized, fail-often use.

We use a LOT of Docker ECS, but all of our Solr servers are on EC2
instances. That’s about sixty instances in several clusters.

We run an 8 Gb heap for all our Solr instances. Instances in our biggest
cluster (in terms of index size and doc count) are c4.8xlarge, with 36 vCPU
and 60 Gb of RAM.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 10, 2018, at 11:05 PM, solrnoobie  wrote:
> 
> So we have a dockerized aws environment with the solr docker container having
> only 4 gigs for max ram.
> 
> Our problem is whenever we index, the container containing the leader shard
> will restart after around 2 or less minutes of index time (batch is 50 docs
> per batch with 3 threads in our app thread pool). Because of the container
> restart, indexing will fail because solrJ will throw an invalid content type
> exception because of the quick container restart.
> 
> What could possibly cause the issues above?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Docker and Solr Indexing

2018-09-10 Thread solrnoobie
So we have a dockerized aws environment with the solr docker container having
only 4 gigs for max ram.

Our problem is whenever we index, the container containing the leader shard
will restart after around 2 or less minutes of index time (batch is 50 docs
per batch with 3 threads in our app thread pool). Because of the container
restart, indexing will fail because solrJ will throw an invalid content type
exception because of the quick container restart.

What could possibly cause the issues above?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr indexing Duplicate URL's ending with /

2018-08-29 Thread Jan Høydahl
Hi,

You would have to direct this question to the crawler you are using, since it 
is the crawler that decides the document ID to send to Solr. Most crawlers will 
have configuration options to normalize the URL for each document.

However you could also try to clean the URL after it arrives in Solr. See
URLClassifyProcessor
https://lucene.apache.org/solr/guide/7_2/update-request-processors.html#general-use-updateprocessorfactories
which may perhaps help.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 29 Aug 2018 at 14:02, kunhu0...@gmail.com wrote:
> 
> Team,
> 
> Need a suggestion on how to remove duplicate entries while indexing to
> Solr. Below are the sample entries I see in the Solr collection; I need to
> remove the one which ends with /
> 
> https://www.abc.com/2018/test.html
> https://www.abc.com/2018/test.html/
> 
> 
> Thank you
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
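
A third option is to normalize the id in whatever client code sends documents
to Solr before they are indexed. A minimal sketch (the class name and the
simple trailing-slash rule are purely illustrative):

public class UrlIdNormalizer {
  // Strip one trailing slash so both URL variants map to the same Solr id.
  static String normalize(String url) {
    return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
  }

  public static void main(String[] args) {
    // prints: https://www.abc.com/2018/test.html
    System.out.println(normalize("https://www.abc.com/2018/test.html/"));
  }
}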



Solr indexing Duplicate URL's ending with /

2018-08-29 Thread kunhu0...@gmail.com
Team,

Need a suggestion on how to remove duplicate entries while indexing to
Solr. Below are the sample entries I see in the Solr collection; I need to
remove the one which ends with /

https://www.abc.com/2018/test.html
https://www.abc.com/2018/test.html/


Thank you



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Indexing error

2018-08-28 Thread Shawn Heisey

On 8/28/2018 6:03 AM, kunhu0...@gmail.com wrote:

possible analysis error: Document contains at least one immense term in
field="content" (whose UTF8 encoding is longer than the max length 32766),


It's telling you exactly what is wrong.

The field named "content" is probably using a field class with no 
analysis, or using the Keyword Tokenizer so the whole field gets treated 
as a single term.  The length of that field for at least one of your 
documents is longer than 32766 characters. Maybe it's bytes -- a UTF8 
character can be more than a single byte.  Lucene has a limit on term 
length, and your input exceeded that length.


If you change the field type for content to something that's analyzed 
(split into words, basically) then this problem would likely go away.


Thanks,
Shawn



Solr Indexing error

2018-08-28 Thread kunhu0...@gmail.com
Hello All,

Need help on the error related to Solr indexing. We are using Solr 6.6.3 and
Nutch crawler 1.14. While indexing data to Solr we see errors as below

possible analysis error: Document contains at least one immense term in
field="content" (whose UTF8 encoding is longer than the max length 32766),
all of which were skipped.  Please correct the analyzer to not produce such
terms.  The prefix of the first immense term is: '[84, 69, 82, 77, 83, 32,
79, 70, 32, 85, 83, 69, 10, 69, 102, 102, 101, 99, 116, 105, 118, 101, 32,
68, 97, 116, 101, 58, 32, 74]...', original message: bytes can be at most
32766 in length; got 40638. Perhaps the document has an indexed string field
(solr.StrField) which is too large.

Can anyone please help






--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Keep Solr Indexing live

2017-12-20 Thread shashiroushan
Hello All,

I am using DIH to import data from SQL to Solr using the URL
"/dataimport?command=full-import&clean=true".
My problem is, when the SQL query returns zero records, Solr also returns zero
records. But as per my project requirement, the Solr index should be cleaned only
when the SQL query returns records.
So I can't pass "clean=false".

Please suggest.

Regards,
Shashi Roushan


Re: Urgent - Solr indexing is taking hours and dashboard page is not getting rendered at all :(

2017-03-09 Thread Shawn Heisey
On 3/9/2017 8:16 AM, Gaurav Srivastava wrote:
> I have a eCommerce site built on Hybris 6.2.0.4 which uses SOLR OOB
> (vendor=hybris version=6.2.0.2) as a search engine. I am facing below
> 2 problems :

6.2.0.2 is not a valid Solr version number.  They only have three
numbers, not four.  If hybris repackages Solr, then you might need to go
to them for help, as only they will know what they have changed.

If the Solr version is 6.4.0 or 6.4.1, there is a severe performance
degradation that is fixed by 6.4.2, which was just released.

https://issues.apache.org/jira/browse/SOLR-10130

> 1. Indexing is taking a lot of time (4-5 hours) in the last couple of weeks.
> (data has increased though)
> 2. Our dashboard page is hanging; please find the details below.
>
> "The product category dropdown in the main site navigation is rendered
> using a query to Solr. On the first request where the navigation is
> displayed (which would effectively be immediately following login),
> Hybris queries for this information and stores it in the session cache
> for the user. If SOLR accepts the connection but never responds (or
> does not respond within the overall page timeout), then the page
> rendering times out and the user is left with a partially rendered
> page which is unusable."
>
> We have 4 jobs which indexes the data(full/update), please find the
> count of the data which gets indexed during each job :
>
> 1. Job A indexes : 4   products
> 2. Job B indexes : 12 products
> 3. Job C indexes : 120   products
> 4. Job D indexes : 90 products
>
> I am sure SOLR can handle these products easily and we should not face
> this issue. Hardware configuration is below
>
> 1. We are using 2 Cores of 16GB RAM
> 2. Disk space is 8 GB, and out of that 72% is already utilized.

This is very vague information.  "2 Cores of 16GB RAM" could mean just
about anything.  Solr cores?  CPU cores?  16GB of what?  Java/Solr
heap?  Total system memory?

The disk space mentioned is also unhelpful.  Is that total disk space? 
Size of the index?  The 72% number won't mean anything without quite a
bit more info.

> Below is the configuration and various key values from the hybris side. Any
> help in this regard will be great :)

Your attachments did not make it to the list.  They almost never do.  If
you need to share something that's too big to include as regular text in
your email, store it in a semi-permanent place on the public Internet
and provide a URL to access it.  Remember that few people here know
anything about hybris.  Information from that system may not help with Solr.

One of the biggest bottlenecks in Solr indexing is actually retrieving
the data from the source system.  This is the most common reason for
slow indexing.

Other possible problems that cause slow indexing include committing
after every update request, sending one document at a time in each
update request instead of batching them, and only using one
thread/connection to index.

Here is some general info on Solr performance problems.  With the
information available, I have no idea whether this page will even be
useful to you:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
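
To make the batching advice concrete, here is a minimal SolrJ sketch (the
collection name, batch size and fields are placeholders): documents are sent
in batches and committed once at the end, not per update request. Running a
few such threads in parallel usually removes the client as the bottleneck.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/products").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 100_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "product-" + i);
        batch.add(doc);
        if (batch.size() == 1000) {   // send batches, not single documents
          client.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);
      }
      client.commit();                // one commit at the end, not per request
    }
  }
}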



Re: Urgent - Solr indexing is taking hours and dashboard page is not getting rendered at all :(

2017-03-09 Thread Charlie Hull

On 09/03/2017 15:16, Gaurav Srivastava wrote:

Hi All,

I have a eCommerce site built on Hybris 6.2.0.4 which uses SOLR OOB
(vendor=hybris
version=6.2.0.2) as a search engine. I am facing below 2 problems :

1. Indexing is taking a lot of time (4-5 hours) in the last couple of weeks.
(data has increased though)
2. Our dashboard page is hanging; please find the details below.

"The product category dropdown in the main site navigation is rendered
using a query to Solr. On the first request where the navigation is
displayed (which would effectively be immediately following login),
Hybris queries for this information and stores it in the session cache
for the user. If SOLR accepts the connection but never responds (or does
not respond within the overall page timeout), then the page rendering
times out and the user is left with a partially rendered page which is
unusable. "

We have 4 jobs which indexes the data(full/update), please find the
count of the data which gets indexed during each job :

1. Job A indexes : 4   products
2. Job B indexes : 12 products
3. Job C indexes : 120   products
4. Job D indexes : 90 products

I am sure SOLR can handle these products easily and we should not face
this issue. Hardware configuration is below

1. We are using 2 Cores of 16GB RAM
2. Disk space is 8 GB, and out of that 72% is already utilized.

Below is the configuration and various key values from the hybris side. Any
help in this regard will be great :)


I think the Apache mailserver has stripped out your images.

From our experience, Hybris does all kinds of odd things with Solr and 
I wouldn't be surprised if it is sending a crazy query that is timing 
out. You should probably check the Solr logs to see what's being sent.


You should also of course ask Hybris support for help.

Charlie


Inline image 1

--
Regards
Gaurav Srivastava



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Urgent - Solr indexing is taking hours and dashboard page is not getting rendered at all :(

2017-03-09 Thread Gaurav Srivastava
Hi All,

I have a eCommerce site built on Hybris 6.2.0.4 which uses SOLR OOB
(vendor=hybris
version=6.2.0.2) as a search engine. I am facing below 2 problems :

1. Indexing is taking a lot of time (4-5 hours) in the last couple of weeks (data
has increased, though).
2. Our dashboard page is hanging; please find the details below.

"The product category dropdown in the main site navigation is rendered
using a query to Solr. On the first request where the navigation is
displayed (which would effectively be immediately following login), Hybris
queries for this information and stores it in the session cache for the
user. If SOLR accepts the connection but never responds (or does not
respond within the overall page timeout), then the page rendering times out
and the user is left with a partially rendered page which is unusable. "

We have 4 jobs which indexes the data(full/update), please find the count
of the data which gets indexed during each job :

1. Job A indexes : 4   products
2. Job B indexes : 12 products
3. Job C indexes : 120   products
4. Job D indexes : 90 products

I am sure SOLR can handle these products easily and we should not face this
issue. Hardware configuration is below

1. We are using 2 Cores of 16GB RAM
2. Disk space is 8 GB, and out of that 72% is already utilized.

Below is the configuration and various key values from the hybris side. Any help
in this regard will be great :)

[image: Inline image 1]

-- 
Regards
Gaurav Srivastava


Re: How to know if SOLR indexing is completed programmatically

2016-09-30 Thread subinalex
Thanks a lot Christian..
let me explore that..


:)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-know-if-SOLR-indexing-is-completed-prorammatically-tp4298799p4298807.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to know if SOLR indexing is completed programmatically

2016-09-30 Thread Christian Ortner
Hi,

the admin console is backed by a JSON API. You can run the same requests it
uses programmatically. Find them easily by checking your browser debug
tools' networking tab.

Regards,
Chris

On Fri, Sep 30, 2016 at 10:29 AM, subinalex  wrote:

> Hi Guys,
>
> We are running back-to-back Solr indexing batch jobs. We need to ensure that
> the triggered batch indexing is completed before starting the next.
>
> I know we can check the status by viewing the 'Logging' and 'CoreAdmin'
> page
> of solr admin console.
>
> But, we need to find this out programmatically and, based on this, trigger the
> next solr indexing batch job.
>
>
> Please help with this.
>
>
> :)
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/How-to-know-if-SOLR-indexing-is-completed-
> prorammatically-tp4298799.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
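
As a rough illustration of polling that API from Java -- assuming here that
the batch jobs run through the DataImportHandler on a core named "mycore",
and using a deliberately crude string check where a real client should parse
the JSON:

import java.net.URL;
import java.util.Scanner;

public class WaitUntilIdle {
  public static void main(String[] args) throws Exception {
    URL status = new URL(
        "http://localhost:8983/solr/mycore/dataimport?command=status&wt=json");
    while (true) {
      try (Scanner s = new Scanner(status.openStream(), "UTF-8")) {
        String body = s.useDelimiter("\\A").next();
        if (body.contains("\"status\":\"idle\"")) {
          break;            // the import has finished; start the next job
        }
      }
      Thread.sleep(5000);   // poll every 5 seconds
    }
    System.out.println("Indexing complete.");
  }
}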


How to know if SOLR indexing is completed programmatically

2016-09-30 Thread subinalex
Hi Guys,

We are running back-to-back Solr indexing batch jobs. We need to ensure that
the triggered batch indexing is completed before starting the next.

I know we can check the status by viewing the 'Logging' and 'CoreAdmin' page
of solr admin console.

But, we need to find this out programmatically and, based on this, trigger the
next solr indexing batch job.


Please help with this.


:)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-know-if-SOLR-indexing-is-completed-prorammatically-tp4298799.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr indexing sequentially or randomly?

2016-06-14 Thread Zheng Lin Edwin Yeo
Thank you.

On 14 June 2016 at 20:03, Mikhail Khludnev 
wrote:

> Sequentially.
>
> On Tue, Jun 14, 2016 at 12:32 PM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I would like to find out: does Solr write to the disk sequentially or
> > randomly during indexing?
> > I'm using Solr 6.0.1.
> >
> > Regards,
> > Edwin
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Re: Solr indexing sequentially or randomly?

2016-06-14 Thread Mikhail Khludnev
Sequentially.

On Tue, Jun 14, 2016 at 12:32 PM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I would like to find out: does Solr write to the disk sequentially or
> randomly during indexing?
> I'm using Solr 6.0.1.
>
> Regards,
> Edwin
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Solr indexing sequentially or randomly?

2016-06-14 Thread Zheng Lin Edwin Yeo
Hi,

I would like to find out: does Solr write to the disk sequentially or
randomly during indexing?
I'm using Solr 6.0.1.

Regards,
Edwin


Re: solr Indexing PDF attachments not working. in ubuntu

2016-01-23 Thread Binoy Dalal
Do you see any exceptions in the solr log?

On Sat, 23 Jan 2016, 16:29 Moncif Aidi  wrote:

> Hi,
>
> I have a problem with integrating Solr on an Ubuntu server. Before using Solr
> on the Ubuntu server I tested it on my Mac and it was working perfectly: it indexed
> my PDF, Doc, Docx documents. So after installing Solr on the Ubuntu server and
> using the same configuration files and libraries, I've found out that Solr
> doesn't index PDF documents. But I can search over .Doc and .Docx documents.
> Here are some parts of my solrconfig.xml contents:
>
> <lib dir="..." regex=".*\.jar" />
> <lib dir="..." regex="solr-cell-\d.*\.jar" />
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>     <str name="fmap.content">_text_</str>
>   </lst>
> </requestHandler>
>
>
> --
> M:+212 658541045
> Linkedin
> <
> https://www.linkedin.com/profile/view?id=131220035&trk=nav_responsive_tab_profile
> >
>
> <
> https://www.linkedin.com/profile/view?id=131220035&trk=nav_responsive_tab_profile
> >
> |  Facebook
>  |  *Skype :* moncif44
>
-- 
Regards,
Binoy Dalal


solr Indexing PDF attachments not working. in ubuntu

2016-01-23 Thread Moncif Aidi
Hi,

I have a problem with integrating Solr on an Ubuntu server. Before using Solr
on the Ubuntu server I tested it on my Mac and it was working perfectly: it indexed
my PDF, Doc, Docx documents. So after installing Solr on the Ubuntu server and
using the same configuration files and libraries, I've found out that Solr
doesn't index PDF documents. But I can search over .Doc and .Docx documents.
Here are some parts of my solrconfig.xml contents:

<lib dir="..." regex=".*\.jar" />
<lib dir="..." regex="solr-cell-\d.*\.jar" />

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>


-- 
M:+212 658541045
Linkedin



|  Facebook
 |  *Skype :* moncif44


Re: Problem with Solr indexing "non-searchable" pdf files

2015-12-17 Thread Erick Erickson
Not sure how much help I can be, I have no clue what DSpace is
doing with Solr.

If you're willing to try to index straight to Solr, you can always use
SolrJ to parse the files, it's actually not very hard. Here's an example:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

some database stuff is mixed in there, but that can be removed.

Otherwise, perhaps the DSpace folks have more guidance on
what/how they expect to do with PDFs.

Best,
Erick

On Thu, Dec 17, 2015 at 6:54 AM, RICARDO EITO BRUN  wrote:
> Hi,
> I am using SOLR as part of the dspace 5.4 SW application.
> I have a problem when running the dspace indexing command
> (index-discovery). Most of the files are not being added to the index, and
> an exception is raised.
>
> It seems that Solr does not process the PDF files that are the result of
> scanning without OCR (non-searchable PDF files).
>
> Is there any way to tell Solr that the document metadata should be
> processed even if the PDF file itself cannot be indexed?
>
> Any suggestion on how to make the pdf files "searchable" using some kind of
> batch process/tool?
>
> Thanks in advance,
> Ricardo
>
> --
> RICARDO EITO BRUN
> Universidad Carlos III de Madrid


Problem with Solr indexing "non-searchable" pdf files

2015-12-17 Thread RICARDO EITO BRUN
Hi,
I am using SOLR as part of the dspace 5.4 SW application.
I have a problem when running the dspace indexing command
(index-discovery). Most of the files are not being added to the index, and
an exception is raised.

It seems that Solr does not process the PDF files that are the result of
scanning without OCR (non-searchable PDF files).

Is there any way to tell Solr that the document metadata should be
processed even if the PDF file itself cannot be indexed?

Any suggestion on how to make the pdf files "searchable" using some kind of
batch process/tool?

Thanks in advance,
Ricardo

-- 
RICARDO EITO BRUN
Universidad Carlos III de Madrid


Re: solr indexing warning

2015-11-20 Thread Shawn Heisey
On 11/20/2015 12:33 AM, Midas A wrote:
> As we are using this server as a master server, there are no queries running on
> it. In that case, should I remove this configuration from the config file?

The following cache info says that there ARE queries being run on this
server:

> QueryResultCache:
> 
> lookups:3841
> hits:0
> hitratio:0.00
> inserts:4841
> evictions:3841
> size:1000
> warmupTime:213
> cumulative_lookups:58438
> cumulative_hits:153
> cumulative_hitratio:0.00
> cumulative_inserts:58285
> cumulative_evictions:57285

These queries might be related to indexing, and not actual user
searches.  On my indexes, I query for the existence of the documents I'm
about to delete, to make sure there's actually a need to run the delete.

This is the only cache that has a nonzero warmupTime, but it only took a
fifth of a second to warm 1000 queries, so this is not a problem.  It
has a very low hit ratio, so you could disable it and not really see a
performance difference.

Emir asked how you're doing your commits.  I'd like to know the same
thing, as well as how frequently you're doing them.

This is the best guide out there regarding commits:

http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

One of the best pieces of advice on that page is this:

---
Don't listen to your product manager who says "we need no more than 1
second latency". Really.
---

Another piece of advice on that page is to set the hard commit
(autoCommit) interval to 15 seconds.  I personally think this is too
frequent, but many people are using that configuration and have reported
no problems with it.

Thanks,
Shawn



Re: solr indexing warning

2015-11-20 Thread Emir Arnautovic

Hi,
Since this is a master node, and not expected to serve queries, you can
disable caches completely. However, from the numbers, cache autowarming is
not the issue here, but probably the frequency of commits and/or warmup
queries. How do you do commits? Since this is master-slave, I don't see a
reason to have them too frequently. If you need NRT you should switch to
SolrCloud. Do you have warmup queries? You don't need them on a master node.


Regards,
Emir

On 20.11.2015 08:33, Midas A wrote:

thanks Shawn,

As we are using this server as a master server, there are no queries running on
it. In that case, should I remove this configuration from the config file?

Total Docs: 40 0

Stats
#

Document cache :
lookups:823
hits:4
hitratio:0.00
inserts:820
evictions:0
size:820
warmupTime:0
cumulative_lookups:24474
cumulative_hits:1746
cumulative_hitratio:0.07
cumulative_inserts:22728
cumulative_evictions:13345


fieldcache:
stats:
entries_count:2
entry#0:'SegmentCoreReader(owner=_3bph(4.2.1):C3918553)'=>'_version_',long,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_LONG_PARSER=>org.apache.lucene.search.FieldCacheImpl$LongsFromArray#1919958905
entry#1:'SegmentCoreReader(owner=_3bph(4.2.1):C3918553)'=>'_version_',class
org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=>org.apache.lucene.util.Bits$MatchAllBits#660036513
insanity_count:0


fieldValuecache:
lookups:0
hits:0
hitratio:0.00
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:0
cumulative_hits:0
cumulative_hitratio:0.00
cumulative_inserts:0
cumulative_evictions:0


filtercache:
lookups:0
hits:0
hitratio:0.00
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:0
cumulative_hits:0
cumulative_hitratio:0.00
cumulative_inserts:0
cumulative_evictions:0


QueryResultCache:
lookups:3841
hits:0
hitratio:0.00
inserts:4841
evictions:3841
size:1000
warmupTime:213
cumulative_lookups:58438
cumulative_hits:153
cumulative_hitratio:0.00
cumulative_inserts:58285
cumulative_evictions:57285



Please suggest.



On Fri, Nov 20, 2015 at 12:15 PM, Shawn Heisey  wrote:


On 11/19/2015 11:06 PM, Midas A wrote:


> > <filterCache size="1000" initialSize="1000" autowarmCount="1000"/>
> > <queryResultCache size="1000" initialSize="1000" autowarmCount="1000"/>
> > <documentCache size="1000" initialSize="1000" autowarmCount="1000"/>

Your caches are quite large.  More importantly, your autowarmCount is
very large.  How many documents are in each of your cores?  If you check
the Plugins/Stats area in the admin UI for your core(s), how many
entries are actually in each of those three caches?  Also shown there is
the number of milliseconds that it took for each cache to warm.

The documentCache cannot be autowarmed, so that config is not doing
anything.

When a cache is autowarmed, what this does is look up the key for the
top N entries in the old cache, which contains the query used to
generate that cache entry, and executes each of those queries on the new
index to populate the new cache.

This means that up to 2000 queries are being executed every time you
commit and open a new searcher.  The actual number may be less, if the
filterCache and queryResultCache are not actually reaching 1000 entries
each.  Autowarming can take a significant amount of time when the
autowarmCount is high.  It should be lowered.

Thanks,
Shawn




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: solr indexing warning

2015-11-19 Thread Midas A
thanks Shawn,

As we are using this server as a master server, there are no queries running on
it. In that case, should I remove this configuration from the config file?

Total Docs: 40 0

Stats
#

Document cache :
lookups:823
hits:4
hitratio:0.00
inserts:820
evictions:0
size:820
warmupTime:0
cumulative_lookups:24474
cumulative_hits:1746
cumulative_hitratio:0.07
cumulative_inserts:22728
cumulative_evictions:13345


fieldcache:
stats:
entries_count:2
entry#0:'SegmentCoreReader(owner=_3bph(4.2.1):C3918553)'=>'_version_',long,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_LONG_PARSER=>org.apache.lucene.search.FieldCacheImpl$LongsFromArray#1919958905
entry#1:'SegmentCoreReader(owner=_3bph(4.2.1):C3918553)'=>'_version_',class
org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=>org.apache.lucene.util.Bits$MatchAllBits#660036513
insanity_count:0


fieldValuecache:
lookups:0
hits:0
hitratio:0.00
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:0
cumulative_hits:0
cumulative_hitratio:0.00
cumulative_inserts:0
cumulative_evictions:0


filtercache:
lookups:0
hits:0
hitratio:0.00
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:0
cumulative_hits:0
cumulative_hitratio:0.00
cumulative_inserts:0
cumulative_evictions:0


QueryResultCache:
lookups:3841
hits:0
hitratio:0.00
inserts:4841
evictions:3841
size:1000
warmupTime:213
cumulative_lookups:58438
cumulative_hits:153
cumulative_hitratio:0.00
cumulative_inserts:58285
cumulative_evictions:57285



Please suggest.



On Fri, Nov 20, 2015 at 12:15 PM, Shawn Heisey  wrote:

> On 11/19/2015 11:06 PM, Midas A wrote:
> > <filterCache size="1000" initialSize="1000" autowarmCount="1000"/>
> > <queryResultCache size="1000" initialSize="1000" autowarmCount="1000"/>
> > <documentCache size="1000" initialSize="1000" autowarmCount="1000"/>
>
> Your caches are quite large.  More importantly, your autowarmCount is
> very large.  How many documents are in each of your cores?  If you check
> the Plugins/Stats area in the admin UI for your core(s), how many
> entries are actually in each of those three caches?  Also shown there is
> the number of milliseconds that it took for each cache to warm.
>
> The documentCache cannot be autowarmed, so that config is not doing
> anything.
>
> When a cache is autowarmed, what this does is look up the key for the
> top N entries in the old cache, which contains the query used to
> generate that cache entry, and executes each of those queries on the new
> index to populate the new cache.
>
> This means that up to 2000 queries are being executed every time you
> commit and open a new searcher.  The actual number may be less, if the
> filterCache and queryResultCache are not actually reaching 1000 entries
> each.  Autowarming can take a significant amount of time when the
> autowarmCount is high.  It should be lowered.
>
> Thanks,
> Shawn
>
>


Re: solr indexing warning

2015-11-19 Thread Shawn Heisey
On 11/19/2015 11:06 PM, Midas A wrote:
> <filterCache size="1000" initialSize="1000" autowarmCount="1000"/>
> <queryResultCache size="1000" initialSize="1000" autowarmCount="1000"/>
> <documentCache size="1000" initialSize="1000" autowarmCount="1000"/>

Your caches are quite large.  More importantly, your autowarmCount is
very large.  How many documents are in each of your cores?  If you check
the Plugins/Stats area in the admin UI for your core(s), how many
entries are actually in each of those three caches?  Also shown there is
the number of milliseconds that it took for each cache to warm.

The documentCache cannot be autowarmed, so that config is not doing
anything.

When a cache is autowarmed, what this does is look up the key for the
top N entries in the old cache, which contains the query used to
generate that cache entry, and executes each of those queries on the new
index to populate the new cache.

This means that up to 2000 queries are being executed every time you
commit and open a new searcher.  The actual number may be less, if the
filterCache and queryResultCache are not actually reaching 1000 entries
each.  Autowarming can take a significant amount of time when the
autowarmCount is high.  It should be lowered.

Thanks,
Shawn



Re: solr indexing warning

2015-11-19 Thread Midas A
Thanks Emir,

So what do we need to do to resolve this issue?





This is my Solr configuration. What changes should I make to avoid the
warning?

~abhishek

On Thu, Nov 19, 2015 at 6:37 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> This means that one searcher is still warming when another searcher is created
> due to a commit with openSearcher=true. This can be due to frequent commits
> or searcher warmup taking too long.
>
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> On 19.11.2015 12:16, Midas A wrote:
>
>> Getting the following log on Solr:
>>
>>
>> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>>
>>


Re: solr indexing warning

2015-11-19 Thread Emir Arnautovic
This means that one searcher is still warming when another searcher is
created due to a commit with openSearcher=true. This can be due to
frequent commits or searcher warmup taking too long.


Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 19.11.2015 12:16, Midas A wrote:

Getting the following log on Solr:


PERFORMANCE WARNING: Overlapping onDeckSearchers=2



solr indexing warning

2015-11-19 Thread Midas A
Getting the following log on Solr:


PERFORMANCE WARNING: Overlapping onDeckSearchers=2


Re: Problem with the Content Field during Solr Indexing

2015-11-02 Thread Susheel Kumar
Hi Shruti,

If you are looking to index images to make them searchable (Image Search)
then you will have to look at LIRE (Lucene Image Retrieval)
http://www.lire-project.net/  and can follow Lire Solr Plugin at this site
https://bitbucket.org/dermotte/liresolr.

Thanks,
Susheel

On Sat, Oct 31, 2015 at 9:46 PM, Zheng Lin Edwin Yeo 
wrote:

> Hi Shruti,
>
> From what I understand, the /update/extract handler is for indexing
> rich-text documents, and does not support ".png" files.
>
> It only supports the following files format: pdf, doc, docx, ppt, pptx,
> xls, xlsx, odt, odp, ods, ott, otp, ots, rtf, htm, html, txt, log
> If you use the default post.jar, I believe the other formats will get
> filtered out.
>
> When I tried to index a ".png" file in my custom handler, it just indexed "
> " in the content.
>
> Regards,
> Edwin
>
>
>
> On 31 October 2015 at 09:35, Shruti Mundra  wrote:
>
> > Hi Edwin,
> >
> > The file extension of the image file is ".png" and we are following this
> > url for indexing:
> > "
> >
> >
> http://blog.thedigitalgroup.com/vijaym/wp-content/uploads/sites/11/2015/07/SolrImageExtract.png
> > "
> >
> > Thanks and Regards,
> > Shruti Mundra
> >
> > On Thu, Oct 29, 2015 at 8:33 PM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> > >
> > wrote:
> >
> > > The "\n" actually means new line as decoded by Solr from the indexed
> > > document.
> > >
> > > What is the file extension of your image file, and which method are you
> > > using to do the indexing?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 30 October 2015 at 04:38, Shruti Mundra  wrote:
> > >
> > > > Hi,
> > > >
> > > > When I'm trying to index an image file directly to Solr, the attribute
> > > > content consists of trails of "\n"s and not the data.
> > > > We are successful in getting the metadata for that image.
> > > >
> > > > Can anyone help us out on how we could get the content along with the
> > > > Metadata.
> > > >
> > > > Thanks!
> > > >
> > > > - Shruti Mundra
> > > >
> > >
> >
>


Re: Problem with the Content Field during Solr Indexing

2015-10-31 Thread Zheng Lin Edwin Yeo
Hi Shruti,

From what I understand, the /update/extract handler is for indexing
rich-text documents, and does not support ".png" files.

It only supports the following file formats: pdf, doc, docx, ppt, pptx,
xls, xlsx, odt, odp, ods, ott, otp, ots, rtf, htm, html, txt, log.
If you use the default post.jar, I believe the other formats will get
filtered out.

When I tried to index a ".png" file in my custom handler, it just indexed
"\n" in the content.

Regards,
Edwin



On 31 October 2015 at 09:35, Shruti Mundra  wrote:

> Hi Edwin,
>
> The file extension of the image file is ".png" and we are following this
> url for indexing:
> "
>
> http://blog.thedigitalgroup.com/vijaym/wp-content/uploads/sites/11/2015/07/SolrImageExtract.png
> "
>
> Thanks and Regards,
> Shruti Mundra
>
> On Thu, Oct 29, 2015 at 8:33 PM, Zheng Lin Edwin Yeo  >
> wrote:
>
> > The "\n" actually means new line as decoded by Solr from the indexed
> > document.
> >
> > What is your file extension of your image file, and which method are you
> > using to do the indexing?
> >
> > Regards,
> > Edwin
> >
> >
> > On 30 October 2015 at 04:38, Shruti Mundra  wrote:
> >
> > > Hi,
> > >
> > > When I'm trying index an image file directly to Solr, the attribute
> > > content, consists of trails of "\n"s and not the data.
> > > We are successful in getting the metadata for that image.
> > >
> > > Can anyone help us out on how we could get the content along with the
> > > Metadata.
> > >
> > > Thanks!
> > >
> > > - Shruti Mundra
> > >
> >
>


Re: Problem with the Content Field during Solr Indexing

2015-10-30 Thread Shruti Mundra
Hi Edwin,

The file extension of the image file is ".png" and we are following this
url for indexing:
"
http://blog.thedigitalgroup.com/vijaym/wp-content/uploads/sites/11/2015/07/SolrImageExtract.png
"

Thanks and Regards,
Shruti Mundra

On Thu, Oct 29, 2015 at 8:33 PM, Zheng Lin Edwin Yeo 
wrote:

> The "\n" actually means new line as decoded by Solr from the indexed
> document.
>
> What is your file extension of your image file, and which method are you
> using to do the indexing?
>
> Regards,
> Edwin
>
>
> On 30 October 2015 at 04:38, Shruti Mundra  wrote:
>
> > Hi,
> >
> > When I'm trying index an image file directly to Solr, the attribute
> > content, consists of trails of "\n"s and not the data.
> > We are successful in getting the metadata for that image.
> >
> > Can anyone help us out on how we could get the content along with the
> > Metadata.
> >
> > Thanks!
> >
> > - Shruti Mundra
> >
>


Re: Problem with the Content Field during Solr Indexing

2015-10-29 Thread Zheng Lin Edwin Yeo
The "\n" actually means new line as decoded by Solr from the indexed
document.

What is the file extension of your image file, and which method are you
using to do the indexing?

Regards,
Edwin


On 30 October 2015 at 04:38, Shruti Mundra  wrote:

> Hi,
>
> When I'm trying index an image file directly to Solr, the attribute
> content, consists of trails of "\n"s and not the data.
> We are successful in getting the metadata for that image.
>
> Can anyone help us out on how we could get the content along with the
> Metadata.
>
> Thanks!
>
> - Shruti Mundra
>


Problem with the Content Field during Solr Indexing

2015-10-29 Thread Shruti Mundra
Hi,

When I'm trying to index an image file directly to Solr, the content
attribute consists of trails of "\n"s and not the data.
We are successful in getting the metadata for that image.

Can anyone help us out on how we could get the content along with the
metadata?

Thanks!

- Shruti Mundra


Re: Solr indexing based on last_modified

2015-08-17 Thread Erick Erickson
Well, you'll have to have some kind of timestamp that you can reference,
and only re-send files that have a newer timestamp. Or keep a DB around
with file path/last-indexed timestamp pairs, or something similar.

Best,
Erick

On Mon, Aug 17, 2015 at 12:36 PM, coolmals  wrote:
> I have a file system. I have a scheduler which will call solr in scheduled
> time interval. Any updates to the file system must be indexed by solr. Only
> changes must be re-indexed as file system is huge and cannot be re-indexed
> every time.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-indexing-based-on-last-modified-tp4223506p4223511.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr indexing based on last_modified

2015-08-17 Thread coolmals
I have a file system and a scheduler which will call Solr at a scheduled
time interval. Any updates to the file system must be indexed by Solr. Only
changes must be re-indexed, as the file system is huge and cannot be
re-indexed every time.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-based-on-last-modified-tp4223506p4223511.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr indexing based on last_modified

2015-08-17 Thread Erick Erickson
There's no way that I know of with post.jar. Post.jar was never really intended
as a production tool, and sending all the files to Solr for parsing (pdf, word
and the like) is putting quite a load on the Solr server.

What is your use-case? You might consider a SolrJ program; it would be
simple enough to pass it a timestamp and only parse/send docs to Solr
if the date was more recent. Here's an example (no timestamp
processing though):

https://lucidworks.com/blog/indexing-with-solrj/
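
A rough sketch of the timestamp idea layered onto that approach -- walk the
directory, skip anything older than the last run, and batch the rest through
SolrJ (the cutoff handling, URL, and field names are made up for illustration):

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IncrementalIndexer {
      public static void main(String[] args) throws Exception {
        // Hypothetical cutoff: persist the time of the previous run somewhere durable.
        final Instant lastRun = Instant.parse("2015-08-16T00:00:00Z");

        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files")) {
          List<SolrInputDocument> batch = new ArrayList<>();
          Files.walkFileTree(Paths.get("/data/docs"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                throws IOException {
              // Only re-send files modified since the previous indexing run.
              if (attrs.lastModifiedTime().toInstant().isAfter(lastRun)) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.toString());
                doc.addField("last_modified", attrs.lastModifiedTime().toString());
                // ... parse content here (e.g. with Tika), as in the linked example ...
                batch.add(doc);
              }
              return FileVisitResult.CONTINUE;
            }
          });
          if (!batch.isEmpty()) {
            client.add(batch);
            client.commit();
          }
        }
      }
    }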

Best,
Erick

On Mon, Aug 17, 2015 at 12:21 PM, coolmals  wrote:
> I want to update the index of a file only if last_modified has changed in the
> file. I am running post.jar with fileTypes="*", i would want to update the
> index of the files only if there is any change in them since the last update
> of index. Can you let me know how to achieve this?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-indexing-based-on-last-modified-tp4223506.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Solr indexing based on last_modified

2015-08-17 Thread coolmals
I want to update the index of a file only if last_modified has changed in the
file. I am running post.jar with fileTypes="*"; I would want to update the
index of the files only if there is any change in them since the last update
of the index. Can you let me know how to achieve this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-based-on-last-modified-tp4223506.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Optimizing Solr indexing over WAN

2015-07-22 Thread Markus Jelsma
Hello - Depending on the size difference between source data and indexed data,
you can gzip/bzip2 your source JSON/XML, then transfer it over the WAN and
index it locally. This is the fastest method in every case we have encountered.
 
-Original message-
> From:Reitzel, Charles 
> Sent: Wednesday 22nd July 2015 17:43
> To: solr-user@lucene.apache.org
> Subject: RE: Optimizing Solr indexing over WAN
> 
> Indexing over a WAN will be slow, limited by the bandwidth of the pipe.
> 
> I think you will be better served to move the data in bulk to the same LAN as 
> your target solr instances.    I would suggest ZIP+scp ... or your favorite 
> file system replication/synchronization tool.
> 
> It's true, if you are using blocking I/O over a high latency LAN, then a few 
> threads will let you make use of all the available bandwidth.  But, 
> typically, it takes very few threads to keep the pipe full.   But, after that 
> point, more threads do no good.   But this is a general sort of thing that 
> scp (or your favorite tool) will handle for you.   No need to roll your own.
> 
> Further, I don't think threading in the client buys you all that much 
> compared to bulk updates.  If you load 1000 documents at a time using SolrJ, 
> it will do a good job of spreading out the load over the shards.   
> 
> If you find it takes a bit of time to build each update request document 
> (with no indexing happening meanwhile), then you might prepare these in a 
> background thread and place into a request queue.  Thus, the foreground 
> thread is always fetching the next request, sending it or waiting for a 
> response.   The synchronization cost on a request queue will be negligible.   
>  If you find the foreground thread is waiting too much, make the batch size 
> bigger.   If you find the queue length growing too large, put the background 
> thread to sleep until the queue length drops down to a reasonable length.   
> All of this complexity may buy you a few % improvement in indexing speed.  
> Probably not worth the development cost ...
> 
> -Original Message-
> From: Ali Nazemian [mailto:alinazem...@gmail.com] 
> Sent: Wednesday, July 22, 2015 2:21 AM
> To: solr-user@lucene.apache.org
> Subject: Optimizing Solr indexing over WAN
> 
> Dears,
> Hi,
> I know that there are lots of tips about how to make the Solr indexing 
> faster. Probably some of the most important ones which are considered in 
> client side are choosing batch indexing and multi-thread indexing. There are 
> other important factors that are server side which I dont want to mentioned 
> here. Anyway my question would be is there any best practice for number of 
> client threads and the size of batch available over WAN network?
> Since the client and servers are connected over WAN network probably some of 
> the performance conditions such as network latency, bandwidth and etc.
> are different from LAN network. Another think that is matter for me is the 
> fact that document sizes are might be different in diverse scenarios. For 
> example when you want to index web-pages the size of document might be from 
> 1KB to 200KB. In such case choosing batch size according to the number of 
> documents is probably not the best way of optimizing index performance.
> Probably choosing based on the size of batch size in KB/MB would be better 
> from the network point of view. However, from the Solr side document numbers 
> matter.
> So if I want to summarize my questions here what am I looking for:
> 1- Is there any best practice available for Solr client side performance 
> tuning over WAN network for the purpose of indexing/reindexing/updating?
> Does it different from LAN network?
> 2- Which one is matter: number of documents or the total size of documents in 
> batch?
> 
> Best regards.
> 
> --
> A.Nazemian
> 


RE: Optimizing Solr indexing over WAN

2015-07-22 Thread Reitzel, Charles
Indexing over a WAN will be slow, limited by the bandwidth of the pipe.

I think you will be better served to move the data in bulk to the same LAN as 
your target solr instances.I would suggest ZIP+scp ... or your favorite 
file system replication/synchronization tool.

It's true, if you are using blocking I/O over a high-latency WAN link, then a
few threads will let you make use of all the available bandwidth. But,
typically, it takes very few threads to keep the pipe full, and after that
point more threads do no good. This is a general sort of thing that scp (or
your favorite tool) will handle for you. No need to roll your own.

Further, I don't think threading in the client buys you all that much compared 
to bulk updates.  If you load 1000 documents at a time using SolrJ, it will do 
a good job of spreading out the load over the shards.   

If you find it takes a bit of time to build each update request document (with 
no indexing happening meanwhile), then you might prepare these in a background 
thread and place into a request queue.  Thus, the foreground thread is always 
fetching the next request, sending it or waiting for a response.   The 
synchronization cost on a request queue will be negligible. If you find the 
foreground thread is waiting too much, make the batch size bigger. If you 
find the queue length growing too large, put the background thread to sleep 
until the queue length drops down to a reasonable length.   All of this 
complexity may buy you a few % improvement in indexing speed.  Probably not 
worth the development cost ...
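
For what it's worth, a bare-bones sketch of that queue arrangement (queue
capacity, batch size, URL, and the batch source are all stand-ins):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class QueuedIndexer {

      // Hypothetical stand-in for "build each update request document".
      static List<SolrInputDocument> buildNextBatch(int size) {
        return new ArrayList<>(); // an empty list signals "no more input" here
      }

      public static void main(String[] args) throws Exception {
        // Bounded queue: put() blocks when it is full, which is exactly the
        // "put the background thread to sleep" behaviour described above.
        BlockingQueue<List<SolrInputDocument>> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
          try {
            List<SolrInputDocument> batch;
            while (!(batch = buildNextBatch(1000)).isEmpty()) {
              queue.put(batch);
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        producer.start();

        // The foreground thread only fetches the next batch and sends it.
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1")) {
          while (producer.isAlive() || !queue.isEmpty()) {
            List<SolrInputDocument> batch = queue.poll(1, TimeUnit.SECONDS);
            if (batch != null) {
              client.add(batch);
            }
          }
          client.commit();
        }
      }
    }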

-Original Message-
From: Ali Nazemian [mailto:alinazem...@gmail.com] 
Sent: Wednesday, July 22, 2015 2:21 AM
To: solr-user@lucene.apache.org
Subject: Optimizing Solr indexing over WAN

Dears,
Hi,
I know that there are lots of tips about how to make the Solr indexing faster. 
Probably some of the most important ones which are considered in client side 
are choosing batch indexing and multi-thread indexing. There are other 
important factors that are server side which I dont want to mentioned here. 
Anyway my question would be is there any best practice for number of client 
threads and the size of batch available over WAN network?
Since the client and servers are connected over WAN network probably some of 
the performance conditions such as network latency, bandwidth and etc.
are different from LAN network. Another think that is matter for me is the fact 
that document sizes are might be different in diverse scenarios. For example 
when you want to index web-pages the size of document might be from 1KB to 
200KB. In such case choosing batch size according to the number of documents is 
probably not the best way of optimizing index performance.
Probably choosing based on the size of batch size in KB/MB would be better from 
the network point of view. However, from the Solr side document numbers matter.
So if I want to summarize my questions here what am I looking for:
1- Is there any best practice available for Solr client side performance tuning 
over WAN network for the purpose of indexing/reindexing/updating?
Does it different from LAN network?
2- Which one is matter: number of documents or the total size of documents in 
batch?

Best regards.

--
A.Nazemian



Optimizing Solr indexing over WAN

2015-07-21 Thread Ali Nazemian
Dears,
Hi,
I know that there are lots of tips about how to make Solr indexing faster.
Probably the most important client-side ones are batch indexing and
multi-threaded indexing. There are other important server-side factors which I
don't want to mention here. Anyway, my question is: is there any best practice
for the number of client threads and the batch size over a WAN?
Since the client and servers are connected over a WAN, performance conditions
such as network latency and bandwidth differ from a LAN. Another thing that
matters to me is that document sizes may differ across scenarios. For example,
when indexing web pages, a document might be anywhere from 1KB to 200KB. In
such cases, choosing the batch size by document count is probably not the best
way to optimize indexing performance; choosing by batch size in KB/MB would be
better from the network point of view. However, from the Solr side, document
counts matter.
So, to summarize my questions:
1- Is there any best practice for Solr client-side performance tuning over a
WAN for indexing/reindexing/updating? Is it different from a LAN?
2- Which one matters: the number of documents or the total size of the
documents in a batch?

Best regards.

-- 
A.Nazemian


Re: lucene vs Solr Indexing on Sample data

2015-06-15 Thread Erick Erickson
Basically I expect you're falling afoul of a very common misunderstanding;
It's not that Solr is slower, it's that the client isn't feeding Solr
as fast as it
should.

If you profile your Solr server, my suspicion is that you're not
driving it very hard.
You'll probably see 4 spikes in CPU activity, followed by it doing
nothing at all. The
spikes are when you actually send the doclist to Solr.

Your client is creating a 250K document packet, _then_ transmitting it to Solr,
waiting for the response, then creating another packet. While creating a
packet, Solr is doing nothing at all, just waiting.

You'll get better performance by using ConcurrentUpdateSolrClient and
much smaller packets (say 1,000). Give it, say, 10 threads and a queue length
of 10 or so. You'll have to experiment for sure.
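
A minimal sketch of that suggestion (the 5.x-era constructor is assumed --
newer SolrJ versions use a Builder; URL and fields are placeholders):

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ConcurrentIndexing {
      public static void main(String[] args) throws Exception {
        // Queue length 10, 10 sender threads, per the advice above.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient("http://localhost:8983/solr/collection1", 10, 10)) {
          for (int i = 0; i < 250_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "document " + i); // placeholder field
            client.add(doc); // returns quickly; background threads stream the updates
          }
          client.blockUntilFinished(); // drain the queue before committing
          client.commit();
        }
      }
    }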

Now, all that said, since Solr wraps Lucene there is some additional
overhead (Solr has to parse out the doc and pass it on to Lucene, etc.),
so you'll inevitably see some degradation. It shouldn't be as extreme as
you're seeing, though, so I'm pretty sure you'll find your client isn't
written to get the best performance out of Solr.

In future, please don't link questions to another forum. It makes it
less likely that
people will actually respond.

Best,
Erick

On Mon, Jun 15, 2015 at 6:52 AM, Alessandro Benedetti
 wrote:
> Actually I can see a problem in your question…
> Lucene and Solr are not competitor technologies.
> Solr is a Search Server that internally uses the Lucene library and offers
> easy to use configuration and REST API.
> Lucene is a library that implements tons of search algorithms and features.
> You can see Solr as "best practice for Lucene" implemented server.
> It offers out of the box a usable search server with tons of features easy
> to use( take a look to the official site to have an idea) .
>
> On the other hand Lucene is a library, so you can develop with it your
> personal Search Server or Search application.
> More than performance you should really understand if you want to rewrite a
> lot of already implemented search features, or maybe re-use the ones
> developer by Lucene gurus.
>
> Furthermore of course, it depends of the feature you really need for your
> application.
>
> Cheers
>
> 2015-06-15 13:16 GMT+01:00 Argho Chatterjee <
> joy.chatterjee.crazyc...@gmail.com>:
>
>> Hello Everyone,
>>
>> I had posted a question on stackoverflow.com after performing a few POCs
>>
>> My hadrware consist of a single i-3 intel processor (4 CPU as per "dxdiag"
>> on run ), 8GB Ram, Laptop machine.
>>
>> My Question Link :
>>
>> http://stackoverflow.com/questions/30823314/lucene-vs-solr-indexning-speed-for-sampe-data
>>
>> but no one could solve it as of now..
>> I hope the question I posted is undertandable.
>>
>> Please if anyone could help me out with the indexing speed of Solr (way
>> slower) vs Lucene (way faster)..
>>
>> I am trying to build a module for real time indexing and querying, and the
>> traffic is high, POC pass with Lucene for handling High Traffic for
>> Indexing, for Solr It is not able to do so..
>>
>> Again My Machine Spec :
>> HP, intel core i3, 8GB ram, TB HDD.
>>
>> Please let me know if there is a problem with Solr or am I doing anything
>> wrong.
>>
>> Thanks
>> Argho
>>
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England


Re: lucene vs Solr Indexing on Sample data

2015-06-15 Thread Alessandro Benedetti
Actually I can see a problem in your question…
Lucene and Solr are not competitor technologies.
Solr is a search server that internally uses the Lucene library and offers
easy-to-use configuration and a REST API.
Lucene is a library that implements tons of search algorithms and features.
You can see Solr as a server implementing "best practice for Lucene".
It offers, out of the box, a usable search server with tons of easy-to-use
features (take a look at the official site to get an idea).

On the other hand, Lucene is a library, so you can use it to develop your own
search server or search application.
More than performance, you should really decide whether you want to rewrite a
lot of already-implemented search features, or re-use the ones developed by
Lucene gurus.

Furthermore, of course, it depends on the features you really need for your
application.

Cheers

2015-06-15 13:16 GMT+01:00 Argho Chatterjee <
joy.chatterjee.crazyc...@gmail.com>:

> Hello Everyone,
>
> I had posted a question on stackoverflow.com after performing a few POCs
>
> My hadrware consist of a single i-3 intel processor (4 CPU as per "dxdiag"
> on run ), 8GB Ram, Laptop machine.
>
> My Question Link :
>
> http://stackoverflow.com/questions/30823314/lucene-vs-solr-indexning-speed-for-sampe-data
>
> but no one could solve it as of now..
> I hope the question I posted is undertandable.
>
> Please if anyone could help me out with the indexing speed of Solr (way
> slower) vs Lucene (way faster)..
>
> I am trying to build a module for real time indexing and querying, and the
> traffic is high, POC pass with Lucene for handling High Traffic for
> Indexing, for Solr It is not able to do so..
>
> Again My Machine Spec :
> HP, intel core i3, 8GB ram, TB HDD.
>
> Please let me know if there is a problem with Solr or am I doing anything
> wrong.
>
> Thanks
> Argho
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


lucene vs Solr Indexing on Sample data

2015-06-15 Thread Argho Chatterjee
Hello Everyone,

I had posted a question on stackoverflow.com after performing a few POCs.

My hardware consists of a single Intel i3 processor (4 CPUs as per "dxdiag"),
8GB RAM, on a laptop.

My question link:
http://stackoverflow.com/questions/30823314/lucene-vs-solr-indexning-speed-for-sampe-data

but no one has been able to solve it as of now.
I hope the question I posted is understandable.

Please, can anyone help me out with the indexing speed of Solr (way slower)
vs Lucene (way faster)?

I am trying to build a module for real-time indexing and querying, and the
traffic is high. The POC passed with Lucene for handling high-traffic
indexing; Solr was not able to do so.

Again, my machine spec:
HP, Intel Core i3, 8GB RAM, TB HDD.

Please let me know if there is a problem with Solr or if I am doing anything
wrong.

Thanks
Argho


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-04 Thread Swaraj Kumar
I am not sure, but the following regex has worked for me in Java. Kindly
check it against yours.

([^\x01])\x01([^\x01])\x01..([^\x01])$

Thanks,
Swaraj


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-04 Thread Toke Eskildsen
avinash09  wrote:
> Thanks Toke , nice explanation , i have one more concern instead of comma
> separated my columns are ^A separated how to deal ^A ??

I am really not proficient with control characters and regexps. If ^A is Start 
Of Heading, which is ASCII & Unicode character 1, my guess is that \u0001 
matches it.

So something like
regexp="^([^\u0001]*)\u0001([^\u0001]*)\u0001([^\u0001]*)\u0001...$"?
Untested and all.

But why not use the CSV import handler? That seems like the best fit.
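
A hedged SolrJ sketch of that route -- streaming the file to the CSV loader
with the separator overridden (URL, core and file names are placeholders):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class CsvUpload {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
          req.addFile(new File("data.csv"), "text/csv"); // text/csv routes to the CSV loader
          req.setParam("separator", "\u0001"); // Ctrl-A instead of the default comma
          req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
          client.request(req);
        }
      }
    }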

- Toke Eskildsen


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-04 Thread avinash09
Thanks Toke, nice explanation. I have one more concern: instead of
comma-separated, my columns are ^A-separated. How do I deal with ^A?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-from-csv-file-having-28-cols-taking-lot-of-time-plz-help-i-m-new-to-solr-tp4196904p4197607.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-04 Thread Swaraj Kumar
I have used the following and it works very fast in DIH solr-5.0








You can try this for getting groupNames from regex.


Regards,


Swaraj Kumar
Senior Software Engineer I
MakeMyTrip.com
+91-9811774497


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-03 Thread Toke Eskildsen
avinash09  wrote:
> regex="^(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
> (.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)$"

A better solution seems to have been presented, but for the record I would like 
to note that the regexp above is quite an effective performance bomb: For each 
group, the evaluation time roughly doubles. Not a problem for 10 groups, but 
you have 28.

I made a little test and matching a single sample line with 20 groups took 120 
ms/match, 24 groups took 2 seconds and 28 groups took 30 seconds on my machine. 
If you had 50 groups, a single match would take 4 years.

The explanation is that Java regexps are greedy: Every one of your groups 
starts by matching to the end of the line, then a comma is reached in the 
regexp and it backtracks. The solution is fortunately both simple and 
applicable to many other regexps: Make your matches terminate as soon as 
possible.

In this case, instead of having groups with (.*), use ([^,]*) instead, which 
means that each group matches everything, except commas. The combined regexp 
then looks like this:
regex="^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),...([^,]*)$"

The match speed for 28 groups with that regexp was about 0.002ms (average over 
1000 matches).
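
A small self-contained sketch of that measurement, for anyone who wants to
reproduce it (group count and the input line are ad hoc):

    import java.util.regex.Pattern;

    public class RegexBacktracking {
      public static void main(String[] args) {
        int groups = 24;
        StringBuilder line = new StringBuilder("value0");
        StringBuilder greedy = new StringBuilder("^(.*)");
        StringBuilder negated = new StringBuilder("^([^,]*)");
        for (int i = 1; i < groups; i++) {
          line.append(",value").append(i);
          greedy.append(",(.*)");
          negated.append(",([^,]*)");
        }
        time("greedy (.*)", greedy.append("$").toString(), line.toString());
        time("negated ([^,]*)", negated.append("$").toString(), line.toString());
      }

      // Times a single match; the greedy variant backtracks heavily, while the
      // negated-class variant terminates almost immediately.
      static void time(String label, String regex, String input) {
        long start = System.nanoTime();
        boolean matched = Pattern.compile(regex).matcher(input).matches();
        System.out.printf("%-16s matched=%b in %.1f ms%n",
            label, matched, (System.nanoTime() - start) / 1_000_000.0);
      }
    }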

- Toke Eskildsen


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-01 Thread avinash09
Alex,
finally it worked for me. I found the Ctrl-A separator: (separator=%01&escape=\)

Thanks for your help



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-from-csv-file-having-28-cols-taking-lot-of-time-plz-help-i-m-new-to-solr-tp4196904p4197143.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-01 Thread Alexandre Rafalovitch
That's an interesting question. The reference shows you how to set a
separator, but ^A is a special case. You may need to pass it in as a
URL escape character or similar.

But I would first get a sample working with more conventional
separator and then worry about ^A. Just so you are not confusing
several problems.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 April 2015 at 05:05, avinash09  wrote:
> thanks Erick and Alexandre Rafalovitch R
>
> one more doubt how to pass ctrl A(^A) seprator while csv upload
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-indexing-from-csv-file-having-28-cols-taking-lot-of-time-plz-help-i-m-new-to-solr-tp4196904p4196998.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-01 Thread avinash09
Thanks Erick and Alexandre Rafalovitch.

One more doubt: how do I pass the Ctrl-A (^A) separator during CSV upload?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-from-csv-file-having-28-cols-taking-lot-of-time-plz-help-i-m-new-to-solr-tp4196904p4196998.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-01 Thread Erick Erickson
Data Import Handler is a process in Solr that reaches out, grabs
"something external" and indexes it. "Something external" can be a
database, files on the server etc. Along the way, you can do many
transformations of the data. The point is that the source can be
anything.

The update handler is an end-point in Solr that expects certain
specific formats and puts them in the index. For instance, if you
index XML, it _must_ be in a very specific form to throw at the update
handler, something like

  <add>
    <doc>
      <field name="id">1</field>
      <field name="name">some value</field>
    </doc>
  </add>


The csv update handler is just an update handler that expects CSV
files. The headers are usually the field names although you can map
them from the column header in your csv file to your Solr schema.

Importing CSV files should be very fast. I suspect your regex is costly.

As Alexandre says, though, it would be a good idea to go through the
CSV import tutorial. The Solr reference guide has the details:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates

Best,
Erick

On Wed, Apr 1, 2015 at 8:04 AM, avinash09  wrote:
> sir , a silly  question m confuse here what is difference between data import
> handler and update csv
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-indexing-from-csv-file-having-28-cols-taking-lot-of-time-plz-help-i-m-new-to-solr-tp4196904p4196940.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr

2015-04-01 Thread avinash09
Sir, a silly question. I am confused here: what is the difference between the
data import handler and update CSV?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-from-csv-file-having-28-cols-taking-lot-of-time-plz-help-i-m-new-to-solr-tp4196904p4196940.html
Sent from the Solr - User mailing list archive at Nabble.com.

