Re: Data Import Handler (DIH) - Installing and running

2020-12-23 Thread Erick Erickson
Have you done what the message says and looked at your Solr log? If so,
what information is there?

> On Dec 23, 2020, at 5:13 AM, DINSD | SPAutores 
>  wrote:
> 
> Hi,
> 
> I'm trying to install the package "data-import-handler", since it was 
> discontinued from core SolR distro.
> 
> https://github.com/rohitbemax/dataimporthandler
> 
> However, as soon as the first command is carried out
> 
> solr -c -Denable.packages=true
> 
> I get this screen in web interface
> 
> 
> 
> Has anyone been through this, or have any idea why it's happening ?
> 
> Thanks for any help
> Rui Pimentel
> 
> 
> 
> DINSD - Departamento de Informática / SPA Digital
> Av. Duque de Loulé, 31 - 1069-153 Lisboa  PORTUGAL
> T (+ 351) 21 359 44 36 / (+ 351) 21 359 44 00  F (+ 351) 21 353 02 57
>  informat...@spautores.pt
>  www.SPAutores.pt
> 



Re: Data Import Blocker - Solr

2020-12-19 Thread Shawn Heisey

On 12/18/2020 12:03 AM, basel altameme wrote:

While trying to Import & Index data from MySQL DB custom view i am facing the 
error below:
Data Config problem: The value of attribute "query" associated with an element type 
"entity" must not contain the '<' character.
Please note that in my SQL statements i am using '<>' as an operator for 
comparing only.
sample line:
         when (`v`.`live_type_id` <> 1) then 100


These configurations are written in XML.  So you must encode the 
character using XML-friendly notation.


Instead of <> it should say &lt;&gt; to be correct.  Or you could use != 
which is also correct SQL notation for "not equal to".
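
For illustration, a minimal data-config fragment with the operator escaped; the
view and entity names here are only placeholders:

    <entity name="item"
            query="select id, name,
                          case when (`v`.`live_type_id` &lt;&gt; 1) then 100 else 0 end as boost_val
                   from my_view v">
    </entity>

Writing the condition as `v`.`live_type_id` != 1 avoids the escaping question entirely.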


Thanks,
Shawn


Re: Data Import Blocker - Solr

2020-12-18 Thread Erick Erickson
Have you tried escaping that character?

> On Dec 18, 2020, at 2:03 AM, basel altameme  
> wrote:
> 
> Dear,
> While trying to Import & Index data from MySQL DB custom view i am facing the 
> error below:
> Data Config problem: The value of attribute "query" associated with an 
> element type "entity" must not contain the '<' character.
> Please note that in my SQL statements i am using '<>' as an operator for 
> comparing only.
> sample line:
> when (`v`.`live_type_id` <> 1) then 100
> 
> Kindly advice.
> Regards,Basel
> 



Re: data import handler deprecated?

2020-11-30 Thread Dmitri Maziuk

On 11/30/2020 7:50 AM, David Smiley wrote:

Yes, absolutely to what Eric said.  We goofed on news / release highlights
on how to communicate what's happening in Solr.  From a Solr insider point
of view, we are "deprecating" because strictly speaking, the code isn't in
our codebase any longer.  From a user point of view (the audience of news /
release notes), the functionality has *moved*.


Just FYI, there is the dih 8.7.0 jar in 
repo1.maven.org/maven2/org/apache/solr -- whereas the github build is on 
8.6.0.


Dima



Re: data import handler deprecated?

2020-11-30 Thread David Smiley
Yes, absolutely to what Eric said.  We goofed on news / release highlights
on how to communicate what's happening in Solr.  From a Solr insider point
of view, we are "deprecating" because strictly speaking, the code isn't in
our codebase any longer.  From a user point of view (the audience of news /
release notes), the functionality has *moved*.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Nov 30, 2020 at 8:04 AM Eric Pugh 
wrote:

> You don’t need to abandon DIH right now….   You can just use the Github
> hosted version….   The more people who use it, the better a community it
> will form around it!It’s a bit chicken and egg, since no one is
> actively discussing it, submitting PR’s etc, it may languish.   If you use
> it, and test it, and support other community folks using it, then it will
> continue on!
>
>
>
> > On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk 
> wrote:
> >
> > On 11/29/2020 10:32 AM, Erick Erickson wrote:
> >
> >> And I absolutely agree with Walter that the DB is often where
> >> the bottleneck lies. You might be able to
> >> use multiple threads and/or processes to query the
> >> DB if that’s the case and you can find some kind of partition
> >> key.
> >
> > IME the difficult part has always been dealing with incremental updates,
> if we were to roll our own, my vote would be for a database trigger that
> does a POST in whichever language the DBMS likes.
> >
> > But this has not been a part of our "solr 6.5 update" project until now.
> >
> > Thanks everyone,
> > Dima
>
> ___
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
>
>


Re: data import handler deprecated?

2020-11-30 Thread Eric Pugh
You don’t need to abandon DIH right now….   You can just use the Github hosted 
version….   The more people who use it, the better a community it will form 
around it!It’s a bit chicken and egg, since no one is actively discussing 
it, submitting PR’s etc, it may languish.   If you use it, and test it, and 
support other community folks using it, then it will continue on!



> On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk  wrote:
> 
> On 11/29/2020 10:32 AM, Erick Erickson wrote:
> 
>> And I absolutely agree with Walter that the DB is often where
>> the bottleneck lies. You might be able to
>> use multiple threads and/or processes to query the
>> DB if that’s the case and you can find some kind of partition
>> key.
> 
> IME the difficult part has always been dealing with incremental updates, if 
> we were to roll our own, my vote would be for a database trigger that does a 
> POST in whichever language the DBMS likes.
> 
> But this has not been a part of our "solr 6.5 update" project until now.
> 
> Thanks everyone,
> Dima

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com  | 
My Free/Busy   
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 





Re: data import handler deprecated?

2020-11-29 Thread Dmitri Maziuk

On 11/29/2020 10:32 AM, Erick Erickson wrote:


And I absolutely agree with Walter that the DB is often where
the bottleneck lies. You might be able to
use multiple threads and/or processes to query the
DB if that’s the case and you can find some kind of partition
key.


IME the difficult part has always been dealing with incremental updates, 
if we were to roll our own, my vote would be for a database trigger that 
does a POST in whichever language the DBMS likes.


But this has not been a part of our "solr 6.5 update" project until now.

Thanks everyone,
Dima


Re: data import handler deprecated?

2020-11-29 Thread Erick Erickson
If you like Java instead of Python, here’s a skeletal program:

https://lucidworks.com/post/indexing-with-solrj/

It’s simple and single-threaded, but could serve as a basis for
something along the lines that Walter suggests.
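
Not the program from that post, but a rough sketch of what such a skeletal,
single-threaded SolrJ indexer looks like; the JDBC URL, query and field names are
placeholders:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import java.sql.*;

    public class SimpleIndexer {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
                 Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
                 Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery("select id, title from docs")) {
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));       // map DB columns to Solr fields
                    doc.addField("title", rs.getString("title"));
                    solr.add(doc);                                // becomes searchable after the commit below
                }
                solr.commit();
            }
        }
    }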

And I absolutely agree with Walter that the DB is often where
the bottleneck lies. You might be able to
use multiple threads and/or processes to query the
DB if that’s the case and you can find some kind of partition
key.

You also might (and it depends on the Solr version) be able,
to wrap a jdbc stream in an update decorator.

https://lucene.apache.org/solr/guide/8_0/stream-source-reference.html

https://lucene.apache.org/solr/guide/8_0/stream-decorator-reference.html
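
Roughly, such an expression could look like the sketch below; the collection name,
JDBC settings and SQL are placeholders, and the exact syntax should be checked
against the stream reference for your Solr version:

    commit(mycollection,
           update(mycollection, batchSize=500,
                  jdbc(connection="jdbc:mysql://localhost/mydb?user=u&password=p",
                       sql="select id, title from docs order by id",
                       sort="id asc",
                       driver="com.mysql.jdbc.Driver")))

Sent to the /stream handler, this reads rows straight out of the database and pushes
them into the collection in batches.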

Best,
Erick

> On Nov 29, 2020, at 3:04 AM, Walter Underwood  wrote:
> 
> I recommend building an outboard loader, like I did a dozen years ago for
> Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
> program, though it reads from a JSONL file, not a database.
> 
> Run a loop fetching records from a database. Put each record into a 
> synchronized
> (thread-safe) queue. Run multiple worker threads, each pulling records from 
> the
> queue, batching them up, and sending them to Solr. For maximum indexing speed
> (at the expense of query performance), count the number of CPUs per shard 
> leader
> and run two worker threads per CPU.
> 
> Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
> documents, depending on the content.
> 
> With this setup, your database will probably be your bottleneck. I’ve had this
> index a million (small) documents per minute to a multi-shard cluster, from a 
> JSONL
> file on local disk.
> 
> Also, don’t worry about finding the leaders and sending the right document to
> the right shard. I just throw the batches at the load balancer and let Solr 
> figure
> it out. That is super simple and amazingly fast.
> 
> If you are doing big batches, building a dumb ETL system with JSONL files in 
> Amazon S3 has some real advantages. It allows loading prod data into a test
> cluster for load benchmarks, for example. Also good for disaster recovery, 
> just
> load the recent batches from S3. Want to know exactly which documents were
> in the index in October? Look at the batches in S3.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 28, 2020, at 6:23 PM, matthew sporleder  wrote:
>> 
>> I went through the same stages of grief that you are about to start
>> but (luckily?) my core dataset grew some weird cousins and we ended up
>> writing our own indexer to join them all together/do partial
>> updates/other stuff beyond DIH.  It's not difficult to upload docs but
>> is definitely slower so far.  I think there is a bit of a 'clean core'
>> focus going on in solr-land right now and DIH is easy(!) but it's also
>> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
>> etc) so anyway try to be happy that you are aware of it now.
>> 
>> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  
>> wrote:
>>> 
>>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>>> 
>>>> ...  The bottom of
>>>> that github page isn't hopeful however :)
>>> 
>>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>>> JAR" :)
>>> 
>>> It's a more general queston though, what is the path forward for users
>>> who with data in two places? Hope that a community-maintained plugin
>>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>>> roll our own delta-updates logic? Or are we to choose one datastore and
>>> drop the other?
>>> 
>>> Dima
> 



Re: data import handler deprecated?

2020-11-29 Thread Walter Underwood
I recommend building an outboard loader, like I did a dozen years ago for
Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
program, though it reads from a JSONL file, not a database.

Run a loop fetching records from a database. Put each record into a synchronized
(thread-safe) queue. Run multiple worker threads, each pulling records from the
queue, batching them up, and sending them to Solr. For maximum indexing speed
(at the expense of query performance), count the number of CPUs per shard leader
and run two worker threads per CPU.

Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
documents, depending on the content.
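
A hedged sketch of that producer/consumer structure in SolrJ; the URL, batch size and
poison-pill mechanics are placeholders, and the thread count just uses the local CPU
count as a stand-in for counting CPUs per shard leader:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class OutboardLoader {
        private static final String SOLR = "http://load-balancer:8983/solr/mycollection";
        private static final SolrInputDocument POISON = new SolrInputDocument(); // end-of-input marker

        public static void main(String[] args) throws Exception {
            BlockingQueue<SolrInputDocument> queue = new LinkedBlockingQueue<>(10_000);
            int workers = 2 * Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try (HttpSolrClient solr = new HttpSolrClient.Builder(SOLR).build()) {
                        List<SolrInputDocument> batch = new ArrayList<>();
                        while (true) {
                            SolrInputDocument doc = queue.take();
                            if (doc == POISON) break;
                            batch.add(doc);
                            if (batch.size() >= 500) { solr.add(batch); batch.clear(); } // ship a batch
                        }
                        if (!batch.isEmpty()) solr.add(batch); // flush the tail
                    }
                    return null;
                });
            }

            // Producer loop goes here: read records from the database (or a JSONL file),
            // build one SolrInputDocument per record, and queue.put(doc) each of them.

            for (int i = 0; i < workers; i++) queue.put(POISON); // tell every worker to stop
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            try (HttpSolrClient solr = new HttpSolrClient.Builder(SOLR).build()) {
                solr.commit(); // or rely on autoCommit / commitWithin
            }
        }
    }

As described above, the batches go to the load balancer and Solr routes the documents
to the right shards itself.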

With this setup, your database will probably be your bottleneck. I’ve had this
index a million (small) documents per minute to a multi-shard cluster, from a 
JSONL
file on local disk.

Also, don’t worry about finding the leaders and sending the right document to
the right shard. I just throw the batches at the load balancer and let Solr 
figure
it out. That is super simple and amazingly fast.

If you are doing big batches, building a dumb ETL system with JSONL files in 
Amazon S3 has some real advantages. It allows loading prod data into a test
cluster for load benchmarks, for example. Also good for disaster recovery, just
load the recent batches from S3. Want to know exactly which documents were
in the index in October? Look at the batches in S3.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 28, 2020, at 6:23 PM, matthew sporleder  wrote:
> 
> I went through the same stages of grief that you are about to start
> but (luckily?) my core dataset grew some weird cousins and we ended up
> writing our own indexer to join them all together/do partial
> updates/other stuff beyond DIH.  It's not difficult to upload docs but
> is definitely slower so far.  I think there is a bit of a 'clean core'
> focus going on in solr-land right now and DIH is easy(!) but it's also
> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
> etc) so anyway try to be happy that you are aware of it now.
> 
> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  wrote:
>> 
>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>> 
>>> ...  The bottom of
>>> that github page isn't hopeful however :)
>> 
>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>> JAR" :)
>> 
>> It's a more general queston though, what is the path forward for users
>> who with data in two places? Hope that a community-maintained plugin
>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>> roll our own delta-updates logic? Or are we to choose one datastore and
>> drop the other?
>> 
>> Dima



Re: data import handler deprecated?

2020-11-28 Thread matthew sporleder
I went through the same stages of grief that you are about to start
but (luckily?) my core dataset grew some weird cousins and we ended up
writing our own indexer to join them all together/do partial
updates/other stuff beyond DIH.  It's not difficult to upload docs but
is definitely slower so far.  I think there is a bit of a 'clean core'
focus going on in solr-land right now and DIH is easy(!) but it's also
easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
etc) so anyway try to be happy that you are aware of it now.

On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  wrote:
>
> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>
> > ...  The bottom of
> > that github page isn't hopeful however :)
>
> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
> JAR" :)
>
> It's a more general queston though, what is the path forward for users
> who with data in two places? Hope that a community-maintained plugin
> will still be there tomorrow? Dump our tables to CSV (and POST them) and
> roll our own delta-updates logic? Or are we to choose one datastore and
> drop the other?
>
> Dima


Re: data import handler deprecated?

2020-11-28 Thread Dmitri Maziuk

On 11/28/2020 5:48 PM, matthew sporleder wrote:


...  The bottom of
that github page isn't hopeful however :)


Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC 
JAR" :)


It's a more general queston though, what is the path forward for users 
who with data in two places? Hope that a community-maintained plugin 
will still be there tomorrow? Dump our tables to CSV (and POST them) and 
roll our own delta-updates logic? Or are we to choose one datastore and 
drop the other?


Dima


Re: data import handler deprecated?

2020-11-28 Thread matthew sporleder
https://solr.cool/#utilities -> https://github.com/rohitbemax/dataimporthandler

You can import it in the many new/novel ways to add things to a solr
install and it should work like always (apparently).  The bottom of
that github page isn't hopeful however :)

On Sat, Nov 28, 2020 at 5:21 PM Dmitri Maziuk  wrote:
>
> Hi all,
>
> trying to set up solr-8.7.0, contrib/dataimporthandler/README.txt says
> this module is deprecated as of 8.6 and scheduled for removal in 9.0.
>
> How do we pull data out of our relational database in 8.7+?
>
> TIA
> Dima


Re: Data Import Handler - Concurrent Entity Importing

2020-05-13 Thread ART GALLERY
check out the videos on this website TROO.TUBE don't be such a
sheep/zombie/loser/NPC. Much love!
https://troo.tube/videos/watch/aaa64864-52ee-4201-922f-41300032f219

On Tue, May 5, 2020 at 1:58 PM Mikhail Khludnev  wrote:
>
> Hello, James.
>
> DataImportHandler has a lock preventing concurrent execution. If you need
> to run several imports in parallel at the same core, you need to duplicate
> "/dataimport" handlers definition in solrconfig.xml. Thus, you can run them
> in parallel. Regarding schema, I prefer the latter but mileage may vary.
>
> --
> Mikhail.
>
> On Tue, May 5, 2020 at 6:39 PM James Greene 
> wrote:
>
> > Hello, I'm new to the group here so please excuse me if I do not have the
> > etiquette down yet.
> >
> > Is it possible to have multiple entities (customer configurable, up to 40
> > atm) in a DIH configuration to be imported at once?  Right now I have
> > multiple root entities in my configuration but they get indexes
> > sequentially and this means the entities that are last are always delayed
> > hitting the index.
> >
> > I'm trying to migrate an existing setup (solr 6.6) that utilizes a
> > different collection for each "entity type" into a single collection (solr
> > 8.4) to get around some of the hurdles faced when needing to have searches
> > that require multiple block joins and currently does not work going cross
> > core.
> >
> > I'm also wondering if it is better to fully qualify a field name or use two
> > different fields for performing the "same" search.  i.e:
> >
> >
> > {
> > type_A_status; Active
> > type_A_value: Test
> > }
> > vs
> > {
> > type: A
> > status: Active
> > value: Test
> > }
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev


Re: Data Import Handler - Concurrent Entity Importing

2020-05-05 Thread Mikhail Khludnev
Hello, James.

DataImportHandler has a lock preventing concurrent execution. If you need
to run several imports in parallel at the same core, you need to duplicate
"/dataimport" handlers definition in solrconfig.xml. Thus, you can run them
in parallel. Regarding schema, I prefer the latter but mileage may vary.
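
For illustration, that duplication might look roughly like this in solrconfig.xml; the
handler names and config file names are arbitrary:

    <requestHandler name="/dataimport-a"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config-a.xml</str>
      </lst>
    </requestHandler>

    <requestHandler name="/dataimport-b"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config-b.xml</str>
      </lst>
    </requestHandler>

Each handler can then be started independently, e.g.
/solr/corename/dataimport-a?command=full-import and /solr/corename/dataimport-b?command=full-import.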

--
Mikhail.

On Tue, May 5, 2020 at 6:39 PM James Greene 
wrote:

> Hello, I'm new to the group here so please excuse me if I do not have the
> etiquette down yet.
>
> Is it possible to have multiple entities (customer configurable, up to 40
> atm) in a DIH configuration to be imported at once?  Right now I have
> multiple root entities in my configuration but they get indexes
> sequentially and this means the entities that are last are always delayed
> hitting the index.
>
> I'm trying to migrate an existing setup (solr 6.6) that utilizes a
> different collection for each "entity type" into a single collection (solr
> 8.4) to get around some of the hurdles faced when needing to have searches
> that require multiple block joins and currently does not work going cross
> core.
>
> I'm also wondering if it is better to fully qualify a field name or use two
> different fields for performing the "same" search.  i.e:
>
>
> {
> type_A_status; Active
> type_A_value: Test
> }
> vs
> {
> type: A
> status: Active
> value: Test
> }
>


-- 
Sincerely yours
Mikhail Khludnev


Re: data-import-handler for solr-7.5.0

2018-10-02 Thread Alexandre Rafalovitch
Ok, so then you can switch to debug mode and keep trying to figure it
out. Also try BinFileDataSource or URLDataSource, maybe it will have
an easier way.

Or using relative path (example:
https://github.com/arafalov/solr-apachecon2018-presentation/blob/master/configsets/pets-final/pets-data-config.xml).

Regards,
   Alex.
On Tue, 2 Oct 2018 at 12:46, Martin Frank Hansen (MHQ)  wrote:
>
> Thanks for the info, the UI looks interesting... It does read the data-config 
> correctly, so the problem is probably in this file.
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -Oprindelig meddelelse-
> Fra: Alexandre Rafalovitch 
> Sendt: 2. oktober 2018 18:18
> Til: solr-user 
> Emne: Re: data-import-handler for solr-7.5.0
>
> Admin UI for DIH will show you the config file read. So, if nothing is there, 
> the path is most likely the issue
>
> You can also provide or update the configuration right in UI if you enable 
> debug.
>
> Finally, the config file is reread on every invocation, so you don't need to 
> restart the core after changing it.
>
> Hope this helps,
>Alex.
> On Tue, 2 Oct 2018 at 11:45, Jan Høydahl  wrote:
> >
> > > url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> >
> > Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > > 2. okt. 2018 kl. 17:15 skrev Martin Frank Hansen (MHQ) :
> > >
> > > Hi,
> > >
> > > I am having some problems getting the data-import-handler in Solr to 
> > > work. I have tried a lot of things but I simply get no response from 
> > > Solr, not even an error.
> > >
> > > When calling the API:
> > > http://localhost:8983/solr/nh/dataimport?command=full-import
> > > {
> > >  "responseHeader":{
> > >"status":0,
> > >"QTime":38},
> > >  "initArgs":[
> > >"defaults",[
> > >  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
> > >  "command":"full-import",
> > >  "status":"idle",
> > >  "importResponse":"",
> > >  "statusMessages":{}}
> > >
> > > The data looks like this:
> > >
> > > 
> > >  
> > > 2165432
> > > 5  
> > >
> > >  
> > > 28548113
> > > 89   
> > >
> > >
> > > The data-config file looks like this:
> > >
> > > <dataConfig>
> > >   <dataSource ... />
> > >   <document>
> > >     <entity name="xml"
> > >             pk="id"
> > >             processor="XPathEntityProcessor"
> > >             stream="true"
> > >             forEach="/journal/doc"
> > >             url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> > >             transformer="RegexTransformer,TemplateTransformer"
> > >             >
> > >       ...
> > >     </entity>
> > >   </document>
> > > </dataConfig>
> > >
> > > And I referenced the jar files in the solr-config.xml as well as adding 
> > > the request-handler by adding the following lines:
> > >
> > > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />
> > > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-extras-\d.*\.jar" />
> > >
> > > <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
> > >   <lst name="defaults">
> > >     <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
> > >   </lst>
> > > </requestHandler>
> > >
> > > I am running a core residing in the folder 
> > > “C:/Users/z6mhq/Desktop/nh/nh/conf” while the Solr installation is in 
> > > “C:/Users/z6mhq/Documents/solr-7.5.0”.
> > >
> > > I really hope that someone can spot my mistake…
> > >
> > > Thanks in advance.
> > >
> > > Martin Frank Hansen
> > >
> > >

Re: data-import-handler for solr-7.5.0

2018-10-02 Thread Alexandre Rafalovitch
Admin UI for DIH will show you the config file read. So, if nothing is
there, the path is most likely the issue

You can also provide or update the configuration right in UI if you
enable debug.

Finally, the config file is reread on every invocation, so you don't
need to restart the core after changing it.

Hope this helps,
   Alex.
On Tue, 2 Oct 2018 at 11:45, Jan Høydahl  wrote:
>
> > url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
>
> Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 2. okt. 2018 kl. 17:15 skrev Martin Frank Hansen (MHQ) :
> >
> > Hi,
> >
> > I am having some problems getting the data-import-handler in Solr to work. 
> > I have tried a lot of things but I simply get no response from Solr, not 
> > even an error.
> >
> > When calling the API: 
> > http://localhost:8983/solr/nh/dataimport?command=full-import
> > {
> >  "responseHeader":{
> >"status":0,
> >"QTime":38},
> >  "initArgs":[
> >"defaults",[
> >  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
> >  "command":"full-import",
> >  "status":"idle",
> >  "importResponse":"",
> >  "statusMessages":{}}
> >
> > The data looks like this:
> >
> > 
> >  
> > 2165432
> > 5
> >  
> >
> >  
> > 28548113
> > 89
> >  
> > 
> >
> >
> > The data-config file looks like this:
> >
> > <dataConfig>
> >   <dataSource ... />
> >   <document>
> >     <entity name="xml"
> >             pk="id"
> >             processor="XPathEntityProcessor"
> >             stream="true"
> >             forEach="/journal/doc"
> >             url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> >             transformer="RegexTransformer,TemplateTransformer"
> >             >
> >       ...
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > And I referenced the jar files in the solr-config.xml as well as adding the 
> > request-handler by adding the following lines:
> >
> > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />
> > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-extras-\d.*\.jar" />
> >
> > <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
> >   <lst name="defaults">
> >     <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
> >   </lst>
> > </requestHandler>
> >
> > I am running a core residing in the folder 
> > “C:/Users/z6mhq/Desktop/nh/nh/conf” while the Solr installation is in 
> > “C:/Users/z6mhq/Documents/solr-7.5.0”.
> >
> > I really hope that someone can spot my mistake…
> >
> > Thanks in advance.
> >
> > Martin Frank Hansen
> >
> >
>


Re: data-import-handler for solr-7.5.0

2018-10-02 Thread Jan Høydahl
> url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"

Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 2. okt. 2018 kl. 17:15 skrev Martin Frank Hansen (MHQ) :
> 
> Hi,
> 
> I am having some problems getting the data-import-handler in Solr to work. I 
> have tried a lot of things but I simply get no response from Solr, not even 
> an error.
> 
> When calling the API: 
> http://localhost:8983/solr/nh/dataimport?command=full-import
> {
>  "responseHeader":{
>"status":0,
>"QTime":38},
>  "initArgs":[
>"defaults",[
>  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
>  "command":"full-import",
>  "status":"idle",
>  "importResponse":"",
>  "statusMessages":{}}
> 
> The data looks like this:
> 
> 
>  
> 2165432
> 5
>  
> 
>  
> 28548113
> 89
>  
> 
> 
> 
> The data-config file looks like this:
> 
> <dataConfig>
>   <dataSource ... />
>   <document>
>     <entity name="xml"
>             pk="id"
>             processor="XPathEntityProcessor"
>             stream="true"
>             forEach="/journal/doc"
>             url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
>             transformer="RegexTransformer,TemplateTransformer"
>             >
>       ...
>     </entity>
>   </document>
> </dataConfig>
> 
> And I referenced the jar files in the solr-config.xml as well as adding the 
> request-handler by adding the following lines:
> 
> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />
> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-extras-\d.*\.jar" />
>
> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
>   </lst>
> </requestHandler>
> 
> I am running a core residing in the folder 
> “C:/Users/z6mhq/Desktop/nh/nh/conf” while the Solr installation is in 
> “C:/Users/z6mhq/Documents/solr-7.5.0”.
> 
> I really hope that someone can spot my mistake…
> 
> Thanks in advance.
> 
> Martin Frank Hansen
> 
> 



Re: Data Import Handler with Solr Source behind Load Balancer

2018-09-14 Thread Emir Arnautović
Hi Thomas,
Is this SolrCloud or Solr master-slave? Do you update index while indexing? Did 
you check if all your instances behind LB are in sync if you are using 
master-slave?
My guess would be that DIH is using cursors to read data from another Solr. If 
you are using multiple Solr instances behind LB there might be some diffs in 
index that results in different documents being returned for the same cursor 
mark. Is num doc and max doc the same on new instance after import?

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Sep 2018, at 05:53, Zimmermann, Thomas  
> wrote:
> 
> We have a Solr v7 Instance sourcing data from a Data Import Handler with a 
> Solr data source running Solr v4. When it hits a single server in that 
> instance directly, all documents are read and written correctly to the v7. 
> When we hit the load balancer DNS entry, the resulting data import handler 
> json states that it read all the documents and skipped none, and all looks 
> fine, but the result set is missing ~20% of the documents in the v7 core. 
> This has happened multiple time on multiple environments.
> 
> Any thoughts on whether this might be a bug in the underlying DIH code? I'll 
> also pass it along to the server admins on our side for input.



Re: Data Import from Command Line

2018-08-20 Thread Adam Blank
Thank you both for the responses. I was able to get the import working
through telnet, and I'll see if I can get the post utility working as that
seems like a better option.

Thanks,
Adam

On Mon, Aug 20, 2018, 2:04 PM Alexandre Rafalovitch 
wrote:

> Admin UI just hits Solr for a particular URL with specific parameters.
> You could totally call it from the command line, but it _would_ need
> to be an HTTP client of some sort. You could encode all of the
> parameters into the DIH (or a new) handler, it is all defined in
> solrconfig.xml (/dataimport is the default one).
>
> If you don't have curl, maybe you have wget? Or lynx? Or, just for
> giggles, you could Telnet into port 80 and manually type the required
> command (
> http://blog.tonycode.com/tech-stuff/http-notes/making-http-requests-via-telnet/
> ):
> GET /dataimport?param=value HTTP/1.0
>
> Regards,
>Alex.
> P.s. And yes, maybe bin/post could be used as well. Or the previous
> direct java invocation of the posttool jar. May need to massage the
> parameters a bit though.
>
> On 20 August 2018 at 13:45, Adam Blank  wrote:
> > Hi,
> >
> > I'm running Solr 5.5.0 on AIX, and I'm wondering if there's a way to
> import
> > the index from the command line instead of using the admin console?  I
> > don't have the ability to use a HTTP client such as cURL to connect to
> the
> > console.
> >
> > Thank you,
> > Adam
>


Re: Data Import from Command Line

2018-08-20 Thread Alexandre Rafalovitch
Admin UI just hits Solr for a particular URL with specific parameters.
You could totally call it from the command line, but it _would_ need
to be an HTTP client of some sort. You could encode all of the
parameters into the DIH (or a new) handler, it is all defined in
solrconfig.xml (/dataimport is the default one).

If you don't have curl, maybe you have wget? Or lynx? Or, just for
giggles, you could Telnet into port 80 and manually type the required
command 
(http://blog.tonycode.com/tech-stuff/http-notes/making-http-requests-via-telnet/):
GET /dataimport?param=value HTTP/1.0
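
For example (host, port, core name and parameters are placeholders), a DIH full import
could be kicked off the same way after "telnet localhost 8983":

    GET /solr/mycore/dataimport?command=full-import HTTP/1.0

followed by an empty line to send the request.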

Regards,
   Alex.
P.s. And yes, maybe bin/post could be used as well. Or the previous
direct java invocation of the posttool jar. May need to massage the
parameters a bit though.

On 20 August 2018 at 13:45, Adam Blank  wrote:
> Hi,
>
> I'm running Solr 5.5.0 on AIX, and I'm wondering if there's a way to import
> the index from the command line instead of using the admin console?  I
> don't have the ability to use a HTTP client such as cURL to connect to the
> console.
>
> Thank you,
> Adam


Re: Data Import from Command Line

2018-08-20 Thread Christopher Schultz

Adam,

On 8/20/18 1:45 PM, Adam Blank wrote:
> I'm running Solr 5.5.0 on AIX, and I'm wondering if there's a way
> to import the index from the command line instead of using the
> admin console?  I don't have the ability to use a HTTP client such
> as cURL to connect to the console.

I'm not sure when it was added, but there is a program called "post"
which comes with later versions of Solr that can be used to load data
into an index.

-chris


Re: Data import batch mode for delta

2018-04-17 Thread Shawn Heisey

On 4/16/2018 7:32 PM, gadelkareem wrote:

I cannot complain cuz it actually worked well for me so far but..

I still do not understand if Solr already paginates the results from the
full import, why not do the same for the delta. It is almost the same query:
`select id from t where t.lastmod > ${solrTime}`
`select * from t where id IN ${dataimporter.ids} limit 1000 offset 0`
and so on..


Solr does not paginate SQL queries made by the dataimport handler 
(DIH).  It sends the query exactly as it is configured in the DIH config.


Thanks,
Shawn



Re: Data import batch mode for delta

2018-04-16 Thread gadelkareem
Thanks Shawn.

I cannot complain cuz it actually worked well for me so far but..

I still do not understand if Solr already paginates the results from the
full import, why not do the same for the delta. It is almost the same query:
`select id from t where t.lastmod > ${solrTime}`
`select * from t where id IN ${dataimporter.ids} limit 1000 offset 0` 
and so on..



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Data import batch mode for delta

2018-04-05 Thread Shawn Heisey

On 4/5/2018 7:31 PM, gadelkareem wrote:

Why the deltaImportQuery uses "where id='${dataimporter.id}'" instead of
something like where id IN ('${dataimporter.id})'


Because there's only one value for that property.

If the deltaQuery returns a million rows, then deltaImportQuery is going 
to be executed a million times.  Once for each row returned by the 
deltaQuery.


That IS as inefficient as it sounds.  Think of the dataimport handler as 
a stop-gap solution -- to help you get started with loading data from a 
database, until you can write a proper application to do your indexing.
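
For reference, a hedged sketch of how the two queries usually sit together in a DIH
entity; the table and column names are placeholders:

    <entity name="t" pk="id"
            query="select * from t"
            deltaQuery="select id from t where lastmod &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="select * from t where id='${dataimporter.delta.id}'"/>

The deltaQuery produces the list of changed ids, and the deltaImportQuery then runs once
per id, exactly as described above.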


Thanks,
Shawn



RE: data import class not found

2017-08-31 Thread Steve Pruitt
I just tried putting the solr-dataimporthandler-6.6.0.jar in server/solr/lib 
and I got past the problem.  I still don't understand why not found in /dist

-Original Message-
From: Steve Pruitt [mailto:bpru...@opentext.com] 
Sent: Thursday, August 31, 2017 3:05 PM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] - data import class not found

I still can't understand how Solr establishes the classpath.

I have a custom entity processor that subclasses EntityProcessorBase.  When I 
execute the /dataimport call I get

java.lang.NoClassDefFoundError: 
org/apache/solr/handler/dataimport/EntityProcessorBase

no matter how I state in solrconfig.xml to locate the solr-dataimporthandler 
jar.

I have  tried:

from the existing libs in solrconfig.xml 

from the Ref Guide


try anything


But, I always get the class not found error.  The DataImportHandler class is 
found when Solr starts, since EntityProcessorBase is in the same jar why is it 
not found.

I have not tried putting in the core's lib thinking the above should work.  Of 
course, the 3rd choice is only an experiment.


Thanks.

-S


Re: Data Import

2017-03-17 Thread Mike Thomsen
If Solr is down, then adding through SolrJ would fail as well. Kafka's new
API has some great features for this sort of thing. The new client API is
designed to be run in a long-running loop where you poll for new messages
with a certain amount of defined timeout (ex: consumer.poll(1000) for 1s)
So if Solr becomes unstable or goes down, it's easy to have the consumer
just stop and either wait until Solr comes back up or save the data to
disk/commit the Kafka offsets to ZK and stop running.
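
A hedged sketch of that poll loop with the newer consumer API; the topic, collection URL
and field names are placeholders, and the record key is assumed to be usable as the Solr id:

    import org.apache.kafka.clients.consumer.*;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import java.time.Duration;
    import java.util.*;

    public class KafkaToSolr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "solr-indexer");
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                consumer.subscribe(Collections.singletonList("updates"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (ConsumerRecord<String, String> r : records) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", r.key());
                        doc.addField("body_txt", r.value());
                        batch.add(doc);
                    }
                    if (!batch.isEmpty()) {
                        solr.add(batch);       // if Solr is down this throws and offsets stay uncommitted
                        consumer.commitSync(); // only advance Kafka offsets once Solr accepted the batch
                    }
                }
            }
        }
    }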

On Fri, Mar 17, 2017 at 1:24 PM, OTH  wrote:

> Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)
>
> @Wunder:
> I'm assuming, that updating to Solr would fail if Solr is unavailable not
> just if posting via say a DB trigger, but probably also if trying to post
> through SolrJ?  (Which is what I'm using for now.)  So, even if using
> SolrJ, it would be a good idea to use a queuing software?
>
> Thanks
>
> On Fri, Mar 17, 2017 at 10:12 PM, vishal jain  wrote:
>
> > Streaming the data through kafka would be a good option if near real time
> > data indexing is the key requirement.
> > In our application the RDBMS data is populated by an ETL job periodically
> > so we don't need real time data indexing for now.
> >
> > Cheers,
> > Vishal
> >
> > On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> > > Or set a trigger on your RDBMS's main table to put the relevant
> > > information in a different table (call it EVENTS) and have your SolrJ
> > > consult the EVENTS table periodically. Essentially you're using the
> > > EVENTS table as a queue where the trigger is the producer and the
> > > SolrJ program is the consumer.
> > >
> > > It's a polling solution though, so not event-driven. There's no
> > > mechanism that I know of have, say, your RDBMS push an event to DIH
> > > for instance.
> > >
> > > Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> > > for this kind of problem..
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
> > >  wrote:
> > > > One assumes by hooking into the same code that updates RDBMS, as
> > > > opposed to be reverse engineering the changes from looking at the DB
> > > > content. This would be especially the case for Delete changes.
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > http://www.solr-start.com/ - Resources for Solr users, new and
> > > experienced
> > > >
> > > >
> > > > On 17 March 2017 at 11:37, OTH  wrote:
> > > >>>
> > > >>> Also, solrj is good when you want your RDBMS updates make
> immediately
> > > >>> available in solr.
> > > >>
> > > >> How can SolrJ be used to make RDBMS updates immediately available?
> > > >> Thanks
> > > >>
> > > >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> > > sujaybawas...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi Vishal,
> > > >>>
> > > >>> As per my experience DIH is the best for RDBMS to solr index. DIH
> > with
> > > >>> caching has best performance. DIH nested entities allow you to
> define
> > > >>> simple queries.
> > > >>> Also, solrj is good when you want your RDBMS updates make
> immediately
> > > >>> available in solr. DIH full import can be used for index all data
> > first
> > > >>> time or restore index in case index is corrupted.
> > > >>>
> > > >>> Thanks,
> > > >>> Sujay
> > > >>>
> > > >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> > > wrote:
> > > >>>
> > > >>> > Hi,
> > > >>> >
> > > >>> >
> > > >>> > I am new to Solr and am trying to move data from my RDBMS to
> Solr.
> > I
> > > know
> > > >>> > the available options are:
> > > >>> > 1) Post Tool
> > > >>> > 2) DIH
> > > >>> > 3) SolrJ (as ours is a J2EE application).
> > > >>> >
> > > >>> > I want to know what is the recommended way for Data import in
> > > production
> > > >>> > environment.
> > > >>> > Will sending data via SolrJ in batches be faster than posting a
> csv
> > > using
> > > >>> > POST tool?
> > > >>> >
> > > >>> >
> > > >>> > Thanks,
> > > >>> > Vishal
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Thanks,
> > > >>> Sujay P Bawaskar
> > > >>> M:+91-77091 53669
> > > >>>
> > >
> >
>


Re: Data Import

2017-03-17 Thread OTH
Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)

@Wunder:
I'm assuming, that updating to Solr would fail if Solr is unavailable not
just if posting via say a DB trigger, but probably also if trying to post
through SolrJ?  (Which is what I'm using for now.)  So, even if using
SolrJ, it would be a good idea to use a queuing software?

Thanks

On Fri, Mar 17, 2017 at 10:12 PM, vishal jain  wrote:

> Streaming the data through kafka would be a good option if near real time
> data indexing is the key requirement.
> In our application the RDBMS data is populated by an ETL job periodically
> so we don't need real time data indexing for now.
>
> Cheers,
> Vishal
>
> On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson 
> wrote:
>
> > Or set a trigger on your RDBMS's main table to put the relevant
> > information in a different table (call it EVENTS) and have your SolrJ
> > consult the EVENTS table periodically. Essentially you're using the
> > EVENTS table as a queue where the trigger is the producer and the
> > SolrJ program is the consumer.
> >
> > It's a polling solution though, so not event-driven. There's no
> > mechanism that I know of have, say, your RDBMS push an event to DIH
> > for instance.
> >
> > Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> > for this kind of problem..
> >
> > Best,
> > Erick
> >
> > On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
> >  wrote:
> > > One assumes by hooking into the same code that updates RDBMS, as
> > > opposed to be reverse engineering the changes from looking at the DB
> > > content. This would be especially the case for Delete changes.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > http://www.solr-start.com/ - Resources for Solr users, new and
> > experienced
> > >
> > >
> > > On 17 March 2017 at 11:37, OTH  wrote:
> > >>>
> > >>> Also, solrj is good when you want your RDBMS updates make immediately
> > >>> available in solr.
> > >>
> > >> How can SolrJ be used to make RDBMS updates immediately available?
> > >> Thanks
> > >>
> > >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> > sujaybawas...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Vishal,
> > >>>
> > >>> As per my experience DIH is the best for RDBMS to solr index. DIH
> with
> > >>> caching has best performance. DIH nested entities allow you to define
> > >>> simple queries.
> > >>> Also, solrj is good when you want your RDBMS updates make immediately
> > >>> available in solr. DIH full import can be used for index all data
> first
> > >>> time or restore index in case index is corrupted.
> > >>>
> > >>> Thanks,
> > >>> Sujay
> > >>>
> > >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> > wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> >
> > >>> > I am new to Solr and am trying to move data from my RDBMS to Solr.
> I
> > know
> > >>> > the available options are:
> > >>> > 1) Post Tool
> > >>> > 2) DIH
> > >>> > 3) SolrJ (as ours is a J2EE application).
> > >>> >
> > >>> > I want to know what is the recommended way for Data import in
> > production
> > >>> > environment.
> > >>> > Will sending data via SolrJ in batches be faster than posting a csv
> > using
> > >>> > POST tool?
> > >>> >
> > >>> >
> > >>> > Thanks,
> > >>> > Vishal
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Thanks,
> > >>> Sujay P Bawaskar
> > >>> M:+91-77091 53669
> > >>>
> >
>


RE: Data Import

2017-03-17 Thread Liu, Daphne
NO, I use the free version. I have the driver from someone else. I can share it 
if you want to use Cassandra.
They have modified it for me since the free JDBC driver I found will timeout 
when the document is greater than 16mb.

Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
daphne@cevalogistics.com



-Original Message-
From: vishal jain [mailto:jain02...@gmail.com]
Sent: Friday, March 17, 2017 12:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Data Import

Hi Daphne,

Are you using DSE?


Thanks & Regards,
Vishal

On Fri, Mar 17, 2017 at 7:40 PM, Liu, Daphne <daphne@cevalogistics.com>
wrote:

> I just want to share my recent project. I have successfully sent all
> our EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import
> JDBC Cassandra connector indexing our documents.
> Since Cassandra is so fast for writing, compression rate is around 13%
> and all my documents can be kept in my Cassandra clusters' memory, we
> are very happy with the result.
>
>
> Kind regards,
>
> Daphne Liu
> BI Architect - Matrix SCM
>
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL
> 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 /
> daphne@cevalogistics.com
>
>
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, March 17, 2017 9:54 AM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Data Import
>
> I feel DIH is much better for prototyping, even though people do use
> it in production. If you do want to use DIH, you may benefit from
> reviewing the DIH-DB example I am currently rewriting in
> https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
> luceneMatchVersion in solrconfig.xml first).
>
> CSV, etc, could be useful if you want to keep history of past imports,
> again useful during development, as you evolve schema.
>
> SolrJ may actually be easiest/best for production since you already
> have Java stack.
>
> The choice is yours in the end.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>
>
> On 17 March 2017 at 08:56, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 3/17/2017 3:04 AM, vishal jain wrote:
> >> I am new to Solr and am trying to move data from my RDBMS to Solr.
> >> I
> know the available options are:
> >> 1) Post Tool
> >> 2) DIH
> >> 3) SolrJ (as ours is a J2EE application).
> >>
> >> I want to know what is the recommended way for Data import in
> >> production environment. Will sending data via SolrJ in batches be
> faster than posting a csv using POST tool?
> >
> > I've heard that CSV import runs EXTREMELY fast, but I have never
> > tested it.  The same threading problem that I discuss below would
> > apply to indexing this way.
> >
> > DIH is extremely powerful, but it has one glaring problem:  It's
> > single-threaded, which means that only one stream of data is going
> > into Solr, and each batch of documents to be inserted must wait for
> > the previous one to finish inserting before it can start.  I do not
> > know if DIH batches documents or sends them in one at a time.  If
> > you have a manually sharded index, you can run DIH on each shard in
> > parallel, but each one will be single-threaded.  That single thread
> > is pretty efficient, but it's still only one thread.
> >
> > Sending multiple index updates to Solr in parallel (multi-threading)
> > is how you radically speed up the Solr part of indexing.  This is
> > usually done with a custom indexing program, which might be written
> > with SolrJ or even in a completely different language.
> >
> > One thing to keep in mind with ANY indexing method:  Once the
> > situation is examined closely, most people find that it's not Solr
> > that makes their indexing slow.  The bottleneck is usually the
> > source system -- how quickly the data can be retrieved.  It usually
> > takes a lot longer to obtain the data than it does for Solr to index it.
> >
> > Thanks,
> > Shawn
> >

Re: Data Import

2017-03-17 Thread vishal jain
Streaming the data through kafka would be a good option if near real time
data indexing is the key requirement.
In our application the RDBMS data is populated by an ETL job periodically
so we don't need real time data indexing for now.

Cheers,
Vishal

On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson 
wrote:

> Or set a trigger on your RDBMS's main table to put the relevant
> information in a different table (call it EVENTS) and have your SolrJ
> consult the EVENTS table periodically. Essentially you're using the
> EVENTS table as a queue where the trigger is the producer and the
> SolrJ program is the consumer.
>
> It's a polling solution though, so not event-driven. There's no
> mechanism that I know of have, say, your RDBMS push an event to DIH
> for instance.
>
> Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> for this kind of problem..
>
> Best,
> Erick
>
> On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
>  wrote:
> > One assumes by hooking into the same code that updates RDBMS, as
> > opposed to be reverse engineering the changes from looking at the DB
> > content. This would be especially the case for Delete changes.
> >
> > Regards,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >
> >
> > On 17 March 2017 at 11:37, OTH  wrote:
> >>>
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr.
> >>
> >> How can SolrJ be used to make RDBMS updates immediately available?
> >> Thanks
> >>
> >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> sujaybawas...@gmail.com>
> >> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As per my experience DIH is the best for RDBMS to solr index. DIH with
> >>> caching has best performance. DIH nested entities allow you to define
> >>> simple queries.
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr. DIH full import can be used for index all data first
> >>> time or restore index in case index is corrupted.
> >>>
> >>> Thanks,
> >>> Sujay
> >>>
> >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> >
> >>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know
> >>> > the available options are:
> >>> > 1) Post Tool
> >>> > 2) DIH
> >>> > 3) SolrJ (as ours is a J2EE application).
> >>> >
> >>> > I want to know what is the recommended way for Data import in
> production
> >>> > environment.
> >>> > Will sending data via SolrJ in batches be faster than posting a csv
> using
> >>> > POST tool?
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Vishal
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Sujay P Bawaskar
> >>> M:+91-77091 53669
> >>>
>


Re: Data Import

2017-03-17 Thread Walter Underwood
That fails if Solr is not available.

To avoid dropping updates, you need some kind of persistent queue. We use 
Amazon SQS for our incremental updates.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 17, 2017, at 10:09 AM, OTH  wrote:
> 
> Could the database trigger not just post the change to solr?
> 
> On Fri, Mar 17, 2017 at 10:00 PM, Erick Erickson 
> wrote:
> 
>> Or set a trigger on your RDBMS's main table to put the relevant
>> information in a different table (call it EVENTS) and have your SolrJ
>> consult the EVENTS table periodically. Essentially you're using the
>> EVENTS table as a queue where the trigger is the producer and the
>> SolrJ program is the consumer.
>> 
>> It's a polling solution though, so not event-driven. There's no
>> mechanism that I know of have, say, your RDBMS push an event to DIH
>> for instance.
>> 
>> Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
>> for this kind of problem..
>> 
>> Best,
>> Erick
>> 
>> On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
>>  wrote:
>>> One assumes by hooking into the same code that updates RDBMS, as
>>> opposed to be reverse engineering the changes from looking at the DB
>>> content. This would be especially the case for Delete changes.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>> 
>>> 
>>> On 17 March 2017 at 11:37, OTH  wrote:
> 
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr.
 
 How can SolrJ be used to make RDBMS updates immediately available?
 Thanks
 
 On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
>> sujaybawas...@gmail.com>
 wrote:
 
> Hi Vishal,
> 
> As per my experience DIH is the best for RDBMS to solr index. DIH with
> caching has best performance. DIH nested entities allow you to define
> simple queries.
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr. DIH full import can be used for index all data first
> time or restore index in case index is corrupted.
> 
> Thanks,
> Sujay
> 
> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
>> wrote:
> 
>> Hi,
>> 
>> 
>> I am new to Solr and am trying to move data from my RDBMS to Solr. I
>> know
>> the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>> 
>> I want to know what is the recommended way for Data import in
>> production
>> environment.
>> Will sending data via SolrJ in batches be faster than posting a csv
>> using
>> POST tool?
>> 
>> 
>> Thanks,
>> Vishal
>> 
> 
> 
> 
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669
> 
>> 



Re: Data Import

2017-03-17 Thread vishal jain
Hi Daphne,

Are you using DSE?


Thanks & Regards,
Vishal

On Fri, Mar 17, 2017 at 7:40 PM, Liu, Daphne <daphne@cevalogistics.com>
wrote:

> I just want to share my recent project. I have successfully sent all our
> EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import JDBC
> Cassandra connector indexing our documents.
> Since Cassandra is so fast for writing, the compression rate is around 13% and
> all my documents can be kept in my Cassandra clusters' memory, so we are very
> happy with the result.
>
>
> Kind regards,
>
> Daphne Liu
> BI Architect - Matrix SCM
>
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL
> 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 /
> daphne@cevalogistics.com
>
>
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, March 17, 2017 9:54 AM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Data Import
>
> I feel DIH is much better for prototyping, even though people do use it in
> production. If you do want to use DIH, you may benefit from reviewing the
> DIH-DB example I am currently rewriting in
> https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
> luceneMatchVersion in solrconfig.xml first).
>
> CSV, etc, could be useful if you want to keep history of past imports,
> again useful during development, as you evolve schema.
>
> SolrJ may actually be easiest/best for production since you already have
> Java stack.
>
> The choice is yours in the end.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 17 March 2017 at 08:56, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 3/17/2017 3:04 AM, vishal jain wrote:
> >> I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know the available options are:
> >> 1) Post Tool
> >> 2) DIH
> >> 3) SolrJ (as ours is a J2EE application).
> >>
> >> I want to know what is the recommended way for Data import in
> >> production environment. Will sending data via SolrJ in batches be
> faster than posting a csv using POST tool?
> >
> > I've heard that CSV import runs EXTREMELY fast, but I have never
> > tested it.  The same threading problem that I discuss below would
> > apply to indexing this way.
> >
> > DIH is extremely powerful, but it has one glaring problem:  It's
> > single-threaded, which means that only one stream of data is going
> > into Solr, and each batch of documents to be inserted must wait for
> > the previous one to finish inserting before it can start.  I do not
> > know if DIH batches documents or sends them in one at a time.  If you
> > have a manually sharded index, you can run DIH on each shard in
> > parallel, but each one will be single-threaded.  That single thread is
> > pretty efficient, but it's still only one thread.
> >
> > Sending multiple index updates to Solr in parallel (multi-threading)
> > is how you radically speed up the Solr part of indexing.  This is
> > usually done with a custom indexing program, which might be written
> > with SolrJ or even in a completely different language.
> >
> > One thing to keep in mind with ANY indexing method:  Once the
> > situation is examined closely, most people find that it's not Solr
> > that makes their indexing slow.  The bottleneck is usually the source
> > system -- how quickly the data can be retrieved.  It usually takes a
> > lot longer to obtain the data than it does for Solr to index it.
> >
> > Thanks,
> > Shawn
> >
>


Re: Data Import

2017-03-17 Thread OTH
Could the database trigger not just post the change to solr?

On Fri, Mar 17, 2017 at 10:00 PM, Erick Erickson 
wrote:

> Or set a trigger on your RDBMS's main table to put the relevant
> information in a different table (call it EVENTS) and have your SolrJ
> consult the EVENTS table periodically. Essentially you're using the
> EVENTS table as a queue where the trigger is the producer and the
> SolrJ program is the consumer.
>
> It's a polling solution though, so not event-driven. There's no
> mechanism that I know of have, say, your RDBMS push an event to DIH
> for instance.
>
> Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> for this kind of problem..
>
> Best,
> Erick
>
> On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
>  wrote:
> > One assumes by hooking into the same code that updates RDBMS, as
> > opposed to be reverse engineering the changes from looking at the DB
> > content. This would be especially the case for Delete changes.
> >
> > Regards,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >
> >
> > On 17 March 2017 at 11:37, OTH  wrote:
> >>>
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr.
> >>
> >> How can SolrJ be used to make RDBMS updates immediately available?
> >> Thanks
> >>
> >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> sujaybawas...@gmail.com>
> >> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As per my experience DIH is the best for RDBMS to solr index. DIH with
> >>> caching has best performance. DIH nested entities allow you to define
> >>> simple queries.
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr. DIH full import can be used for index all data first
> >>> time or restore index in case index is corrupted.
> >>>
> >>> Thanks,
> >>> Sujay
> >>>
> >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> >
> >>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know
> >>> > the available options are:
> >>> > 1) Post Tool
> >>> > 2) DIH
> >>> > 3) SolrJ (as ours is a J2EE application).
> >>> >
> >>> > I want to know what is the recommended way for Data import in
> production
> >>> > environment.
> >>> > Will sending data via SolrJ in batches be faster than posting a csv
> using
> >>> > POST tool?
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Vishal
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Sujay P Bawaskar
> >>> M:+91-77091 53669
> >>>
>


Re: Data Import

2017-03-17 Thread Erick Erickson
Or set a trigger on your RDBMS's main table to put the relevant
information in a different table (call it EVENTS) and have your SolrJ
consult the EVENTS table periodically. Essentially you're using the
EVENTS table as a queue where the trigger is the producer and the
SolrJ program is the consumer.

It's a polling solution though, so not event-driven. There's no
mechanism that I know of to have, say, your RDBMS push an event to DIH,
for instance.

Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
for this kind of problem..
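
A bare-bones, untested sketch of that consumer with JDBC and SolrJ (the EVENTS
table layout, JDBC URL and core name are all invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class EventsPoller {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass")) {
            while (true) {
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT event_id, id, name, price FROM events ORDER BY event_id");
                     PreparedStatement del = db.prepareStatement(
                         "DELETE FROM events WHERE event_id = ?")) {
                    while (rs.next()) {
                        // A real version would also carry an action column so deletes
                        // in the main table turn into deleteById calls here.
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", rs.getString("id"));
                        doc.addField("name", rs.getString("name"));
                        doc.addField("price", rs.getDouble("price"));
                        solr.add(doc, 10000);                    // commitWithin 10 seconds
                        del.setLong(1, rs.getLong("event_id"));
                        del.executeUpdate();                     // consume the event only after Solr accepted it
                    }
                }
                Thread.sleep(30000);                             // poll every 30 seconds
            }
        }
    }
}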

Best,
Erick

On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
 wrote:
> One assumes by hooking into the same code that updates RDBMS, as
> opposed to be reverse engineering the changes from looking at the DB
> content. This would be especially the case for Delete changes.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 17 March 2017 at 11:37, OTH  wrote:
>>>
>>> Also, solrj is good when you want your RDBMS updates make immediately
>>> available in solr.
>>
>> How can SolrJ be used to make RDBMS updates immediately available?
>> Thanks
>>
>> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar 
>> wrote:
>>
>>> Hi Vishal,
>>>
>>> As per my experience DIH is the best for RDBMS to solr index. DIH with
>>> caching has best performance. DIH nested entities allow you to define
>>> simple queries.
>>> Also, solrj is good when you want your RDBMS updates make immediately
>>> available in solr. DIH full import can be used for index all data first
>>> time or restore index in case index is corrupted.
>>>
>>> Thanks,
>>> Sujay
>>>
>>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:
>>>
>>> > Hi,
>>> >
>>> >
>>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I know
>>> > the available options are:
>>> > 1) Post Tool
>>> > 2) DIH
>>> > 3) SolrJ (as ours is a J2EE application).
>>> >
>>> > I want to know what is the recommended way for Data import in production
>>> > environment.
>>> > Will sending data via SolrJ in batches be faster than posting a csv using
>>> > POST tool?
>>> >
>>> >
>>> > Thanks,
>>> > Vishal
>>> >
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Sujay P Bawaskar
>>> M:+91-77091 53669
>>>


Re: Data Import

2017-03-17 Thread vishal jain
Thanks to all of you for the valuable inputs.
Being on a J2EE platform, I also felt using SolrJ in a multi-threaded
environment would be a better choice for indexing RDBMS data into SolrCloud.
I will try a scheduler-triggered microservice to do the job using SolrJ.

Regards,
Vishal

On Fri, Mar 17, 2017 at 9:11 PM, Alexandre Rafalovitch 
wrote:

> One assumes by hooking into the same code that updates RDBMS, as
> opposed to be reverse engineering the changes from looking at the DB
> content. This would be especially the case for Delete changes.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 17 March 2017 at 11:37, OTH  wrote:
> >>
> >> Also, solrj is good when you want your RDBMS updates make immediately
> >> available in solr.
> >
> > How can SolrJ be used to make RDBMS updates immediately available?
> > Thanks
> >
> > On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar  >
> > wrote:
> >
> >> Hi Vishal,
> >>
> >> As per my experience DIH is the best for RDBMS to solr index. DIH with
> >> caching has best performance. DIH nested entities allow you to define
> >> simple queries.
> >> Also, solrj is good when you want your RDBMS updates make immediately
> >> available in solr. DIH full import can be used for index all data first
> >> time or restore index in case index is corrupted.
> >>
> >> Thanks,
> >> Sujay
> >>
> >> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> wrote:
> >>
> >> > Hi,
> >> >
> >> >
> >> > I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know
> >> > the available options are:
> >> > 1) Post Tool
> >> > 2) DIH
> >> > 3) SolrJ (as ours is a J2EE application).
> >> >
> >> > I want to know what is the recommended way for Data import in
> production
> >> > environment.
> >> > Will sending data via SolrJ in batches be faster than posting a csv
> using
> >> > POST tool?
> >> >
> >> >
> >> > Thanks,
> >> > Vishal
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Sujay P Bawaskar
> >> M:+91-77091 53669
> >>
>


Re: Data Import

2017-03-17 Thread Alexandre Rafalovitch
One assumes by hooking into the same code that updates RDBMS, as
opposed to be reverse engineering the changes from looking at the DB
content. This would be especially the case for Delete changes.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 11:37, OTH  wrote:
>>
>> Also, solrj is good when you want your RDBMS updates make immediately
>> available in solr.
>
> How can SolrJ be used to make RDBMS updates immediately available?
> Thanks
>
> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar 
> wrote:
>
>> Hi Vishal,
>>
>> As per my experience DIH is the best for RDBMS to solr index. DIH with
>> caching has best performance. DIH nested entities allow you to define
>> simple queries.
>> Also, solrj is good when you want your RDBMS updates make immediately
>> available in solr. DIH full import can be used for index all data first
>> time or restore index in case index is corrupted.
>>
>> Thanks,
>> Sujay
>>
>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:
>>
>> > Hi,
>> >
>> >
>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I know
>> > the available options are:
>> > 1) Post Tool
>> > 2) DIH
>> > 3) SolrJ (as ours is a J2EE application).
>> >
>> > I want to know what is the recommended way for Data import in production
>> > environment.
>> > Will sending data via SolrJ in batches be faster than posting a csv using
>> > POST tool?
>> >
>> >
>> > Thanks,
>> > Vishal
>> >
>>
>>
>>
>> --
>> Thanks,
>> Sujay P Bawaskar
>> M:+91-77091 53669
>>


Re: Data Import

2017-03-17 Thread OTH
>
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr.

How can SolrJ be used to make RDBMS updates immediately available?
Thanks

On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar 
wrote:

> Hi Vishal,
>
> As per my experience DIH is the best for RDBMS to solr index. DIH with
> caching has best performance. DIH nested entities allow you to define
> simple queries.
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr. DIH full import can be used for index all data first
> time or restore index in case index is corrupted.
>
> Thanks,
> Sujay
>
> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:
>
> > Hi,
> >
> >
> > I am new to Solr and am trying to move data from my RDBMS to Solr. I know
> > the available options are:
> > 1) Post Tool
> > 2) DIH
> > 3) SolrJ (as ours is a J2EE application).
> >
> > I want to know what is the recommended way for Data import in production
> > environment.
> > Will sending data via SolrJ in batches be faster than posting a csv using
> > POST tool?
> >
> >
> > Thanks,
> > Vishal
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669
>


RE: Data Import

2017-03-17 Thread Liu, Daphne
I just want to share my recent project. I have successfully sent all our EDI 
documents to Cassandra 3.7 clusters using Solr 6.3 Data Import JDBC Cassandra 
connector indexing our documents.
Since Cassandra is so fast for writing, the compression rate is around 13% and all 
my documents can be kept in my Cassandra clusters' memory, so we are very happy 
with the result.


Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
daphne@cevalogistics.com



-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Friday, March 17, 2017 9:54 AM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Data Import

I feel DIH is much better for prototyping, even though people do use it in 
production. If you do want to use DIH, you may benefit from reviewing the 
DIH-DB example I am currently rewriting in
https://issues.apache.org/jira/browse/SOLR-10312 (may need to change 
luceneMatchVersion in solrconfig.xml first).

CSV, etc, could be useful if you want to keep history of past imports, again 
useful during development, as you evolve schema.

SolrJ may actually be easiest/best for production since you already have Java 
stack.

The choice is yours in the end.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 08:56, Shawn Heisey <apa...@elyograg.org> wrote:
> On 3/17/2017 3:04 AM, vishal jain wrote:
>> I am new to Solr and am trying to move data from my RDBMS to Solr. I know 
>> the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>>
>> I want to know what is the recommended way for Data import in
>> production environment. Will sending data via SolrJ in batches be faster 
>> than posting a csv using POST tool?
>
> I've heard that CSV import runs EXTREMELY fast, but I have never
> tested it.  The same threading problem that I discuss below would
> apply to indexing this way.
>
> DIH is extremely powerful, but it has one glaring problem:  It's
> single-threaded, which means that only one stream of data is going
> into Solr, and each batch of documents to be inserted must wait for
> the previous one to finish inserting before it can start.  I do not
> know if DIH batches documents or sends them in one at a time.  If you
> have a manually sharded index, you can run DIH on each shard in
> parallel, but each one will be single-threaded.  That single thread is
> pretty efficient, but it's still only one thread.
>
> Sending multiple index updates to Solr in parallel (multi-threading)
> is how you radically speed up the Solr part of indexing.  This is
> usually done with a custom indexing program, which might be written
> with SolrJ or even in a completely different language.
>
> One thing to keep in mind with ANY indexing method:  Once the
> situation is examined closely, most people find that it's not Solr
> that makes their indexing slow.  The bottleneck is usually the source
> system -- how quickly the data can be retrieved.  It usually takes a
> lot longer to obtain the data than it does for Solr to index it.
>
> Thanks,
> Shawn
>


Re: Data Import

2017-03-17 Thread Alexandre Rafalovitch
I feel DIH is much better for prototyping, even though people do use
it in production. If you do want to use DIH, you may benefit from
reviewing the DIH-DB example I am currently rewriting in
https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
luceneMatchVersion in solrconfig.xml first).

CSV, etc, could be useful if you want to keep history of past imports,
again useful during development, as you evolve schema.

SolrJ may actually be easiest/best for production since you already
have Java stack.

The choice is yours in the end.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 08:56, Shawn Heisey  wrote:
> On 3/17/2017 3:04 AM, vishal jain wrote:
>> I am new to Solr and am trying to move data from my RDBMS to Solr. I know 
>> the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>>
>> I want to know what is the recommended way for Data import in production
>> environment. Will sending data via SolrJ in batches be faster than posting a 
>> csv using POST tool?
>
> I've heard that CSV import runs EXTREMELY fast, but I have never tested
> it.  The same threading problem that I discuss below would apply to
> indexing this way.
>
> DIH is extremely powerful, but it has one glaring problem:  It's
> single-threaded, which means that only one stream of data is going into
> Solr, and each batch of documents to be inserted must wait for the
> previous one to finish inserting before it can start.  I do not know if
> DIH batches documents or sends them in one at a time.  If you have a
> manually sharded index, you can run DIH on each shard in parallel, but
> each one will be single-threaded.  That single thread is pretty
> efficient, but it's still only one thread.
>
> Sending multiple index updates to Solr in parallel (multi-threading) is
> how you radically speed up the Solr part of indexing.  This is usually
> done with a custom indexing program, which might be written with SolrJ
> or even in a completely different language.
>
> One thing to keep in mind with ANY indexing method:  Once the situation
> is examined closely, most people find that it's not Solr that makes
> their indexing slow.  The bottleneck is usually the source system -- how
> quickly the data can be retrieved.  It usually takes a lot longer to
> obtain the data than it does for Solr to index it.
>
> Thanks,
> Shawn
>


Re: Data Import

2017-03-17 Thread Shawn Heisey
On 3/17/2017 3:04 AM, vishal jain wrote:
> I am new to Solr and am trying to move data from my RDBMS to Solr. I know the 
> available options are:
> 1) Post Tool
> 2) DIH
> 3) SolrJ (as ours is a J2EE application).
>
> I want to know what is the recommended way for Data import in production
> environment. Will sending data via SolrJ in batches be faster than posting a 
> csv using POST tool?

I've heard that CSV import runs EXTREMELY fast, but I have never tested
it.  The same threading problem that I discuss below would apply to
indexing this way.

DIH is extremely powerful, but it has one glaring problem:  It's
single-threaded, which means that only one stream of data is going into
Solr, and each batch of documents to be inserted must wait for the
previous one to finish inserting before it can start.  I do not know if
DIH batches documents or sends them in one at a time.  If you have a
manually sharded index, you can run DIH on each shard in parallel, but
each one will be single-threaded.  That single thread is pretty
efficient, but it's still only one thread.

Sending multiple index updates to Solr in parallel (multi-threading) is
how you radically speed up the Solr part of indexing.  This is usually
done with a custom indexing program, which might be written with SolrJ
or even in a completely different language.
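
For illustration, a minimal, untested sketch of that kind of multi-threaded
SolrJ indexer (core name, thread count, batch size and the fake document loop
are all placeholders for reading from the real source system):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // A single HttpSolrClient instance is thread-safe and can be shared by all workers.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        int threads = 8;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int slice = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                // Placeholder loop: in reality each worker reads its own slice of rows
                // from the source system (e.g. WHERE MOD(id, threads) = slice).
                for (long id = slice; id < 1_000_000; id += threads) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Long.toString(id));
                    batch.add(doc);
                    if (batch.size() == 1000) {      // send in batches, not one doc at a time
                        solr.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) solr.add(batch);
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        solr.commit();                               // one commit at the end
        solr.close();
    }
}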

One thing to keep in mind with ANY indexing method:  Once the situation
is examined closely, most people find that it's not Solr that makes
their indexing slow.  The bottleneck is usually the source system -- how
quickly the data can be retrieved.  It usually takes a lot longer to
obtain the data than it does for Solr to index it.

Thanks,
Shawn



Re: Data Import

2017-03-17 Thread Sujay Bawaskar
Hi Vishal,

As per my experience DIH is the best for RDBMS to solr index. DIH with
caching has best performance. DIH nested entities allow you to define
simple queries.
Also, solrj is good when you want your RDBMS updates make immediately
available in solr. DIH full import can be used for index all data first
time or restore index in case index is corrupted.

Thanks,
Sujay

On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:

> Hi,
>
>
> I am new to Solr and am trying to move data from my RDBMS to Solr. I know
> the available options are:
> 1) Post Tool
> 2) DIH
> 3) SolrJ (as ours is a J2EE application).
>
> I want to know what is the recommended way for Data import in production
> environment.
> Will sending data via SolrJ in batches be faster than posting a csv using
> POST tool?
>
>
> Thanks,
> Vishal
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Re: Data Import Handler on 6.4.1

2017-03-15 Thread Walter Underwood
Also, upgrade to 6.4.2. There are serious performance problems in 6.4.0 and 
6.4.1.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 15, 2017, at 12:05 PM, Liu, Daphne  
> wrote:
> 
> For Solr 6.3,  I have to move mine to 
> ../solr-6.3.0/server/solr-webapp/webapp/WEB-INF/lib. If you are using jetty.
> 
> Kind regards,
> 
> Daphne Liu
> BI Architect - Matrix SCM
> 
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
> USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
> daphne@cevalogistics.com
> 
> 
> -Original Message-
> From: Michael Tobias [mailto:mtob...@btinternet.com]
> Sent: Wednesday, March 15, 2017 2:36 PM
> To: solr-user@lucene.apache.org
> Subject: Data Import Handler on 6.4.1
> 
> I am sure I am missing something simple but
> 
> I am running Solr 4.8.1 and trialling 6.4.1 on another computer.
> 
> I have had to manually modify the automatic 6.4.1 schema config as we use a 
> set of specialised field types.  They work fine.
> 
> I am now trying to populate my core with data and having problems.
> 
> Exactly what names/paths should I be using in the solrconfig.xml file to get 
> this working - I don’t recall doing ANYTHING for 4.8.1
> 
>   <lib dir="..." regex=".*\.jar" />
>   <lib dir="..." regex="solr-dataimporthandler-.*\.jar" /> ?
> 
> And where do I put the mysql-connector-java-5.1.29-bin.jar file and how do I 
> reference it to get it loaded?
> 
>
> ??
> 
> And then later in the solrconfig.xml I have:
> 
>   <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">db-data-config.xml</str>
>     </lst>
>   </requestHandler>
> 
> 
> Any help much appreciated.
> 
> Regards
> 
> Michael
> 
> 
> -Original Message-
> From: David Hastings [mailto:hastings.recurs...@gmail.com]
> Sent: 15 March 2017 17:47
> To: solr-user@lucene.apache.org
> Subject: Re: Get handler not working
> 
> from your previous email:
> "There is no "id"
> field defined in the schema."
> 
> you need an id field to use the get handler
> 
> On Wed, Mar 15, 2017 at 1:45 PM, Chris Ulicny  wrote:
> 
>> I thought that "id" and "ids" were fixed parameters for the get
>> handler, but I never remember, so I've already tried both. Each time
>> it comes back with the same response of no document.
>> 
>> On Wed, Mar 15, 2017 at 1:31 PM Alexandre Rafalovitch
>> 
>> wrote:
>> 
>>> Actually.
>>> 
>>> I think Real Time Get handler has "id" as a magical parameter, not
>>> as a field name. It maps to the real id field via the uniqueKey
>>> definition:
>>> https://cwiki.apache.org/confluence/display/solr/RealTime+Get
>>> 
>>> So, if you have not, could you try the way you originally wrote it.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>> 
>>> 
>>> On 15 March 2017 at 13:22, Chris Ulicny  wrote:
 Sorry, that is a typo. The get is using the iqdocid field. There
 is no
>>> "id"
 field defined in the schema.
 
 solr/TestCollection/get?iqdocid=2957-TV-201604141900
 
 solr/TestCollection/select?q=*:*=iqdocid:2957-TV-201604141900
 
 On Wed, Mar 15, 2017 at 1:15 PM Erick Erickson <
>> erickerick...@gmail.com>
 wrote:
 
> Is this a typo or are you trying to use get with an "id" field
> and your filter query uses "iqdocid"?
> 
> Best,
> Erick
> 
> On Wed, Mar 15, 2017 at 8:31 AM, Chris Ulicny 
>> wrote:
>> Yes, we're using a fixed schema with the iqdocid field set as
>> the
> uniqueKey.
>> 
>> On Wed, Mar 15, 2017 at 11:28 AM Alexandre Rafalovitch <
> arafa...@gmail.com>
>> wrote:
>> 
>>> What is your uniqueKey? Is it iqdocid?
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>>> 
>>> 
>>> On 15 March 2017 at 11:24, Chris Ulicny  wrote:
 Hi,
 
 I've been trying to use the get handler for a new solr cloud
> collection
>>> we
 are using, and something seems to be amiss.
 
 We are running 6.3.0, so we did not explicitly define the
 request
> handler
 in the solrconfig since it's supposed to be implicitly defined.
>> We
> also
 have the update log enabled with the default configuration.
 
 Whenever I send a get query for a document already known to
 be in
>>> the
 collection, I get no documents returned. But when I use a
 filter
> query on
 the uniqueKey field for the same value I get the document
 back
 
 solr/TestCollection/get?id=2957-TV-201604141900
 
 solr/TestCollection/select?q=*:*=iqdocid:2957-TV-20160414
 1900
 
 Is there some configuration 

RE: Data Import Handler on 6.4.1

2017-03-15 Thread Liu, Daphne
For Solr 6.3, I had to move mine to
../solr-6.3.0/server/solr-webapp/webapp/WEB-INF/lib, if you are using Jetty.

Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
daphne@cevalogistics.com


-Original Message-
From: Michael Tobias [mailto:mtob...@btinternet.com]
Sent: Wednesday, March 15, 2017 2:36 PM
To: solr-user@lucene.apache.org
Subject: Data Import Handler on 6.4.1

I am sure I am missing something simple but

I am running Solr 4.8.1 and trialling 6.4.1 on another computer.

I have had to manually modify the automatic 6.4.1 schema config as we use a set 
of specialised field types.  They work fine.

I am now trying to populate my core with data and having problems.

Exactly what names/paths should I be using in the solrconfig.xml file to get 
this working - I don’t recall doing ANYTHING for 4.8.1

<lib dir="..." regex=".*\.jar" />
<lib dir="..." regex="solr-dataimporthandler-.*\.jar" /> ?

And where do I put the mysql-connector-java-5.1.29-bin.jar file and how do I 
reference it to get it loaded?


??

And then later in the solrconfig.xml I have:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>


Any help much appreciated.

Regards

Michael


-Original Message-
From: David Hastings [mailto:hastings.recurs...@gmail.com]
Sent: 15 March 2017 17:47
To: solr-user@lucene.apache.org
Subject: Re: Get handler not working

from your previous email:
"There is no "id"
field defined in the schema."

you need an id field to use the get handler

On Wed, Mar 15, 2017 at 1:45 PM, Chris Ulicny  wrote:

> I thought that "id" and "ids" were fixed parameters for the get
> handler, but I never remember, so I've already tried both. Each time
> it comes back with the same response of no document.
>
> On Wed, Mar 15, 2017 at 1:31 PM Alexandre Rafalovitch
> 
> wrote:
>
> > Actually.
> >
> > I think Real Time Get handler has "id" as a magical parameter, not
> > as a field name. It maps to the real id field via the uniqueKey
> > definition:
> > https://cwiki.apache.org/confluence/display/solr/RealTime+Get
> >
> > So, if you have not, could you try the way you originally wrote it.
> >
> > Regards,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >
> >
> > On 15 March 2017 at 13:22, Chris Ulicny  wrote:
> > > Sorry, that is a typo. The get is using the iqdocid field. There
> > > is no
> > "id"
> > > field defined in the schema.
> > >
> > > solr/TestCollection/get?iqdocid=2957-TV-201604141900
> > >
> > > solr/TestCollection/select?q=*:*=iqdocid:2957-TV-201604141900
> > >
> > > On Wed, Mar 15, 2017 at 1:15 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Is this a typo or are you trying to use get with an "id" field
> > >> and your filter query uses "iqdocid"?
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Wed, Mar 15, 2017 at 8:31 AM, Chris Ulicny 
> wrote:
> > >> > Yes, we're using a fixed schema with the iqdocid field set as
> > >> > the
> > >> uniqueKey.
> > >> >
> > >> > On Wed, Mar 15, 2017 at 11:28 AM Alexandre Rafalovitch <
> > >> arafa...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> What is your uniqueKey? Is it iqdocid?
> > >> >>
> > >> >> Regards,
> > >> >>Alex.
> > >> >> 
> > >> >> http://www.solr-start.com/ - Resources for Solr users, new and
> > >> experienced
> > >> >>
> > >> >>
> > >> >> On 15 March 2017 at 11:24, Chris Ulicny  wrote:
> > >> >> > Hi,
> > >> >> >
> > >> >> > I've been trying to use the get handler for a new solr cloud
> > >> collection
> > >> >> we
> > >> >> > are using, and something seems to be amiss.
> > >> >> >
> > >> >> > We are running 6.3.0, so we did not explicitly define the
> > >> >> > request
> > >> handler
> > >> >> > in the solrconfig since it's supposed to be implicitly defined.
> We
> > >> also
> > >> >> > have the update log enabled with the default configuration.
> > >> >> >
> > >> >> > Whenever I send a get query for a document already known to
> > >> >> > be in
> > the
> > >> >> > collection, I get no documents returned. But when I use a
> > >> >> > filter
> > >> query on
> > >> >> > the uniqueKey field for the same value I get the document
> > >> >> > back
> > >> >> >
> > >> >> > solr/TestCollection/get?id=2957-TV-201604141900
> > >> >> >
> > >> >> > solr/TestCollection/select?q=*:*=iqdocid:2957-TV-20160414
> > >> >> > 1900
> > >> >> >
> > >> >> > Is there some configuration that I am missing?
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Chris
> > >> >>
> > >>
> >
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-05 Thread Damien Kamerman
You could configure the DataImportHandler to not delete at the start
(either do a delta import or set preImportDeleteQuery), and set a
postImportDeleteQuery if required.

On Saturday, 4 March 2017, Alexandre Rafalovitch  wrote:

> Commit is index global. So if you have overlapping timelines and commit is
> issued, it will affect all changes done to that point.
>
> So, the aliases may be better for you. You could potentially also reload a
> core with changed solrconfig.xml settings, but that's heavy on caches.
>
> Regards,
>Alex
>
> On 3 Mar 2017 1:21 PM, "Sales"  >
> wrote:
>
>
> >
> > You have indicated that you have a way to avoid doing updates during the
> > full import.  Because of this, you do have another option that is likely
> > much easier for you to implement:  Set the "commitWithin" parameter on
> > each update request.  This works almost identically to autoSoftCommit,
> > but only after a request is made.  As long as there are never any of
> > these updates during a full import, these commits cannot affect that
> import.
>
> I had attempted at least to say that there may be a few updates that happen
> at the start of an import, so, they are while an import is happening just
> due to timing issues. Those will be detected, and, re-executed once the
> import is done though. But my question here is if the update is using
> commitWithin, then, does that only affect those updates that have the
> parameter, or, does it then also soft commit the in progress import? I
> cannot guarantee that zero updates will be done as there is a timing issue
> at the very start of the import, so, a few could cross over.
>
> Adding commitWithin is fine. Just want to make sure those that might
> execute for the first few seconds of an import don’t kill anything.
> >
> > No matter what is happening, you should have autoCommit (not
> > autoSoftCommit) configured with openSearcher set to false.  This will
> > ensure transaction log rollover, without affecting change visibility.  I
> > recommend a maxTime of one to five minutes for this.  You'll see 15
> > seconds as the recommended value in many places.
> >
> > https://lucidworks.com/2013/08/23/understanding-
> transaction-logs-softcommit-and-commit-in-sorlcloud/ <
> https://lucidworks.com/2013/08/23/understanding-
> transaction-logs-softcommit-
> and-commit-in-sorlcloud/>
>
> Oh, we are fine with much longer, does not have to be instant. 10-15
> minutes would be fine.
>
> >
> > Thanks
> > Shawn
> >
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Alexandre Rafalovitch
Commit is index global. So if you have overlapping timelines and commit is
issued, it will affect all changes done to that point.

So, the aliases may be better for you. You could potentially also reload a
core with changed solrconfig.xml settings, but that's heavy on caches.

Regards,
   Alex

On 3 Mar 2017 1:21 PM, "Sales" 
wrote:


>
> You have indicated that you have a way to avoid doing updates during the
> full import.  Because of this, you do have another option that is likely
> much easier for you to implement:  Set the "commitWithin" parameter on
> each update request.  This works almost identically to autoSoftCommit,
> but only after a request is made.  As long as there are never any of
> these updates during a full import, these commits cannot affect that
import.

I had attempted at least to say that there may be a few updates that happen
at the start of an import, so, they are while an import is happening just
due to timing issues. Those will be detected, and, re-executed once the
import is done though. But my question here is if the update is using
commitWithin, then, does that only affect those updates that have the
parameter, or, does it then also soft commit the in progress import? I
cannot guarantee that zero updates will be done as there is a timing issue
at the very start of the import, so, a few could cross over.

Adding commitWithin is fine. Just want to make sure those that might
execute for the first few seconds of an import don’t kill anything.
>
> No matter what is happening, you should have autoCommit (not
> autoSoftCommit) configured with openSearcher set to false.  This will
> ensure transaction log rollover, without affecting change visibility.  I
> recommend a maxTime of one to five minutes for this.  You'll see 15
> seconds as the recommended value in many places.
>
> https://lucidworks.com/2013/08/23/understanding-
transaction-logs-softcommit-and-commit-in-sorlcloud/ <
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-
and-commit-in-sorlcloud/>

Oh, we are fine with much longer, does not have to be instant. 10-15
minutes would be fine.

>
> Thanks
> Shawn
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales

> 
> You have indicated that you have a way to avoid doing updates during the
> full import.  Because of this, you do have another option that is likely
> much easier for you to implement:  Set the "commitWithin" parameter on
> each update request.  This works almost identically to autoSoftCommit,
> but only after a request is made.  As long as there are never any of
> these updates during a full import, these commits cannot affect that import.

I had attempted at least to say that there may be a few updates that happen at 
the start of an import, while the import is running, just due to timing issues. 
Those will be detected and re-executed once the import is done, though. But my 
question here is: if the update uses commitWithin, does that only affect the 
updates that carry the parameter, or does it also soft commit the in-progress 
import? I cannot guarantee that zero updates will be done, as there is a timing 
issue at the very start of the import, so a few could cross over. 

Adding commitWithin is fine. Just want to make sure those that might execute 
for the first few seconds of an import don’t kill anything. 
> 
> No matter what is happening, you should have autoCommit (not
> autoSoftCommit) configured with openSearcher set to false.  This will
> ensure transaction log rollover, without affecting change visibility.  I
> recommend a maxTime of one to five minutes for this.  You'll see 15
> seconds as the recommended value in many places.
> 
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>  
> 

Oh, we are fine with much longer, does not have to be instant. 10-15 minutes 
would be fine.

> 
> Thanks
> Shawn
> 



Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Shawn Heisey
On 3/3/2017 10:17 AM, Sales wrote:
> I am not sure how best to handle this. We use the data import handle to 
> re-sync all our data on a daily basis, takes 1-2 hours depending on system 
> load. It is set up to commit at the end, so, the old index remains until it’s 
> done, and, we lose no access while the import is happening.
>
> But, we now want to update certain fields in the index, but still regen 
> daily. So, it would seem we might need to autocommit, and, soft commit 
> potentially. When we enabled those, during the index, the data disappeared 
> since it kept soft committing during the import process, I see no way to 
> avoid soft commits during the import. But soft commits would appear to be 
> needed for the (non import) updates to the index. 
>
> I realize the import could happen while an update is done, but we can 
> actually avoid those. So, that is not an issue (one or two might go through, 
> but, we will redo those updates once the index is done, that part is all 
> handled.

Erick's solution of using aliases to swap a live index and a build index
is one very good way to go.  It does involve some additional complexity
that you may not be ready for.  Only you will know whether that's
something you can implement easily.  Collection aliasing was implemented
in Solr 4.2 by SOLR-4497, so 4.10 should definitely have it.

You have indicated that you have a way to avoid doing updates during the
full import.  Because of this, you do have another option that is likely
much easier for you to implement:  Set the "commitWithin" parameter on
each update request.  This works almost identically to autoSoftCommit,
but only after a request is made.  As long as there are never any of
these updates during a full import, these commits cannot affect that import.
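
For what it's worth, a tiny untested SolrJ sketch of such an update (core name
and fields are invented); the second argument to add() is the commitWithin time
in milliseconds:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LiveUpdate {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");
        doc.addField("price", 19.99);
        solr.add(doc, 60000);   // becomes visible within 60 seconds, no explicit commit needed
        solr.close();
    }
}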

No matter what is happening, you should have autoCommit (not
autoSoftCommit) configured with openSearcher set to false.  This will
ensure transaction log rollover, without affecting change visibility.  I
recommend a maxTime of one to five minutes for this.  You'll see 15
seconds as the recommended value in many places.

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks
Shawn



Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales

> On Mar 3, 2017, at 11:30 AM, Erick Erickson  wrote:
> 
> One way to handle this (presuming SolrCloud) is collection aliasing.
> You create two collections, c1 and c2. You then have two aliases. when
> you start "index" is aliased to c1 and "search" is aliased to c2. Now
> do your full import  to "index" (and, BTW, you'd be well advised to do
> at least a hard commit openSearcher=false during that time or you risk
> replaying all the docs in the tlog).
> 
> When the full import is done, switch the aliases so "search" points to c1 and
> "index" points to c2. Rinse. Repeat. Your client apps always use the same 
> alias,
> the alias switching makes whether c1 or c2 is being used transparent.
> By that I mean your user-facing app uses "search" and your indexing client
> uses "index".
> 
> You can now do your live updates to the "search" alias that has a soft
> commit set.
> Of course you have to have some mechanism for replaying all the live updates
> that came in when you were doing your full index into the "indexing"
> alias before
> you switch, but you say you have that handled.
> 
> Best,
> Erick
> 

Thanks. So, is this available on 4.10.4? 

If not, we used to gen another core, do the import, and swap cores, so this is 
possibly similar to collection aliases since, in the end, the client did not 
care. I don’t see why that would not still work. It took a little effort to 
automate, but not much. 
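
For reference, the old swap step can also be driven from SolrJ; a rough,
untested sketch (core names are invented, and it assumes a SolrJ version that
has HttpSolrClient.Builder):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

public class SwapCores {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            CoreAdminRequest swap = new CoreAdminRequest();
            swap.setAction(CoreAdminAction.SWAP);
            swap.setCoreName("products_build");       // the freshly imported core
            swap.setOtherCoreName("products_live");   // the core the application queries
            swap.process(client);
        }
    }
}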

Regarding the import and commit, we use readOnly in data-config.xml, which as I 
understand it sets autocommit. Not sure what happens with openSearcher though. 
If that is not sufficient, how would I do a hard commit with openSearcher=false 
during that time? Surely not by modifying the config file?

Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Erick Erickson
One way to handle this (presuming SolrCloud) is collection aliasing.
You create two collections, c1 and c2. You then have two aliases. when
you start "index" is aliased to c1 and "search" is aliased to c2. Now
do your full import to "index" (and, BTW, you'd be well advised to do
at least a hard commit with openSearcher=false during that time or you risk
replaying all the docs in the tlog).

When the full import is done, switch the aliases so "search" points to c1 and
"index" points to c2. Rinse. Repeat. Your client apps always use the same alias,
the alias switching makes whether c1 or c2 is being used transparent.
By that I mean your user-facing app uses "search" and your indexing client
uses "index".

You can now do your live updates to the "search" alias that has a soft
commit set.
Of course you have to have some mechanism for replaying all the live updates
that came in when you were doing your full index into the "indexing"
alias before
you switch, but you say you have that handled.
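
A rough, untested sketch of the alias flip once the rebuild finishes (assumes a
SolrJ version that has the CollectionAdminRequest.createAlias factory;
collection names are invented):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwapAliases {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // After rebuilding c1: searches move to c1, the next rebuild goes into c2.
            CollectionAdminRequest.createAlias("search", "c1").process(client);
            CollectionAdminRequest.createAlias("index", "c2").process(client);
        }
    }
}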

Best,
Erick

On Fri, Mar 3, 2017 at 9:22 AM, Alexandre Rafalovitch
 wrote:
> On 3 March 2017 at 12:17, Sales  
> wrote:
>> When we enabled those, during the index, the data disappeared since it kept 
>> soft committing during the import process,
>
> This part does not quite make sense. Could you expand on this "data
> disappeared" part to understand what the issue is.
>
> The main issue with "update" is that all fields (apart from pure
> copyField destinations) need to be stored, so the document can be
> reconstructed, updated, re-indexed. Perhaps you have something strange
> happening around that?
>
> Regards,
>Alex.
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales
> 
> On Mar 3, 2017, at 11:22 AM, Alexandre Rafalovitch  wrote:
> 
> On 3 March 2017 at 12:17, Sales  
> wrote:
>> When we enabled those, during the index, the data disappeared since it kept 
>> soft committing during the import process,
> 
> This part does not quite make sense. Could you expand on this "data
> disappeared" part to understand what the issue is.
> 

So, the issue here is that the first step of the import is to erase all the 
data, so there are no products left in the index (it would appear, based on 
what we see, after the first soft commit), and a search returns no results at 
first, then an ever-increasing number of records while the import is happening. 
We have 6 million indexed products.

I can't find a way to stop soft commits during the import?

Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Alexandre Rafalovitch
On 3 March 2017 at 12:17, Sales  wrote:
> When we enabled those, during the index, the data disappeared since it kept 
> soft committing during the import process,

This part does not quite make sense. Could you expand on this "data
disappeared" part to understand what the issue is.

The main issue with "update" is that all fields (apart from pure
copyField destinations) need to be stored, so the document can be
reconstructed, updated, re-indexed. Perhaps you have something strange
happening around that?

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: Data Import Handlers not working after upgrade from 6.3.0 to 6.4.0

2017-01-25 Thread Shawn Heisey
On 1/25/2017 4:06 PM, Dan Scarf wrote:
> I upgraded Solr 6.3.0 this morning to 6.4.0. All seemed good according to
> the logs but this afternoon we discovered that the DataImport tabs in our
> Collections now say:
>
>  'Sorry, no dataimport-handler defined!'.

This is a bug that only applies to 6.4.0.

https://issues.apache.org/jira/browse/SOLR-10035

Note that this is only a problem in the admin UI.  The DIH handler
itself should work just fine, independent of the admin UI.  Requests
really should be sent to the actual handler, not the admin UI.

Thanks,
Shawn



Re: Data Import Handler - maximum?

2016-12-12 Thread Shawn Heisey
On 12/11/2016 8:00 PM, Brian Narsi wrote:
> We are using Solr 5.1.0 and DIH to build index.
>
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
>
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?
>
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?

There are no hard limits other than the Lucene limit of a little over
two billion docs per individual index.  With sharding, Solr is able to
easily overcome this limit on an entire index.

I have one index where each shard was over 50 million docs.  Each shard
has fewer docs now, because I changed it so there are more shards and
more machines.  For some reason the rebuild time (using DIH) got really
really long -- nearly 48 hours -- while building every shard in
parallel.  Still haven't figured out why the build time increased
dramatically.

One problem you might run into with DIH from a database has to do with
merging.  With default merge scheduler settings, eventually (typically
when there are millions of rows being imported) you'll run into a pause
in indexing that will take so long that the database connection will
close, causing the import to fail after the pause finishes.

I even opened a Lucene issue to get the default value for maxMergeCount
changed.  This issue went nowhere:

https://issues.apache.org/jira/browse/LUCENE-5705

Here's a thread from this mailing list discussing the problem and the
configuration solution:

http://lucene.472066.n3.nabble.com/What-does-quot-too-many-merges-stalling-quot-in-indexwriter-log-mean-td4077380.html

Thanks,
Shawn



Re: Data Import Handler - maximum?

2016-12-12 Thread Bernd Fehling

Am 12.12.2016 um 04:00 schrieb Brian Narsi:
> We are using Solr 5.1.0 and DIH to build index.
> 
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
> 
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?

Afaik, DIH will run until the maximum number of documents per index is reached.
Our longest run took about 3.5 days for a single DIH and over 100 million docs.
The runtime depends pretty much on the complexity of the analysis during 
loading.

Currently we are using concurrent DIH with 12 processes which takes 15 hours
for the same amount. Optimizing afterwards takes 9.5 hours.

SolrJ with 12 threads is doing the same indexing within 7.5 hours plus 
optimizing.
For huge amounts of data you should consider using SolrJ.
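
A small, untested sketch of that kind of SolrJ bulk load using
ConcurrentUpdateSolrClient, which buffers documents and streams them to Solr
from a pool of background threads (URL, queue size and thread count are made
up; assumes a SolrJ version that has this Builder):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        try (ConcurrentUpdateSolrClient solr =
                 new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycore")
                     .withQueueSize(10000)
                     .withThreadCount(12)
                     .build()) {
            for (long id = 0; id < 1_000_000; id++) {   // placeholder for rows read from the source system
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Long.toString(id));
                solr.add(doc);
            }
            solr.blockUntilFinished();                  // wait for the background threads to drain the queue
            solr.commit();
        }
    }
}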

> 
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?
> 
> Thanks a bunch!
> 


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Marek Ščevlík
I ran my jar application beside the running Solr instance where I want to
trigger a DIH import.
I tried this approach:

String urlString1 = "http://localhost:8983/solr/db/dataimport";
SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "full-import");
SolrRequest request = new QueryRequest(params);
solr1.request(request);

.. and it returns now:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:8983/solr/db/dataimport: Expected mime type
application/octet-stream but got text/html. 


Error 404 Not Found

HTTP ERROR 404
Problem accessing /solr/db/dataimport/select. Reason:
Not Found



So I am still confused now ...

What do you think ? Any ideas?

I am trying to figure it out. The silly thing is, when I create a simple URL
call with the URL string used in those Solr request objects and fire it off
in Java, it does the desired thing.

Weird. I think.

Thanks for any replies or help.


2016-11-26 20:03 GMT+01:00 Marek Ščevlík :

> Actually to be honest I realized that I only needed to trigger a data
> import handler from a jar file. Previously this was done in earlier
> versions via the SolrServer object. Now I am thinking if this is OK?:
>
> String urlString1 = "http://localhost:8983/solr/";
> SolrClient solr1 = new HttpSolrClient.Builder(urlString).build();
>   
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("db","/dataimport");
> params.set("command", "full-import");
> System.out.println(params.toString());
> QueryResponse qresponse1 = solr1.query(params);
>
> System.out.println("response = " + qresponse1);
>
> Output i get from this is: response = {responseHeader={status=0,
> QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},
> response={numFound=0,start=0,docs=[]}}
>
> There is a core db which come with the examples in solr 6.3 package. It is
> loaded. From web ui admin I can operate it a run the dih reindex process.
>
> I wonder whether this could work ? What do you think? I am trying to call
> DIH whilst solr is running. This code is in a separate jar file that is run
> besides solr instance.
>
> This so far is not working for me. And I wonder why? What do you think?
> Should this work at all? OR perhaps someone else could help out.
>
>
> Thanks anyone for any help.
> 
>
> 2016-11-25 19:50 GMT+01:00 Marek Ščevlík :
>
>> I forgot to mention I am creating a jar file beside of a running solr 6.3
>> instance to which I am hoping to attach with java via the
>> SolrDispatchFilter to get at the cores and so then I could work with
>> data in code.
>>
>>
>> 2016-11-25 19:31 GMT+01:00 Marek Ščevlík :
>>
>>> Hi Daniel. Thanks for a reply. I wonder, is it still possible with the
>>> release of Solr 6.3 to get hold of a running instance of the Jetty server
>>> that is part of the solution? I found some code for previous versions where
>>> it was captured with this code and one could then obtain cores for a
>>> running solr instance ...
>>>
>>> SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty
>>>
>>> .getDispatchFilter().getFilter();
>>>
>>>
>>> I was trying to implement it this way but that is not working out very
>>> well now. I cant seem to get the jetty server object for the running
>>> instance. I tried several combinations but none seemed to work.
>>>
>>> Can you perhaps point me in the right direction?
>>>
>>> Perhaps you may know more than I do at the moment.
>>>
>>>
>>> Any help would be great.
>>>
>>>
>>> Thanks a lot
>>> Regards Marek Scevlik
>>>
>>>
>>>
>>> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
>>> daniel.da...@nih.gov>:
>>>
 Marek,

 I've wanted to do something like this in the past as well.  However, a
 rewrite that supports the same XML syntax might be better.   There are
 several problems with the design of the Data Import Handler that make it
 not quite suitable:

 - Not designed for Multi-threading
 - Bad implementation of XPath

 Another issue is that one of the big advantages of Data Import Handler
 goes away at this point, which is that it is hosted within Solr, and has a
 UI for testing within the Solr admin.

 A better open-source Java solution might be to connect Solr with Apache
 Camel - http://camel.apache.org/solr.html.

 If you are not tied absolutely to pure open-source, and freemium
 products will do, then you might look at Pentaho Spoon and Kettle.
  Although Talend is much more established in the market, I find Pentaho's
 XML-based ETL a bit easier to integrate as a developer, and unit test and
 such.   Talend does better when you have a full infrastructure set up, but
 then the attention required to unit tests and Git 

Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Erick Erickson
on a quick glance, and not having tried this myself...

this seems wrong. You're setting a URL parameter "db":
params.set("db","/dataimport");

that's equivalent to a URL like
http://localhost:8983/solr?db=/dataimport

you'd want:
http://localhost:8983/solr/db/dataimport?command=full-import

I think you want to set your url for your HTTPClient to
the full solr path to dataimport handler, i.e something like
...solr/collection_or_core/dataimport
then set the params for dataimport handler like you are, i.e.:
params.set("command", "full-import");

Best,
Erick
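
A minimal SolrJ 6.x sketch of the approach described above, assuming a core
named "db" with the DIH handler registered at /dataimport (both taken from the
example setup discussed in this thread):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TriggerDih {
    public static void main(String[] args) throws Exception {
        // Base URL points at the core, so the request path below resolves to
        // http://localhost:8983/solr/db/dataimport
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/db").build();

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");

        QueryRequest req = new QueryRequest(params);
        req.setPath("/dataimport");   // send the request to the DIH handler instead of /select

        QueryResponse rsp = req.process(solr);
        System.out.println("DIH response: " + rsp);
        solr.close();
    }
}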

On Sat, Nov 26, 2016 at 11:03 AM, Marek Ščevlík
 wrote:
> Actually to be honest I realized that I only needed to trigger a data
> import handler from a jar file. Previously this was done in earlier
> versions via the SolrServer object. Now I am thinking if this is OK?:
>
> String urlString1 = "http://localhost:8983/solr/";
> SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();
>
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("db","/dataimport");
> params.set("command", "full-import");
> System.out.println(params.toString());
> QueryResponse qresponse1 = solr1.query(params);
>
> System.out.println("response = " + qresponse1);
>
> Output i get from this is: response =
> {responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}
>
> There is a core db which come with the examples in solr 6.3 package. It is
> loaded. From web ui admin I can operate it a run the dih reindex process.
>
> I wonder whether this could work ? What do you think? I am trying to call
> DIH whilst solr is running. This code is in a separate jar file that is run
> besides solr instance.
>
> This so far is not working for me. And I wonder why? What do you think?
> Should this work at all? OR perhaps someone else could help out.
>
>
> Thanks anyone for any help.
> 
>
> 2016-11-25 19:50 GMT+01:00 Marek Ščevlík :
>
>> I forgot to mention I am creating a jar file beside of a running solr 6.3
>> instance to which I am hoping to attach with java via the
>> SolrDispatchFilter to get at the cores and so then I could work with data
>> in code.
>>
>>
>> 2016-11-25 19:31 GMT+01:00 Marek Ščevlík :
>>
>>> Hi Daniel. Thanks for a reply. I wonder is it now still possibly with
>>> release of Solr 6.3 to get hold of a running instance of the jetty server
>>> that is part of the solution? I found some code for previous versions where
>>> it was captured with this code and one could then obtain cores for a
>>> running solr instance ...
>>>
>>> SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty
>>>
>>> .getDispatchFilter().getFilter();
>>>
>>>
>>> I was trying to implement it this way but that is not working out very
>>> well now. I cant seem to get the jetty server object for the running
>>> instance. I tried several combinations but none seemed to work.
>>>
>>> Can you perhaps point me in the right direction?
>>>
>>> Perhaps you may know more than I do at the moment.
>>>
>>>
>>> Any help would be great.
>>>
>>>
>>> Thanks a lot
>>> Regards Marek Scevlik
>>>
>>>
>>>
>>> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
>>> daniel.da...@nih.gov>:
>>>
 Marek,

 I've wanted to do something like this in the past as well.  However, a
 rewrite that supports the same XML syntax might be better.   There are
 several problems with the design of the Data Import Handler that make it
 not quite suitable:

 - Not designed for Multi-threading
 - Bad implementation of XPath

 Another issue is that one of the big advantages of Data Import Handler
 goes away at this point, which is that it is hosted within Solr, and has a
 UI for testing within the Solr admin.

 A better open-source Java solution might be to connect Solr with Apache
 Camel - http://camel.apache.org/solr.html.

 If you are not tied absolutely to pure open-source, and freemium
 products will do, then you might look at Pentaho Spoon and Kettle.
  Although Talend is much more established in the market, I find Pentaho's
 XML-based ETL a bit easier to integrate as a developer, and unit test and
 such.   Talend does better when you have a full infrastructure set up, but
 then the attention required to unit tests and Git integration seems over
 the top.

 Another powerful way to get things done, depending on what you are
 indexing, is to use LogStash and couple that with Document processing
 chains.   Many of our projects benefit from having a single RDBMS view,
 perhaps a materialized view, that is used for the index.   LogStash does
 just fine here, pulling from the RDBMS and posting each row to Solr.  The
 hierarchical execution of Data Import Handler is very nice, but this can
 

Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Marek Ščevlík
Actually, to be honest, I realized that I only needed to trigger a data
import handler from a jar file. Previously this was done in earlier
versions via the SolrServer object. Now I am wondering if this is OK:

String urlString1 = "http://localhost:8983/solr/";
SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("db","/dataimport");
params.set("command", "full-import");
System.out.println(params.toString());
QueryResponse qresponse1 = solr1.query(params);

System.out.println("response = " + qresponse1);

Output i get from this is: response =
{responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}

There is a core "db" which comes with the examples in the Solr 6.3 package. It is
loaded. From the admin web UI I can operate it and run the DIH reindex process.

I wonder whether this could work? What do you think? I am trying to call
DIH whilst Solr is running. This code is in a separate jar file that is run
beside the Solr instance.

This so far is not working for me. And I wonder why? What do you think?
Should this work at all? OR perhaps someone else could help out.


Thanks anyone for any help.


2016-11-25 19:50 GMT+01:00 Marek Ščevlík :

> I forgot to mention I am creating a jar file beside of a running solr 6.3
> instance to which I am hoping to attach with java via the
> SolrDispatchFilter to get at the cores and so then I could work with data
> in code.
>
>
> 2016-11-25 19:31 GMT+01:00 Marek Ščevlík :
>
>> Hi Daniel. Thanks for a reply. I wonder is it now still possibly with
>> release of Solr 6.3 to get hold of a running instance of the jetty server
>> that is part of the solution? I found some code for previous versions where
>> it was captured with this code and one could then obtain cores for a
>> running solr instance ...
>>
>> SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty
>>
>> .getDispatchFilter().getFilter();
>>
>>
>> I was trying to implement it this way but that is not working out very
>> well now. I cant seem to get the jetty server object for the running
>> instance. I tried several combinations but none seemed to work.
>>
>> Can you perhaps point me in the right direction?
>>
>> Perhaps you may know more than I do at the moment.
>>
>>
>> Any help would be great.
>>
>>
>> Thanks a lot
>> Regards Marek Scevlik
>>
>>
>>
>> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
>> daniel.da...@nih.gov>:
>>
>>> Marek,
>>>
>>> I've wanted to do something like this in the past as well.  However, a
>>> rewrite that supports the same XML syntax might be better.   There are
>>> several problems with the design of the Data Import Handler that make it
>>> not quite suitable:
>>>
>>> - Not designed for Multi-threading
>>> - Bad implementation of XPath
>>>
>>> Another issue is that one of the big advantages of Data Import Handler
>>> goes away at this point, which is that it is hosted within Solr, and has a
>>> UI for testing within the Solr admin.
>>>
>>> A better open-source Java solution might be to connect Solr with Apache
>>> Camel - http://camel.apache.org/solr.html.
>>>
>>> If you are not tied absolutely to pure open-source, and freemium
>>> products will do, then you might look at Pentaho Spoon and Kettle.
>>>  Although Talend is much more established in the market, I find Pentaho's
>>> XML-based ETL a bit easier to integrate as a developer, and unit test and
>>> such.   Talend does better when you have a full infrastructure set up, but
>>> then the attention required to unit tests and Git integration seems over
>>> the top.
>>>
>>> Another powerful way to get things done, depending on what you are
>>> indexing, is to use LogStash and couple that with Document processing
>>> chains.   Many of our projects benefit from having a single RDBMS view,
>>> perhaps a materialized view, that is used for the index.   LogStash does
>>> just fine here, pulling from the RDBMS and posting each row to Solr.  The
>>> hierarchical execution of Data Import Handler is very nice, but this can
>>> often be handled on the RDBMS side by creating a view, maybe using
>>> functions to provide some rows.   Many RDBMS systems also support
>>> federation and the import of XML from files, so that this brings XML
>>> processing into the picture.
>>>
>>> Hoping this helps,
>>>
>>> Dan Davis, Systems/Applications Architect (Contractor),
>>> Office of Computer and Communications Systems,
>>> National Library of Medicine, NIH
>>>
>>>
>>>
>>>
>>> -Original Message-
>>> From: Marek Ščevlík [mailto:mscev...@codenameprojects.com]
>>> Sent: Friday, November 18, 2016 9:29 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Data Import Request Handler isolated into its own project - any
>>> suggestions?
>>>
>>> Hello. My name is Marek Scevlik.
>>>

Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-25 Thread Marek Ščevlík
I forgot to mention I am creating a jar file alongside a running Solr 6.3
instance, to which I am hoping to attach with Java via the SolrDispatchFilter
to get at the cores, so that I could then work with the data in code.


2016-11-25 19:31 GMT+01:00 Marek Ščevlík :

> Hi Daniel. Thanks for a reply. I wonder is it now still possibly with
> release of Solr 6.3 to get hold of a running instance of the jetty server
> that is part of the solution? I found some code for previous versions where
> it was captured with this code and one could then obtain cores for a
> running solr instance ...
>
> SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty
>
> .getDispatchFilter().getFilter();
>
>
> I was trying to implement it this way but that is not working out very
> well now. I cant seem to get the jetty server object for the running
> instance. I tried several combinations but none seemed to work.
>
> Can you perhaps point me in the right direction?
>
> Perhaps you may know more than I do at the moment.
>
>
> Any help would be great.
>
>
> Thanks a lot
> Regards Marek Scevlik
>
>
>
> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
> daniel.da...@nih.gov>:
>
>> Marek,
>>
>> I've wanted to do something like this in the past as well.  However, a
>> rewrite that supports the same XML syntax might be better.   There are
>> several problems with the design of the Data Import Handler that make it
>> not quite suitable:
>>
>> - Not designed for Multi-threading
>> - Bad implementation of XPath
>>
>> Another issue is that one of the big advantages of Data Import Handler
>> goes away at this point, which is that it is hosted within Solr, and has a
>> UI for testing within the Solr admin.
>>
>> A better open-source Java solution might be to connect Solr with Apache
>> Camel - http://camel.apache.org/solr.html.
>>
>> If you are not tied absolutely to pure open-source, and freemium products
>> will do, then you might look at Pentaho Spoon and Kettle.   Although Talend
>> is much more established in the market, I find Pentaho's XML-based ETL a
>> bit easier to integrate as a developer, and unit test and such.   Talend
>> does better when you have a full infrastructure set up, but then the
>> attention required to unit tests and Git integration seems over the top.
>>
>> Another powerful way to get things done, depending on what you are
>> indexing, is to use LogStash and couple that with Document processing
>> chains.   Many of our projects benefit from having a single RDBMS view,
>> perhaps a materialized view, that is used for the index.   LogStash does
>> just fine here, pulling from the RDBMS and posting each row to Solr.  The
>> hierarchical execution of Data Import Handler is very nice, but this can
>> often be handled on the RDBMS side by creating a view, maybe using
>> functions to provide some rows.   Many RDBMS systems also support
>> federation and the import of XML from files, so that this brings XML
>> processing into the picture.
>>
>> Hoping this helps,
>>
>> Dan Davis, Systems/Applications Architect (Contractor),
>> Office of Computer and Communications Systems,
>> National Library of Medicine, NIH
>>
>>
>>
>>
>> -Original Message-
>> From: Marek Ščevlík [mailto:mscev...@codenameprojects.com]
>> Sent: Friday, November 18, 2016 9:29 AM
>> To: solr-user@lucene.apache.org
>> Subject: Data Import Request Handler isolated into its own project - any
>> suggestions?
>>
>> Hello. My name is Marek Scevlik.
>>
>>
>>
>> Currently I am working for a small company where we are interested in
>> implementing your Sorl 6.3 search engine.
>>
>>
>>
>> We are hoping to take out from the original source package the Data
>> Import Request Handler into its own project and create a usable .jar file
>> out of it.
>>
>>
>>
>> It should then serve as tool that would allow to connect to a remote
>> server and return data for us to our other application that would use the
>> returned data.
>>
>>
>>
>> What do you think? Would anything like this possible? To isolate out the
>> Data Import Request Handler into its own standalone project?
>>
>>
>>
>> If we could achieve this we won’t mind to share with the community this
>> new feature.
>>
>>
>>
>> I realize this is a first email and may lead into several hundreds so for
>> the start my request is very simple and not so high level detailed but I am
>> sure you realize it may lead into being quite complex.
>>
>>
>>
>> So I wonder if anyone replies.
>>
>>
>>
>> Thanks a lot for any replies and further info or guidance.
>>
>>
>>
>>
>>
>> Thanks.
>>
>> Regards Marek Scevlik
>>
>
>


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-25 Thread Marek Ščevlík
Hi Daniel. Thanks for the reply. I wonder, is it still possible with the
release of Solr 6.3 to get hold of a running instance of the Jetty server
that is part of the solution? I found some code for previous versions where
it was captured with this code and one could then obtain cores for a
running Solr instance ...

SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty

.getDispatchFilter().getFilter();


I was trying to implement it this way but that is not working out very well
now. I can't seem to get the Jetty server object for the running instance. I
tried several combinations but none seemed to work.

Can you perhaps point me in the right direction?

Perhaps you may know more than I do at the moment.


Any help would be great.


Thanks a lot
Regards Marek Scevlik



2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] :

> Marek,
>
> I've wanted to do something like this in the past as well.  However, a
> rewrite that supports the same XML syntax might be better.   There are
> several problems with the design of the Data Import Handler that make it
> not quite suitable:
>
> - Not designed for Multi-threading
> - Bad implementation of XPath
>
> Another issue is that one of the big advantages of Data Import Handler
> goes away at this point, which is that it is hosted within Solr, and has a
> UI for testing within the Solr admin.
>
> A better open-source Java solution might be to connect Solr with Apache
> Camel - http://camel.apache.org/solr.html.
>
> If you are not tied absolutely to pure open-source, and freemium products
> will do, then you might look at Pentaho Spoon and Kettle.   Although Talend
> is much more established in the market, I find Pentaho's XML-based ETL a
> bit easier to integrate as a developer, and unit test and such.   Talend
> does better when you have a full infrastructure set up, but then the
> attention required to unit tests and Git integration seems over the top.
>
> Another powerful way to get things done, depending on what you are
> indexing, is to use LogStash and couple that with Document processing
> chains.   Many of our projects benefit from having a single RDBMS view,
> perhaps a materialized view, that is used for the index.   LogStash does
> just fine here, pulling from the RDBMS and posting each row to Solr.  The
> hierarchical execution of Data Import Handler is very nice, but this can
> often be handled on the RDBMS side by creating a view, maybe using
> functions to provide some rows.   Many RDBMS systems also support
> federation and the import of XML from files, so that this brings XML
> processing into the picture.
>
> Hoping this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>
>
>
>
> -Original Message-
> From: Marek Ščevlík [mailto:mscev...@codenameprojects.com]
> Sent: Friday, November 18, 2016 9:29 AM
> To: solr-user@lucene.apache.org
> Subject: Data Import Request Handler isolated into its own project - any
> suggestions?
>
> Hello. My name is Marek Scevlik.
>
>
>
> Currently I am working for a small company where we are interested in
> implementing your Sorl 6.3 search engine.
>
>
>
> We are hoping to take out from the original source package the Data Import
> Request Handler into its own project and create a usable .jar file out of
> it.
>
>
>
> It should then serve as tool that would allow to connect to a remote
> server and return data for us to our other application that would use the
> returned data.
>
>
>
> What do you think? Would anything like this possible? To isolate out the
> Data Import Request Handler into its own standalone project?
>
>
>
> If we could achieve this we won’t mind to share with the community this
> new feature.
>
>
>
> I realize this is a first email and may lead into several hundreds so for
> the start my request is very simple and not so high level detailed but I am
> sure you realize it may lead into being quite complex.
>
>
>
> So I wonder if anyone replies.
>
>
>
> Thanks a lot for any replies and further info or guidance.
>
>
>
>
>
> Thanks.
>
> Regards Marek Scevlik
>


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Alexandre Rafalovitch
Is your goal to still index into Solr? It was not clear.

If yes, then it has been discussed quite a bit. The challenge is that
DIH is integrated into AdminUI, which makes it easier to see the
progress and set some flags. Plus the required jars are loaded via
solrconfig.xml, just like all other extra libraries. So, contribution
back would need to take that into account.

If you are not ready to face that, it may make sense to look at other
libraries first. Apache Camel, Apache NiFi, Cloudera morphline, etc.
All of them can send data into Solr, though their version support
differ. For example Camel seems to need Solr 3.5 still. Somebody
updating their implementation to Solr 6.3 and contributing that back
to that project would do a lot of good.

Regards,
Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 19 November 2016 at 01:29, Marek Ščevlík
 wrote:
> Hello. My name is Marek Scevlik.
>
>
>
> Currently I am working for a small company where we are interested in
> implementing your Sorl 6.3 search engine.
>
>
>
> We are hoping to take out from the original source package the Data Import
> Request Handler into its own project and create a usable .jar file out of
> it.
>
>
>
> It should then serve as tool that would allow to connect to a remote server
> and return data for us to our other application that would use the returned
> data.
>
>
>
> What do you think? Would anything like this possible? To isolate out the
> Data Import Request Handler into its own standalone project?
>
>
>
> If we could achieve this we won’t mind to share with the community this new
> feature.
>
>
>
> I realize this is a first email and may lead into several hundreds so for
> the start my request is very simple and not so high level detailed but I am
> sure you realize it may lead into being quite complex.
>
>
>
> So I wonder if anyone replies.
>
>
>
> Thanks a lot for any replies and further info or guidance.
>
>
>
>
>
> Thanks.
>
> Regards Marek Scevlik


RE: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Davis, Daniel (NIH/NLM) [C]
Marek,

I've wanted to do something like this in the past as well.  However, a rewrite 
that supports the same XML syntax might be better.   There are several problems 
with the design of the Data Import Handler that make it not quite suitable:

- Not designed for Multi-threading
- Bad implementation of XPath

Another issue is that one of the big advantages of Data Import Handler goes 
away at this point, which is that it is hosted within Solr, and has a UI for 
testing within the Solr admin.

A better open-source Java solution might be to connect Solr with Apache Camel - 
http://camel.apache.org/solr.html.

If you are not tied absolutely to pure open-source, and freemium products will 
do, then you might look at Pentaho Spoon and Kettle.   Although Talend is much 
more established in the market, I find Pentaho's XML-based ETL a bit easier to 
integrate as a developer, and unit test and such.   Talend does better when you 
have a full infrastructure set up, but then the attention required to unit 
tests and Git integration seems over the top.

Another powerful way to get things done, depending on what you are indexing, is 
to use LogStash and couple that with Document processing chains.   Many of our 
projects benefit from having a single RDBMS view, perhaps a materialized view, 
that is used for the index.   LogStash does just fine here, pulling from the 
RDBMS and posting each row to Solr.  The hierarchical execution of Data Import 
Handler is very nice, but this can often be handled on the RDBMS side by 
creating a view, maybe using functions to provide some rows.   Many RDBMS 
systems also support federation and the import of XML from files, so that this 
brings XML processing into the picture.

Hoping this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH




-Original Message-
From: Marek Ščevlík [mailto:mscev...@codenameprojects.com] 
Sent: Friday, November 18, 2016 9:29 AM
To: solr-user@lucene.apache.org
Subject: Data Import Request Handler isolated into its own project - any 
suggestions?

Hello. My name is Marek Scevlik.



Currently I am working for a small company where we are interested in 
implementing your Solr 6.3 search engine.



We are hoping to take out from the original source package the Data Import 
Request Handler into its own project and create a usable .jar file out of it.



It should then serve as tool that would allow to connect to a remote server and 
return data for us to our other application that would use the returned data.



What do you think? Would anything like this possible? To isolate out the Data 
Import Request Handler into its own standalone project?



If we could achieve this we won’t mind to share with the community this new 
feature.



I realize this is a first email and may lead into several hundreds so for the 
start my request is very simple and not so high level detailed but I am sure 
you realize it may lead into being quite complex.



So I wonder if anyone replies.



Thanks a lot for any replies and further info or guidance.





Thanks.

Regards Marek Scevlik


RE: Data import handler in techproducts example

2016-07-07 Thread Brooks Chuck (FCA)
Hello Jonas,

Did you figure this out? 

Dr. Chuck Brooks
248-838-5070


-Original Message-
From: Jonas Vasiliauskas [mailto:jonas.vasiliaus...@yahoo.com.INVALID] 
Sent: Saturday, July 02, 2016 11:37 AM
To: solr-user@lucene.apache.org
Subject: Data import handler in techproducts example

Hey,

I'm quite new to solr and java environments. I have a goal for myself to import 
some data from mysql database in techproducts (core) example.

I have setup data import handler (DIH) for techproducts based on instructions 
here https://wiki.apache.org/solr/DIHQuickStart , but looks like solr doesn't 
load DIH libraries, could someone please explain in quick words on how to check 
if DIH is loaded and if not - how can I load it ?

Stacktrace is here: http://pastebin.ca/3654347

Thanks,


Re: Data import handler in techproducts example

2016-07-02 Thread Ahmet Arslan
Hi Jonas,

Search for the solr-dataimporthandler-*.jar and place it under a lib directory
(same level as the solr.xml file) along with the MySQL JDBC driver
(mysql-connector-java-*.jar).

Please see:
https://cwiki.apache.org/confluence/display/solr/Lib+Directives+in+SolrConfig




On Saturday, July 2, 2016 9:56 PM, Jonas Vasiliauskas 
 wrote:
Hey,

I'm quite new to solr and java environments. I have a goal for myself to 
import some data from mysql database in techproducts (core) example.

I have setup data import handler (DIH) for techproducts based on 
instructions here https://wiki.apache.org/solr/DIHQuickStart , but looks 
like solr doesn't load DIH libraries, could someone please explain in 
quick words on how to check if DIH is loaded and if not - how can I load 
it ?

Stacktrace is here: http://pastebin.ca/3654347

Thanks,


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread Erick Erickson
There's nothing saying you have
to highlight fields you search on. So you
can specify hl.fl to be the "normal" (perhaps
stored-only) fields and still search on the
uber-field.

Best,
Erick
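
A short SolrJ sketch of that combination; the field names "text" (the copyField
catch-all) and "content" (the stored field) are assumptions for illustration,
not taken from this thread:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UberFieldHighlight {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrQuery q = new SolrQuery("some search term");
        q.set("df", "text");           // search against the catch-all copyField target
        q.setHighlight(true);
        q.set("hl.fl", "content");     // highlight the stored field instead
        // hl.requireFieldMatch defaults to false, so highlighting a field other
        // than the one queried works out of the box.

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getHighlighting());
        solr.close();
    }
}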

On Thu, May 26, 2016 at 2:08 PM, kostali hassan
 wrote:
> I did it , I copied all my dynamic field into text field and it work great.
> just one question even if I copied text into content and the inverse for
> get highliting , thats not work ,they are another way to get highliting?
> thank you eric
>
> 2016-05-26 18:28 GMT+01:00 Erick Erickson :
>
>> And, you can copy all of the fields into an "uber field" using the
>> copyField directive and just search the "uber field".
>>
>> Best,
>> Erick
>>
>> On Thu, May 26, 2016 at 7:35 AM, kostali hassan
>>  wrote:
>> > thank you it make sence .
>> > have a good day
>> >
>> > 2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu > >:
>> >
>> >> The schema.xml/managed_schema defines the default search field as
>> `text`.
>> >>
>> >> You can make all fields that you want searchable type `text`.
>> >>
>> >> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
>> >> med.has.kost...@gmail.com>
>> >> wrote:
>> >>
>> >> > I import data from sql databases with DIH . I am looking for serch
>> term
>> >> in
>> >> > all fields not by field.
>> >> >
>> >>
>>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread kostali hassan
I did it; I copied all my dynamic fields into the text field and it works great.
Just one question: even though I copied text into content and the inverse to
get highlighting, that does not work. Is there another way to get highlighting?
Thank you, Erick

2016-05-26 18:28 GMT+01:00 Erick Erickson :

> And, you can copy all of the fields into an "uber field" using the
> copyField directive and just search the "uber field".
>
> Best,
> Erick
>
> On Thu, May 26, 2016 at 7:35 AM, kostali hassan
>  wrote:
> > thank you it make sence .
> > have a good day
> >
> > 2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu  >:
> >
> >> The schema.xml/managed_schema defines the default search field as
> `text`.
> >>
> >> You can make all fields that you want searchable type `text`.
> >>
> >> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
> >> med.has.kost...@gmail.com>
> >> wrote:
> >>
> >> > I import data from sql databases with DIH . I am looking for serch
> term
> >> in
> >> > all fields not by field.
> >> >
> >>
>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread Erick Erickson
And, you can copy all of the fields into an "uber field" using the
copyField directive and just search the "uber field".

Best,
Erick

On Thu, May 26, 2016 at 7:35 AM, kostali hassan
 wrote:
> thank you it make sence .
> have a good day
>
> 2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu :
>
>> The schema.xml/managed_schema defines the default search field as `text`.
>>
>> You can make all fields that you want searchable type `text`.
>>
>> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
>> med.has.kost...@gmail.com>
>> wrote:
>>
>> > I import data from sql databases with DIH . I am looking for serch term
>> in
>> > all fields not by field.
>> >
>>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread kostali hassan
Thank you, it makes sense.
Have a good day.

2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu :

> The schema.xml/managed_schema defines the default search field as `text`.
>
> You can make all fields that you want searchable type `text`.
>
> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
> med.has.kost...@gmail.com>
> wrote:
>
> > I import data from sql databases with DIH . I am looking for serch term
> in
> > all fields not by field.
> >
>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread Siddhartha Singh Sandhu
The schema.xml/managed_schema defines the default search field as `text`.

You can make all fields that you want searchable type `text`.

On Thu, May 26, 2016 at 10:23 AM, kostali hassan 
wrote:

> I import data from sql databases with DIH . I am looking for serch term in
> all fields not by field.
>


Re: Data Import Handler - Multivalued fields - splitBy

2016-02-27 Thread saravanan1980
It's resolved after changing my column name... it's all case sensitive...







Re: Data Import Handler - Multivalued fields - splitBy

2016-02-27 Thread saravanan1980
I am also having the same problem.

Have you resolved this issue?

"response": {
"numFound": 3,
"start": 0,
"docs": [
  {
"genre": [
  "Action|Adventure",
  "Action",
  "Adventure"
]
  },
  {
"genre": [
  "Drama|Suspense",
  "Drama",
  "Suspense"
]
  },
  {
"genre": [
  "Adventure|Family|Fantasy|Science Fiction",
  "Adventure",
  "Family",
  "Fantasy",
  "Science Fiction"
]
  }
]
  }

Please let me know, if it is resolved...

 










Re: Data Import Handler Usage

2016-02-16 Thread vidya
Hi

The Dataimport section of the admin web UI still shows that no data import
handler is defined, and no data is being added to my new collection.





Re: Data Import Handler Usage

2016-02-16 Thread Erik Hatcher
The "other" collection (destination of the import) is the collection where that 
data import handler definition resides. 

   Erik

> On Feb 16, 2016, at 01:54, vidya  wrote:
> 
> Hi
> 
> I have gone through documents to define data import handler in solr. But i
> couldnot implement it.
> I have created data-config.xml file that specifies moving data from
> collection1 core to another collection, i donno where i need to specify that
> second collection.
> 
> <dataConfig>
>   <document>
>     <entity processor="SolrEntityProcessor"
>         url="http://localhost:8983/solr/collection1" query="*:*"/>
>   </document>
> </dataConfig>
> 
> and request handler is defined as follows in solrconfig.xml
> <requestHandler name="/dataimport"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">/home/username/data-config.xml</str>
>   </lst>
> </requestHandler>
> 
> Even after adding this, i couldnot get any data import handler in web url
> page for importing.
> Why is it so? And what changes need to be done?
> I have followed the following url : 
> http://www.codewrecks.com/blog/index.php/2013/4/29/loading-data-from-sql-server-to-solr-with-a-data-import-handler
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Usage-tp4257518.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler - autoSoftCommit and autoCommit

2016-02-08 Thread Rajesh Hazari
We have this for a collection which is updated every 3 minutes with a minimum
of 500 documents, and once a day with 10k documents at the start of the day:


   ${solr.autoCommit.maxTime:30}
1
true
true
 
   
  ${solr.autoSoftCommit.maxTime:6000}
   

As per the Solr documentation, if you have a Solr client indexing documents,
it is not suggested to use commit=true and optimize=true explicitly.

We have not tested the data import handler with 10 million records.

We settled on this config after many tests and after understanding
the needs and requirements.


Rajesh.
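
A minimal SolrJ sketch of the commitWithin alternative mentioned above; the
collection name, document id and the 60-second window are placeholders, not
values from this thread:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        // Ask Solr to make the document searchable within 60 seconds instead of
        // sending an explicit commit=true from the client; autoCommit/autoSoftCommit
        // in solrconfig.xml still govern the rest.
        solr.add(doc, 60000);
        solr.close();
    }
}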

On Mon, Feb 8, 2016 at 10:15 AM, Troy Edwards 
wrote:

> We are running the data import handler to retrieve about 10 million records
> during work hours every day of the week. We are using Clean = true, Commit
> = true and Optimize = true. The entire process takes about 1 hour.
>
> What would be a good setting for autoCommit and autoSoftCommit?
>
> Thanks
>


Re: Data Import Handler - autoSoftCommit and autoCommit

2016-02-08 Thread Susheel Kumar
You can start with one of the suggestions from this link based on your
indexing and query load.


https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


Thanks,
Susheel

On Mon, Feb 8, 2016 at 10:15 AM, Troy Edwards 
wrote:

> We are running the data import handler to retrieve about 10 million records
> during work hours every day of the week. We are using Clean = true, Commit
> = true and Optimize = true. The entire process takes about 1 hour.
>
> What would be a good setting for autoCommit and autoSoftCommit?
>
> Thanks
>


Re: Data Import Handler takes different time on different machines

2016-02-03 Thread Troy Edwards
While researching the space on the servers, I found that log files from
Sept 2015 are still there. These are solr_gc_log_datetime and
solr_log_datetime.

Is the default logging for Solr ok for production systems or does it need
to be changed/tuned?

Thanks,

On Tue, Feb 2, 2016 at 2:04 PM, Troy Edwards 
wrote:

> That is help!
>
> Thank you for the thoughts.
>
>
> On Tue, Feb 2, 2016 at 12:17 PM, Erick Erickson 
> wrote:
>
>> Scratch that installation and start over?
>>
>> Really, it sounds like something is fundamentally messed up with the
>> Linux install. Perhaps something as simple as file paths, or you have
>> old jars hanging around that are mis-matched. Or someone manually
>> deleted files from the Solr install. Or your disk filled up. Or
>>
>> How sure are you that the linux setup was done properly?
>>
>> Not much help I know,
>> Erick
>>
>> On Tue, Feb 2, 2016 at 10:11 AM, Troy Edwards 
>> wrote:
>> > Rerunning the Data Import Handler again on the the linux machine has
>> > started producing some errors and warnings:
>> >
>> > On the node on which DIH was started:
>> >
>> > WARN SolrWriter Error creating document : SolrInputDocument
>> >
>> > org.apache.solr.common.SolrException: No registered leader was found
>> > after waiting for 4000ms , collection: collectionmain slice: shard1
>> >
>> >
>> >
>> > On the second node:
>> >
>> > WARN ReplicationHandler Exception while writing response for params:
>> >
>> command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip
>> >
>> > java.nio.file.NoSuchFileException:
>> >
>> /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip
>> >
>> >
>> > ERROR
>> >
>> > Index fetch failed :org.apache.solr.common.SolrException: Unable to
>> > download _169.si completely. Downloaded 0!=466
>> >
>> >
>> > ReplicationHandler Index fetch failed
>> > :org.apache.solr.common.SolrException: Unable to download _169.si
>> > completely. Downloaded 0!=466
>> >
>> > WARN
>> > IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum
>> is
>> > 3549855722 and actual is checksum 2062372352. expected length is 72522
>> and
>> > actual length is 39227
>> >
>> > WARN UpdateLog Log replay finished.
>> recoveryInfo=RecoveryInfo{adds=840638
>> > deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}
>> >
>> >
>> > Any suggestions about this?
>> >
>> > Thanks
>> >
>> > On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> The first thing I'd be looking at is how I the JDBC batch size compares
>> >> between the two machines.
>> >>
>> >> AFAIK, Solr shouldn't notice the difference, and since a large majority
>> >> of the development is done on Linux-based systems, I'd be surprised if
>> >> this was worse than Windows, which would lead me to the one thing that
>> >> is definitely different between the two: Your JDBC driver and its
>> settings.
>> >> At least that's where I'd look first.
>> >>
>> >> If nothing immediate pops up, I'd probably write a small driver
>> program to
>> >> just access the database from the two machines and process your 10M
>> >> records _without_ sending them to Solr and see what the comparison is.
>> >>
>> >> You can also forgo DIH and do a simple import program via SolrJ. The
>> >> advantage here is that the comparison I'm talking about above is
>> >> really simple, just comment out the call that sends data to Solr.
>> Here's an
>> >> example...
>> >>
>> >> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards > >
>> >> wrote:
>> >> > Sorry, I should explain further. The Data Import Handler had been
>> running
>> >> > for a while retrieving only about 15 records from the database.
>> Both
>> >> in
>> >> > development env (windows) and linux machine it took about 3 mins.
>> >> >
>> >> > The query has been changed and we are now trying to retrieve about 10
>> >> > million records. We do expect the time to increase.
>> >> >
>> >> > With the new query the time taken on windows machine is consistently
>> >> around
>> >> > 40 mins. While the DIH is running queries slow down i.e. a query that
>> >> > typically took 60 msec takes 100 msec.
>> >> >
>> >> > The time taken on linux machine is consistently around 2.5 hours.
>> While
>> >> the
>> >> > DIH is running queries take about 200  to 400 msec.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> What happens if you run just the SQL query from the
>> >> >> windows box and from the linux box? Is there any chance
>> >> >> that somehow the connection from the linux box is
>> >> >> just slower?
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>> >> >> 

Re: Data Import Handler takes different time on different machines

2016-02-02 Thread Erick Erickson
Scratch that installation and start over?

Really, it sounds like something is fundamentally messed up with the
Linux install. Perhaps something as simple as file paths, or you have
old jars hanging around that are mis-matched. Or someone manually
deleted files from the Solr install. Or your disk filled up. Or

How sure are you that the linux setup was done properly?

Not much help I know,
Erick

On Tue, Feb 2, 2016 at 10:11 AM, Troy Edwards  wrote:
> Rerunning the Data Import Handler again on the the linux machine has
> started producing some errors and warnings:
>
> On the node on which DIH was started:
>
> WARN SolrWriter Error creating document : SolrInputDocument
>
> org.apache.solr.common.SolrException: No registered leader was found
> after waiting for 4000ms , collection: collectionmain slice: shard1
>
>
>
> On the second node:
>
> WARN ReplicationHandler Exception while writing response for params:
> command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip
>
> java.nio.file.NoSuchFileException:
> /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip
>
>
> ERROR
>
> Index fetch failed :org.apache.solr.common.SolrException: Unable to
> download _169.si completely. Downloaded 0!=466
>
>
> ReplicationHandler Index fetch failed
> :org.apache.solr.common.SolrException: Unable to download _169.si
> completely. Downloaded 0!=466
>
> WARN
> IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
> 3549855722 and actual is checksum 2062372352. expected length is 72522 and
> actual length is 39227
>
> WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
> deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}
>
>
> Any suggestions about this?
>
> Thanks
>
> On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson 
> wrote:
>
>> The first thing I'd be looking at is how I the JDBC batch size compares
>> between the two machines.
>>
>> AFAIK, Solr shouldn't notice the difference, and since a large majority
>> of the development is done on Linux-based systems, I'd be surprised if
>> this was worse than Windows, which would lead me to the one thing that
>> is definitely different between the two: Your JDBC driver and its settings.
>> At least that's where I'd look first.
>>
>> If nothing immediate pops up, I'd probably write a small driver program to
>> just access the database from the two machines and process your 10M
>> records _without_ sending them to Solr and see what the comparison is.
>>
>> You can also forgo DIH and do a simple import program via SolrJ. The
>> advantage here is that the comparison I'm talking about above is
>> really simple, just comment out the call that sends data to Solr. Here's an
>> example...
>>
>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards 
>> wrote:
>> > Sorry, I should explain further. The Data Import Handler had been running
>> > for a while retrieving only about 15 records from the database. Both
>> in
>> > development env (windows) and linux machine it took about 3 mins.
>> >
>> > The query has been changed and we are now trying to retrieve about 10
>> > million records. We do expect the time to increase.
>> >
>> > With the new query the time taken on windows machine is consistently
>> around
>> > 40 mins. While the DIH is running queries slow down i.e. a query that
>> > typically took 60 msec takes 100 msec.
>> >
>> > The time taken on linux machine is consistently around 2.5 hours. While
>> the
>> > DIH is running queries take about 200  to 400 msec.
>> >
>> > Thanks!
>> >
>> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
>> > wrote:
>> >
>> >> What happens if you run just the SQL query from the
>> >> windows box and from the linux box? Is there any chance
>> >> that somehow the connection from the linux box is
>> >> just slower?
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>> >>  wrote:
>> >> > What are you importing from? Is the source and Solr machine collocated
>> >> > in the same fashion on dev and prod?
>> >> >
>> >> > Have you tried running this on a Linux dev machine? Perhaps your prod
>> >> > machine is loaded much more than a dev.
>> >> >
>> >> > Regards,
>> >> >Alex.
>> >> > 
>> >> > Newsletter and resources for Solr beginners and intermediates:
>> >> > http://www.solr-start.com/
>> >> >
>> >> >
>> >> > On 2 February 2016 at 13:21, Troy Edwards 
>> >> wrote:
>> >> >> We have a windows development machine on which the Data Import
>> Handler
>> >> >> consistently takes about 40 mins to finish. Queries run fine. JVM
>> >> memory is
>> >> >> 2 GB per node.
>> >> >>
>> >> >> But on a linux machine it consistently takes about 2.5 hours. The
>> >> queries
>> >> >> also run slower. JVM memory 

Re: Data Import Handler takes different time on different machines

2016-02-02 Thread Troy Edwards
That is help!

Thank you for the thoughts.


On Tue, Feb 2, 2016 at 12:17 PM, Erick Erickson 
wrote:

> Scratch that installation and start over?
>
> Really, it sounds like something is fundamentally messed up with the
> Linux install. Perhaps something as simple as file paths, or you have
> old jars hanging around that are mis-matched. Or someone manually
> deleted files from the Solr install. Or your disk filled up. Or
>
> How sure are you that the linux setup was done properly?
>
> Not much help I know,
> Erick
>
> On Tue, Feb 2, 2016 at 10:11 AM, Troy Edwards 
> wrote:
> > Rerunning the Data Import Handler again on the the linux machine has
> > started producing some errors and warnings:
> >
> > On the node on which DIH was started:
> >
> > WARN SolrWriter Error creating document : SolrInputDocument
> >
> > org.apache.solr.common.SolrException: No registered leader was found
> > after waiting for 4000ms , collection: collectionmain slice: shard1
> >
> >
> >
> > On the second node:
> >
> > WARN ReplicationHandler Exception while writing response for params:
> >
> command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip
> >
> > java.nio.file.NoSuchFileException:
> >
> /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip
> >
> >
> > ERROR
> >
> > Index fetch failed :org.apache.solr.common.SolrException: Unable to
> > download _169.si completely. Downloaded 0!=466
> >
> >
> > ReplicationHandler Index fetch failed
> > :org.apache.solr.common.SolrException: Unable to download _169.si
> > completely. Downloaded 0!=466
> >
> > WARN
> > IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
> > 3549855722 and actual is checksum 2062372352. expected length is 72522
> and
> > actual length is 39227
> >
> > WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
> > deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}
> >
> >
> > Any suggestions about this?
> >
> > Thanks
> >
> > On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson  >
> > wrote:
> >
> >> The first thing I'd be looking at is how I the JDBC batch size compares
> >> between the two machines.
> >>
> >> AFAIK, Solr shouldn't notice the difference, and since a large majority
> >> of the development is done on Linux-based systems, I'd be surprised if
> >> this was worse than Windows, which would lead me to the one thing that
> >> is definitely different between the two: Your JDBC driver and its
> settings.
> >> At least that's where I'd look first.
> >>
> >> If nothing immediate pops up, I'd probably write a small driver program
> to
> >> just access the database from the two machines and process your 10M
> >> records _without_ sending them to Solr and see what the comparison is.
> >>
> >> You can also forgo DIH and do a simple import program via SolrJ. The
> >> advantage here is that the comparison I'm talking about above is
> >> really simple, just comment out the call that sends data to Solr.
> Here's an
> >> example...
> >>
> >> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards 
> >> wrote:
> >> > Sorry, I should explain further. The Data Import Handler had been
> running
> >> > for a while retrieving only about 15 records from the database.
> Both
> >> in
> >> > development env (windows) and linux machine it took about 3 mins.
> >> >
> >> > The query has been changed and we are now trying to retrieve about 10
> >> > million records. We do expect the time to increase.
> >> >
> >> > With the new query the time taken on windows machine is consistently
> >> around
> >> > 40 mins. While the DIH is running queries slow down i.e. a query that
> >> > typically took 60 msec takes 100 msec.
> >> >
> >> > The time taken on linux machine is consistently around 2.5 hours.
> While
> >> the
> >> > DIH is running queries take about 200  to 400 msec.
> >> >
> >> > Thanks!
> >> >
> >> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> What happens if you run just the SQL query from the
> >> >> windows box and from the linux box? Is there any chance
> >> >> that somehow the connection from the linux box is
> >> >> just slower?
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
> >> >>  wrote:
> >> >> > What are you importing from? Is the source and Solr machine
> collocated
> >> >> > in the same fashion on dev and prod?
> >> >> >
> >> >> > Have you tried running this on a Linux dev machine? Perhaps your
> prod
> >> >> > machine is loaded much more than a dev.
> >> >> >
> >> >> > Regards,
> >> >> >Alex.
> >> >> > 
> >> >> > Newsletter and resources for Solr beginners and intermediates:
> >> >> > http://www.solr-start.com/
> >> >> >
> >> >> >
> >> >> > 

Re: Data Import Handler takes different time on different machines

2016-02-02 Thread Troy Edwards
Rerunning the Data Import Handler again on the the linux machine has
started producing some errors and warnings:

On the node on which DIH was started:

WARN SolrWriter Error creating document : SolrInputDocument

org.apache.solr.common.SolrException: No registered leader was found
after waiting for 4000ms , collection: collectionmain slice: shard1



On the second node:

WARN ReplicationHandler Exception while writing response for params:
command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip

java.nio.file.NoSuchFileException:
/var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip


ERROR

Index fetch failed :org.apache.solr.common.SolrException: Unable to
download _169.si completely. Downloaded 0!=466


ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: Unable to download _169.si
completely. Downloaded 0!=466

WARN
IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
3549855722 and actual is checksum 2062372352. expected length is 72522 and
actual length is 39227

WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}


Any suggestions about this?

Thanks

On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson 
wrote:

> The first thing I'd be looking at is how I the JDBC batch size compares
> between the two machines.
>
> AFAIK, Solr shouldn't notice the difference, and since a large majority
> of the development is done on Linux-based systems, I'd be surprised if
> this was worse than Windows, which would lead me to the one thing that
> is definitely different between the two: Your JDBC driver and its settings.
> At least that's where I'd look first.
>
> If nothing immediate pops up, I'd probably write a small driver program to
> just access the database from the two machines and process your 10M
> records _without_ sending them to Solr and see what the comparison is.
>
> You can also forgo DIH and do a simple import program via SolrJ. The
> advantage here is that the comparison I'm talking about above is
> really simple, just comment out the call that sends data to Solr. Here's an
> example...
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards 
> wrote:
> > Sorry, I should explain further. The Data Import Handler had been running
> > for a while retrieving only about 15 records from the database. Both
> in
> > development env (windows) and linux machine it took about 3 mins.
> >
> > The query has been changed and we are now trying to retrieve about 10
> > million records. We do expect the time to increase.
> >
> > With the new query the time taken on windows machine is consistently
> around
> > 40 mins. While the DIH is running queries slow down i.e. a query that
> > typically took 60 msec takes 100 msec.
> >
> > The time taken on linux machine is consistently around 2.5 hours. While
> the
> > DIH is running queries take about 200  to 400 msec.
> >
> > Thanks!
> >
> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
> > wrote:
> >
> >> What happens if you run just the SQL query from the
> >> windows box and from the linux box? Is there any chance
> >> that somehow the connection from the linux box is
> >> just slower?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
> >>  wrote:
> >> > What are you importing from? Is the source and Solr machine collocated
> >> > in the same fashion on dev and prod?
> >> >
> >> > Have you tried running this on a Linux dev machine? Perhaps your prod
> >> > machine is loaded much more than a dev.
> >> >
> >> > Regards,
> >> >Alex.
> >> > 
> >> > Newsletter and resources for Solr beginners and intermediates:
> >> > http://www.solr-start.com/
> >> >
> >> >
> >> > On 2 February 2016 at 13:21, Troy Edwards 
> >> wrote:
> >> >> We have a windows development machine on which the Data Import
> Handler
> >> >> consistently takes about 40 mins to finish. Queries run fine. JVM
> >> memory is
> >> >> 2 GB per node.
> >> >>
> >> >> But on a linux machine it consistently takes about 2.5 hours. The
> >> queries
> >> >> also run slower. JVM memory here is also 2 GB per node.
> >> >>
> >> >> How should I go about analyzing and tuning the linux machine?
> >> >>
> >> >> Thanks
> >>
>


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Erick Erickson
What happens if you run just the SQL query from the
windows box and from the linux box? Is there any chance
that somehow the connection from the linux box is
just slower?

Best,
Erick

On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
 wrote:
> What are you importing from? Is the source and Solr machine collocated
> in the same fashion on dev and prod?
>
> Have you tried running this on a Linux dev machine? Perhaps your prod
> machine is loaded much more than a dev.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 February 2016 at 13:21, Troy Edwards  wrote:
>> We have a windows development machine on which the Data Import Handler
>> consistently takes about 40 mins to finish. Queries run fine. JVM memory is
>> 2 GB per node.
>>
>> But on a linux machine it consistently takes about 2.5 hours. The queries
>> also run slower. JVM memory here is also 2 GB per node.
>>
>> How should I go about analyzing and tuning the linux machine?
>>
>> Thanks


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Alexandre Rafalovitch
What are you importing from? Is the source and Solr machine collocated
in the same fashion on dev and prod?

Have you tried running this on a Linux dev machine? Perhaps your prod
machine is loaded much more than a dev.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 2 February 2016 at 13:21, Troy Edwards  wrote:
> We have a windows development machine on which the Data Import Handler
> consistently takes about 40 mins to finish. Queries run fine. JVM memory is
> 2 GB per node.
>
> But on a linux machine it consistently takes about 2.5 hours. The queries
> also run slower. JVM memory here is also 2 GB per node.
>
> How should I go about analyzing and tuning the linux machine?
>
> Thanks


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Erick Erickson
The first thing I'd be looking at is how the JDBC batch size compares
between the two machines.

AFAIK, Solr shouldn't notice the difference, and since a large majority
of the development is done on Linux-based systems, I'd be surprised if
this was worse than Windows, which would lead me to the one thing that
is definitely different between the two: Your JDBC driver and its settings.
At least that's where I'd look first.

If nothing immediate pops up, I'd probably write a small driver program to
just access the database from the two machines and process your 10M
records _without_ sending them to Solr and see what the comparison is.

You can also forgo DIH and do a simple import program via SolrJ. The
advantage here is that the comparison I'm talking about above is
really simple, just comment out the call that sends data to Solr. Here's an
example...

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick
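
In the same spirit, a bare-bones sketch of such a comparison driver; the JDBC
URL, view name, column names and collection name below are placeholders, not
details from this thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class JdbcSolrDriver {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
             SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {

            Statement stmt = con.createStatement();
            stmt.setFetchSize(1000);              // fetch in batches (driver-dependent hint)
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM item_view");

            long count = 0;
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("name", rs.getString("name"));
                solr.add(doc);                    // comment this out to time the JDBC side alone
                if (++count % 100000 == 0) {
                    System.out.println(count + " rows so far");
                }
            }
            solr.commit();
        }
    }
}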

On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards  wrote:
> Sorry, I should explain further. The Data Import Handler had been running
> for a while retrieving only about 15 records from the database. Both in
> development env (windows) and linux machine it took about 3 mins.
>
> The query has been changed and we are now trying to retrieve about 10
> million records. We do expect the time to increase.
>
> With the new query the time taken on windows machine is consistently around
> 40 mins. While the DIH is running queries slow down i.e. a query that
> typically took 60 msec takes 100 msec.
>
> The time taken on linux machine is consistently around 2.5 hours. While the
> DIH is running queries take about 200  to 400 msec.
>
> Thanks!
>
> On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
> wrote:
>
>> What happens if you run just the SQL query from the
>> windows box and from the linux box? Is there any chance
>> that somehow the connection from the linux box is
>> just slower?
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>>  wrote:
>> > What are you importing from? Is the source and Solr machine collocated
>> > in the same fashion on dev and prod?
>> >
>> > Have you tried running this on a Linux dev machine? Perhaps your prod
>> > machine is loaded much more than a dev.
>> >
>> > Regards,
>> >Alex.
>> > 
>> > Newsletter and resources for Solr beginners and intermediates:
>> > http://www.solr-start.com/
>> >
>> >
>> > On 2 February 2016 at 13:21, Troy Edwards 
>> wrote:
>> >> We have a windows development machine on which the Data Import Handler
>> >> consistently takes about 40 mins to finish. Queries run fine. JVM
>> memory is
>> >> 2 GB per node.
>> >>
>> >> But on a linux machine it consistently takes about 2.5 hours. The
>> queries
>> >> also run slower. JVM memory here is also 2 GB per node.
>> >>
>> >> How should I go about analyzing and tuning the linux machine?
>> >>
>> >> Thanks
>>


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Troy Edwards
Sorry, I should explain further. The Data Import Handler had been running
for a while, retrieving only about 15 records from the database. On both the
development (Windows) machine and the Linux machine it took about 3 mins.

The query has been changed and we are now trying to retrieve about 10
million records. We do expect the time to increase.

With the new query, the time taken on the Windows machine is consistently around
40 mins. While the DIH is running, queries slow down, i.e. a query that
typically took 60 msec takes 100 msec.

The time taken on the Linux machine is consistently around 2.5 hours. While the
DIH is running, queries take about 200 to 400 msec.

Thanks!

On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
wrote:

> What happens if you run just the SQL query from the
> windows box and from the linux box? Is there any chance
> that somehow the connection from the linux box is
> just slower?
>
> Best,
> Erick
>
> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>  wrote:
> > What are you importing from? Is the source and Solr machine collocated
> > in the same fashion on dev and prod?
> >
> > Have you tried running this on a Linux dev machine? Perhaps your prod
> > machine is loaded much more than a dev.
> >
> > Regards,
> >Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 2 February 2016 at 13:21, Troy Edwards 
> wrote:
> >> We have a windows development machine on which the Data Import Handler
> >> consistently takes about 40 mins to finish. Queries run fine. JVM
> memory is
> >> 2 GB per node.
> >>
> >> But on a linux machine it consistently takes about 2.5 hours. The
> queries
> >> also run slower. JVM memory here is also 2 GB per node.
> >>
> >> How should I go about analyzing and tuning the linux machine?
> >>
> >> Thanks
>


Re: Data import issue

2015-12-25 Thread Alexandre Rafalovitch
Do you have a full stack trace? A bit hard to help without that.
On 24 Dec 2015 2:54 pm, "Midas A"  wrote:

> Hi ,
>
>
> Please provide the steps to resolve the issue.
>
>
> com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:
> Communications link failure during rollback(). Transaction resolution
> unknown.
>


RE: Data Import Handler - Multivalued fields - splitBy

2015-12-04 Thread Dyer, James
Brian,

Be sure to have...

transformer="RegexTransformer"

...in your <entity> tag.  It’s the RegexTransformer class that looks for 
"splitBy".

See https://wiki.apache.org/solr/DataImportHandler#RegexTransformer for more 
information.

James Dyer
Ingram Content Group
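
For comparison, when documents are pushed with SolrJ instead of DIH, the equivalent
of splitBy is simply splitting the raw value on the client before adding it to the
multivalued field. A minimal sketch (collection URL and field names are illustrative,
not from this thread):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Client-side equivalent of a pipe-delimited splitBy: add one value per token
// to a multivalued field.
public class SplitOnPipe {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            String raw = "red|green|blue";
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            for (String value : raw.split("\\|")) {
                doc.addField("tags", value);   // "tags" must be multiValued in the schema
            }
            solr.add(doc);
            solr.commit();
        }
    }
}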


-Original Message-
From: Brian Narsi [mailto:bnars...@gmail.com] 
Sent: Friday, December 04, 2015 3:10 PM
To: solr-user@lucene.apache.org
Subject: Data Import Handler - Multivalued fields - splitBy

I have the following:





I believe I had the following working (splitting on pipe delimited)



But it does not work now.



In fact, now I have even tried



But I cannot get the values to split into an array.

Any thoughts/suggestions what may be wrong?

Thanks,


Re: Data Import Handler - Multivalued fields - splitBy

2015-12-04 Thread Brian Narsi
That was it! Thank you!

On Fri, Dec 4, 2015 at 3:13 PM, Dyer, James 
wrote:

> Brian,
>
> Be sure to have...
>
> transformer="RegexTransformer"
>
> ...in your  tag.  It’s the RegexTransformer class that looks
> for "splitBy".
>
> See https://wiki.apache.org/solr/DataImportHandler#RegexTransformer for
> more information.
>
> James Dyer
> Ingram Content Group
>
>
> -Original Message-
> From: Brian Narsi [mailto:bnars...@gmail.com]
> Sent: Friday, December 04, 2015 3:10 PM
> To: solr-user@lucene.apache.org
> Subject: Data Import Handler - Multivalued fields - splitBy
>
> I have the following:
>
>  required="true" multiValued="true" />
>
>
>
> I believe I had the following working (splitting on pipe delimited)
>
> 
>
> But it does not work now.
>
>
>
> In-fact now I have even tried
>
> 
>
> But I cannot get the values to split into an array.
>
> Any thoughts/suggestions what may be wrong?
>
> Thanks,
>


Re: Data Import Handler / Backup indexes

2015-11-23 Thread Jeff Wartes

The backup/restore approach in SOLR-5750 and in solrcloud_manager is
really just that - copying the index files.
On backup, it saves your index directories, and on restore, it puts them
in the data dir, moves a pointer for the current index dir, and opens a
new searcher. Both are mostly just wrappers on the proper Solr
replication-handler commands, since Solr already has some lower level APIs
for these operations.

There is a shared filesystem requirement for backup/restore though, which
is to account for the fact that when you make the backup you don’t know
which nodes will need to restore a given shard.

The commands would look something like:

java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c collection1 --dir 

Or you could restore into a new collection:
java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c newcollection --dir 
--restoreFrom collection1

If you don’t have a shared filesystem, you can still do the copy
collection route:
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

java -jar solrcloud_manager-assembly-1.4.0.jar copycollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

This creates a new collection with the same settings, (clonecollection)
and triggers a one-shot “replication” into it. (copycollection) Again,
this is just framework for the proper (largely undocumented) Solr API
commands, to work around the lack of a convenient collections-level API
command.

One nice thing about using copy collection is that it can be used to keep
a backup collection up to date, only copying if necessary. Honestly
though, I don’t have as much experience with this use case as KNitin does
in solrcloud-haft, which is why I suggest using an empty collection in the
README right now. If you try that use case with solrcloud_manager, I’d be
interested in your experience. It should work, but you’ll need to disable
the verification with --skipCheck and check manually.


Having said all that though, yes, with your simple use case and small
collection, you can do everything you want with just cp. The easiest way
would be to make a backup copy of your index dir. If you need to restore,
shut down solr, nuke your index dir, and copy the backup in there. You’d
probably need to do this on all nodes at once though, to prevent a
non-leader from coming up and re-syncing with a piece of the index you
hadn’t restored yet.




On 11/21/15, 10:12 PM, "Brian Narsi"  wrote:

>What are the caveats regarding the copy of a collection?
>
>At this time DIH takes only about 10 minutes. So in case of accidental
>delete we can just re-run the DIH. The reason I am thinking about backup
>is
>just in case records are deleted accidentally and the DIH cannot be run
>because the database is unavailable.
>
>Our collection is simple: 2 nodes - 1 collection - 2 shards with 2
>replicas
>each
>
>So a simple copy (cp command) for both the nodes/shards might work for us?
>How do I restore the data back?
>
>
>
>On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes 
>wrote:
>
>>
>> https://github.com/whitepages/solrcloud_manager supports 5.x, and I
>>added
>> some backup/restore functionality similar to SOLR-5750 in the last
>> release.
>> Like SOLR-5750, this backup strategy requires a shared filesystem, but
>> note that unlike SOLR-5750, I haven’t yet added any backup functionality
>> for the contents of ZK. I’m currently working on some parts of that.
>>
>>
>> Making a copy of a collection is supported too, with some caveats.
>>
>>
>> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>>
>> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>> >
>> >
>> >
>> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>> >
>> >> afaik Data import handler does not offer backups. You can try using
>>the
>> >> replication handler to backup data as you wish to any custom end
>>point.
>> >>
>> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>> >>This
>> >> helps backup solr indices across clusters.
>> >>
>> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
>> wrote:
>> >>
>> >> > I am using Data Import Handler to retrieve data from a database
>>with
>> >> >
>> >> > full-import, clean = true, commit = true and optimize = true
>> >> >
>> >> > This has always worked correctly without any errors.
>> >> >
>> >> > But just to be on the safe side, I am 

Re: Data Import Handler / Backup indexes

2015-11-22 Thread Erick Erickson
These are just Lucene indexes. There's the Cloud backup and restore
that is being worked on.

But if the index is static (i.e. not being indexed to), simply copying
the data/index directory (well, actually the whole data directory and its
subdirs) is enough to back it up. Copying the index directory back
(I'd have Solr shut down when copying back) would restore the index.

Best,
Erick
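
On a machine with shell access that copy is a one-step recursive copy of the data
directory. As a programmatic sketch of the same idea (the source and target paths
are placeholders, and the index must not be receiving writes while this runs):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Recursively copy a core's data directory (index plus subdirs) to a backup location.
public class IndexDirBackup {
    public static void main(String[] args) throws IOException {
        final Path source = Paths.get("/var/solr/data/collection1/data");
        final Path target = Paths.get("/backups/collection1-data");

        Files.walkFileTree(source, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                Files.createDirectories(target.resolve(source.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.copy(file, target.resolve(source.relativize(file)),
                        StandardCopyOption.COPY_ATTRIBUTES);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}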

On Sat, Nov 21, 2015 at 10:12 PM, Brian Narsi  wrote:
> What are the caveats regarding the copy of a collection?
>
> At this time DIH takes only about 10 minutes. So in case of accidental
> delete we can just re-run the DIH. The reason I am thinking about backup is
> just in case records are deleted accidentally and the DIH cannot be run
> because the database is unavailable.
>
> Our collection is simple: 2 nodes - 1 collection - 2 shards with 2 replicas
> each
>
> So a simple copy (cp command) for both the nodes/shards might work for us?
> How do I restore the data back?
>
>
>
> On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes  wrote:
>
>>
>> https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
>> some backup/restore functionality similar to SOLR-5750 in the last
>> release.
>> Like SOLR-5750, this backup strategy requires a shared filesystem, but
>> note that unlike SOLR-5750, I haven’t yet added any backup functionality
>> for the contents of ZK. I’m currently working on some parts of that.
>>
>>
>> Making a copy of a collection is supported too, with some caveats.
>>
>>
>> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>>
>> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>> >
>> >
>> >
>> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>> >
>> >> afaik Data import handler does not offer backups. You can try using the
>> >> replication handler to backup data as you wish to any custom end point.
>> >>
>> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>> >>This
>> >> helps backup solr indices across clusters.
>> >>
>> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
>> wrote:
>> >>
>> >> > I am using Data Import Handler to retrieve data from a database with
>> >> >
>> >> > full-import, clean = true, commit = true and optimize = true
>> >> >
>> >> > This has always worked correctly without any errors.
>> >> >
>> >> > But just to be on the safe side, I am thinking that we should do a
>> >>backup
>> >> > before initiating Data Import Handler. And just in case something
>> >>happens
>> >> > restore the backup.
>> >> >
>> >> > Can backup be done automatically (before initiating Data Import
>> >>Handler)?
>> >> >
>> >> > Thanks
>> >> >
>> >>
>>
>>


Re: Data Import Handler / Backup indexes

2015-11-21 Thread Brian Narsi
What are the caveats regarding the copy of a collection?

At this time DIH takes only about 10 minutes. So in case of accidental
delete we can just re-run the DIH. The reason I am thinking about backup is
just in case records are deleted accidentally and the DIH cannot be run
because the database is unavailable.

Our collection is simple: 2 nodes - 1 collection - 2 shards with 2 replicas
each

So a simple copy (cp command) for both the nodes/shards might work for us?
How do I restore the data back?



On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes  wrote:

>
> https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
> some backup/restore functionality similar to SOLR-5750 in the last
> release.
> Like SOLR-5750, this backup strategy requires a shared filesystem, but
> note that unlike SOLR-5750, I haven’t yet added any backup functionality
> for the contents of ZK. I’m currently working on some parts of that.
>
>
> Making a copy of a collection is supported too, with some caveats.
>
>
> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>
> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
> >
> >
> >
> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
> >
> >> afaik Data import handler does not offer backups. You can try using the
> >> replication handler to backup data as you wish to any custom end point.
> >>
> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
> >>This
> >> helps backup solr indices across clusters.
> >>
> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
> wrote:
> >>
> >> > I am using Data Import Handler to retrieve data from a database with
> >> >
> >> > full-import, clean = true, commit = true and optimize = true
> >> >
> >> > This has always worked correctly without any errors.
> >> >
> >> > But just to be on the safe side, I am thinking that we should do a
> >>backup
> >> > before initiating Data Import Handler. And just in case something
> >>happens
> >> > restore the backup.
> >> >
> >> > Can backup be done automatically (before initiating Data Import
> >>Handler)?
> >> >
> >> > Thanks
> >> >
> >>
>
>


Re: Data Import Handler / Backup indexes

2015-11-17 Thread Brian Narsi
Sorry I forgot to mention that we are using SolrCloud 5.1.0.



On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:

> afaik Data import handler does not offer backups. You can try using the
> replication handler to backup data as you wish to any custom end point.
>
> You can also try out : https://github.com/bloomreach/solrcloud-haft.  This
> helps backup solr indices across clusters.
>
> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:
>
> > I am using Data Import Handler to retrieve data from a database with
> >
> > full-import, clean = true, commit = true and optimize = true
> >
> > This has always worked correctly without any errors.
> >
> > But just to be on the safe side, I am thinking that we should do a backup
> > before initiating Data Import Handler. And just in case something happens
> > restore the backup.
> >
> > Can backup be done automatically (before initiating Data Import Handler)?
> >
> > Thanks
> >
>


Re: Data Import Handler / Backup indexes

2015-11-17 Thread KNitin
afaik Data import handler does not offer backups. You can try using the
replication handler to backup data as you wish to any custom end point.

You can also try out : https://github.com/bloomreach/solrcloud-haft.  This
helps backup solr indices across clusters.

On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:

> I am using Data Import Handler to retrieve data from a database with
>
> full-import, clean = true, commit = true and optimize = true
>
> This has always worked correctly without any errors.
>
> But just to be on the safe side, I am thinking that we should do a backup
> before initiating Data Import Handler. And just in case something happens
> restore the backup.
>
> Can backup be done automatically (before initiating Data Import Handler)?
>
> Thanks
>


Re: Data Import Handler / Backup indexes

2015-11-17 Thread Jeff Wartes

https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
some backup/restore functionality similar to SOLR-5750 in the last
release. 
Like SOLR-5750, this backup strategy requires a shared filesystem, but
note that unlike SOLR-5750, I haven’t yet added any backup functionality
for the contents of ZK. I’m currently working on some parts of that.


Making a copy of a collection is supported too, with some caveats.


On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:

>Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>
>
>
>On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>
>> afaik Data import handler does not offer backups. You can try using the
>> replication handler to backup data as you wish to any custom end point.
>>
>> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>>This
>> helps backup solr indices across clusters.
>>
>> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:
>>
>> > I am using Data Import Handler to retrieve data from a database with
>> >
>> > full-import, clean = true, commit = true and optimize = true
>> >
>> > This has always worked correctly without any errors.
>> >
>> > But just to be on the safe side, I am thinking that we should do a
>>backup
>> > before initiating Data Import Handler. And just in case something
>>happens
>> > restore the backup.
>> >
>> > Can backup be done automatically (before initiating Data Import
>>Handler)?
>> >
>> > Thanks
>> >
>>



Re: data import extremely slow

2015-11-07 Thread Yangrui Guo
I just realized that not everything was ok. Three child entities were not
imported. Had set batchSize to -1 but again solr was stuck :(

On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  wrote:

> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and used
> WHERE clause instead. Everything works fine now.
>
> Yangrui
>
>
> On Friday, November 6, 2015, Shawn Heisey  wrote:
>
>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
>> > >
>> There's a good chance that JDBC is trying to read the entire result set
>> (all three million rows) into memory before sending any of that info to
>> Solr.
>>
>> Set the batchSize to -1 for MySQL so that it will stream results to Solr
>> as soon as they are available, and not wait for all of them.  Here's
>> more info on the situation, which frequently causes OutOfMemory problems
>> for users:
>>
>>
>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>> 
>>
>>
>> Thanks,
>> Shawn
>>
>>


Re: data import extremely slow

2015-11-07 Thread Yangrui Guo
I found multiple strange things besides the slowness. I performed count(*)
in MySQL but only one-fifth of the records were imported. Also, sometimes the
DataImportHandler either doesn't import at all or only imports a portion
of the table. How can I debug the importer?

On Saturday, November 7, 2015, Yangrui Guo  wrote:

> I just realized that not everything was ok. Three child entities were not
> imported. Had set batchSize to -1 but again solr was stuck :(
>
> On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  > wrote:
>
>> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and
>> used WHERE clause instead. Everything works fine now.
>>
>> Yangrui
>>
>>
>> On Friday, November 6, 2015, Shawn Heisey > > wrote:
>>
>>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
>>> > >>
>>> There's a good chance that JDBC is trying to read the entire result set
>>> (all three million rows) into memory before sending any of that info to
>>> Solr.
>>>
>>> Set the batchSize to -1 for MySQL so that it will stream results to Solr
>>> as soon as they are available, and not wait for all of them.  Here's
>>> more info on the situation, which frequently causes OutOfMemory problems
>>> for users:
>>>
>>>
>>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>>> 
>>>
>>>
>>> Thanks,
>>> Shawn
>>>
>>>
>


Re: data import extremely slow

2015-11-07 Thread Alexandre Rafalovitch
Have you thought of just using Solr. Might be faster than troubleshooting
DIH for complex scenarios.
On 7 Nov 2015 3:39 pm, "Yangrui Guo"  wrote:

> I found multiple strange things besides the slowness. I performed count(*)
> in MySQL but only one-fifth of the records were imported. Also sometimes
> dataimporthandler  either doesn't import at all or only imports a portion
> of the table. How can I debug the importer?
>
> On Saturday, November 7, 2015, Yangrui Guo  wrote:
>
> > I just realized that not everything was ok. Three child entities were not
> > imported. Had set batchSize to -1 but again solr was stuck :(
> >
> > On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  > > wrote:
> >
> >> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and
> >> used WHERE clause instead. Everything works fine now.
> >>
> >> Yangrui
> >>
> >>
> >> On Friday, November 6, 2015, Shawn Heisey  >> > wrote:
> >>
> >>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
> >>> >  >>>
> >>> There's a good chance that JDBC is trying to read the entire result set
> >>> (all three million rows) into memory before sending any of that info to
> >>> Solr.
> >>>
> >>> Set the batchSize to -1 for MySQL so that it will stream results to
> Solr
> >>> as soon as they are available, and not wait for all of them.  Here's
> >>> more info on the situation, which frequently causes OutOfMemory
> problems
> >>> for users:
> >>>
> >>>
> >>>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
> >>> <
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
> >
> >>>
> >>>
> >>> Thanks,
> >>> Shawn
> >>>
> >>>
> >
>


Re: Data import handler not indexing all data

2015-11-07 Thread Alexandre Rafalovitch
Just to get the paranoid option out of the way, is 'id' actually the
column that has unique ids in your database? If you do "select
distinct id from imdb.director" - how many items do you get?

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 18:21, Yangrui Guo  wrote:
> Hello
>
> I'm being troubled by solr's data import handler. My solr version is 5.3.1
> and mysql is 5.5. I tried to index imdb data but found solr only partially
> indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the query
> result was 1636549. However DIH only fetched and indexed 287041 rows. I
> didn't see any error in the log. Why was this happening?
>
> Here's my data-config.xml
>
> 
>  url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> 
> 
> 
> 
> 
> 
> 
>
> Yangrui Guo


Re: Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Hi thanks for the continued support. I'm really worried as my project
deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
distinct in the beginning of the query because IMDB doesn't have a table
for cast & crew. It puts movie and person and their roles into one huge
table 'cast_info'. Hence there are multiple rows for a director, one row
per his movie.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> Just to get the paranoid option out of the way, is 'id' actually the
> column that has unique ids in your database? If you do "select
> distinct id from imdb.director" - how many items do you get?
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 18:21, Yangrui Guo  > wrote:
> > Hello
> >
> > I'm being troubled by solr's data import handler. My solr version is
> 5.3.1
> > and mysql is 5.5. I tried to index imdb data but found solr only
> partially
> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> query
> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
> > didn't see any error in the log. Why was this happening?
> >
> > Here's my data-config.xml
> >
> > 
> >  > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >
> > Yangrui Guo
>


Re: data import extremely slow

2015-11-07 Thread Alexandre Rafalovitch
LoL. Of course I meant SolrJ. I had to misspell the most important
word of the hundreds I wrote in this thread :-)

Thank you Erick for the correction.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 19:18, Erick Erickson  wrote:
> Alexandre, did you mean SolrJ?
>
> Here's a way to get started
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Sat, Nov 7, 2015 at 2:22 PM, Alexandre Rafalovitch
>  wrote:
>> Have you thought of just using Solr. Might be faster than troubleshooting
>> DIH for complex scenarios.
>> On 7 Nov 2015 3:39 pm, "Yangrui Guo"  wrote:
>>
>>> I found multiple strange things besides the slowness. I performed count(*)
>>> in MySQL but only one-fifth of the records were imported. Also sometimes
>>> dataimporthandler  either doesn't import at all or only imports a portion
>>> of the table. How can I debug the importer?
>>>
>>> On Saturday, November 7, 2015, Yangrui Guo  wrote:
>>>
>>> > I just realized that not everything was ok. Three child entities were not
>>> > imported. Had set batchSize to -1 but again solr was stuck :(
>>> >
>>> > On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo >> > > wrote:
>>> >
>>> >> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and
>>> >> used WHERE clause instead. Everything works fine now.
>>> >>
>>> >> Yangrui
>>> >>
>>> >>
>>> >> On Friday, November 6, 2015, Shawn Heisey >> >> > wrote:
>>> >>
>>> >>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
>>> >>> > >> >>>
>>> >>> There's a good chance that JDBC is trying to read the entire result set
>>> >>> (all three million rows) into memory before sending any of that info to
>>> >>> Solr.
>>> >>>
>>> >>> Set the batchSize to -1 for MySQL so that it will stream results to
>>> Solr
>>> >>> as soon as they are available, and not wait for all of them.  Here's
>>> >>> more info on the situation, which frequently causes OutOfMemory
>>> problems
>>> >>> for users:
>>> >>>
>>> >>>
>>> >>>
>>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>>> >>> <
>>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>>> >
>>> >>>
>>> >>>
>>> >>> Thanks,
>>> >>> Shawn
>>> >>>
>>> >>>
>>> >
>>>


Re: Data import handler not indexing all data

2015-11-07 Thread Alexandre Rafalovitch
That's not quite the question I asked. Do a distinct on 'id' only in
the database itself. If your ids are NOT unique, you need to create a
composite or a virtual id for Solr, because whatever your schema
says is the uniqueKey will be used to deduplicate the
documents. If you have 10 documents with the same id value, only one
will end up in the final Solr index.

I am not saying that's where the problem is; DIH is fiddly. But just
get that out of the way.

If that's not the case, you may need to isolate which documents are
failing. The easiest way to do so is probably to index a smaller
subset of records, say 1000. Pick a condition in your SQL to do so
(e.g. id value range). Then, see how many made it into Solr. If not
all 1000, export the list of IDs from SQL, then a list of IDs from
Solr (use CSV format and just fl=id). Sort both, compare, see what ids
are missing. Look what is strange about those documents as opposed to
the documents that did make it into Solr. Try to push one of those
missing documents explicitly into Solr by either modifying SQL query
in DIH or as CSV or whatever.

Good luck,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
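
A minimal sketch of that id comparison, assuming a small slice (say ids 1 to 1000)
has been imported into an otherwise empty collection; the JDBC URL, table name, id
range, and collection URL are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

// Compare the ids in a small database slice with the ids that actually made it
// into Solr, and print the ones that are missing.
public class MissingIdCheck {
    public static void main(String[] args) throws Exception {
        Set<String> dbIds = new HashSet<String>();
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/imdb", "root", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT id FROM director WHERE id BETWEEN 1 AND 1000")) {
            while (rs.next()) {
                dbIds.add(rs.getString("id"));
            }
        }

        Set<String> solrIds = new HashSet<String>();
        try (HttpSolrClient solr =
                 new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            SolrQuery q = new SolrQuery("*:*");
            q.setFields("id");
            q.setRows(2000);   // large enough to cover the whole test slice
            for (SolrDocument doc : solr.query(q).getResults()) {
                solrIds.add(String.valueOf(doc.getFieldValue("id")));
            }
        }

        dbIds.removeAll(solrIds);
        System.out.println("Ids missing from Solr: " + dbIds);
    }
}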


On 7 November 2015 at 19:07, Yangrui Guo  wrote:
> Hi thanks for the continued support. I'm really worried as my project
> deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> distinct in the beginning of the query because IMDB doesn't have a table
> for cast & crew. It puts movie and person and their roles into one huge
> table 'cast_info'. Hence there are multiple rows for a director, one row
> per his movie.
>
> On Saturday, November 7, 2015, Alexandre Rafalovitch 
> wrote:
>
>> Just to get the paranoid option out of the way, is 'id' actually the
>> column that has unique ids in your database? If you do "select
>> distinct id from imdb.director" - how many items do you get?
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 7 November 2015 at 18:21, Yangrui Guo > > wrote:
>> > Hello
>> >
>> > I'm being troubled by solr's data import handler. My solr version is
>> 5.3.1
>> > and mysql is 5.5. I tried to index imdb data but found solr only
>> partially
>> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
>> query
>> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
>> > didn't see any error in the log. Why was this happening?
>> >
>> > Here's my data-config.xml
>> >
>> > 
>> > > > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> >
>> > Yangrui Guo
>>


Re: data import extremely slow

2015-11-07 Thread Erick Erickson
Alexandre, did you mean SolrJ?

Here's a way to get started
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Sat, Nov 7, 2015 at 2:22 PM, Alexandre Rafalovitch
 wrote:
> Have you thought of just using Solr. Might be faster than troubleshooting
> DIH for complex scenarios.
> On 7 Nov 2015 3:39 pm, "Yangrui Guo"  wrote:
>
>> I found multiple strange things besides the slowness. I performed count(*)
>> in MySQL but only one-fifth of the records were imported. Also sometimes
>> dataimporthandler  either doesn't import at all or only imports a portion
>> of the table. How can I debug the importer?
>>
>> On Saturday, November 7, 2015, Yangrui Guo  wrote:
>>
>> > I just realized that not everything was ok. Three child entities were not
>> > imported. Had set batchSize to -1 but again solr was stuck :(
>> >
>> > On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo > > > wrote:
>> >
>> >> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and
>> >> used WHERE clause instead. Everything works fine now.
>> >>
>> >> Yangrui
>> >>
>> >>
>> >> On Friday, November 6, 2015, Shawn Heisey > >> > wrote:
>> >>
>> >>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
>> >>> > > >>>
>> >>> There's a good chance that JDBC is trying to read the entire result set
>> >>> (all three million rows) into memory before sending any of that info to
>> >>> Solr.
>> >>>
>> >>> Set the batchSize to -1 for MySQL so that it will stream results to
>> Solr
>> >>> as soon as they are available, and not wait for all of them.  Here's
>> >>> more info on the situation, which frequently causes OutOfMemory
>> problems
>> >>> for users:
>> >>>
>> >>>
>> >>>
>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>> >>> <
>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>> >
>> >>>
>> >>>
>> >>> Thanks,
>> >>> Shawn
>> >>>
>> >>>
>> >
>>


Re: Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Yes, the id is unique. If I only select distinct id, count(id) I get the same
results. However, I found this is more likely a MySQL issue. I created a new
table called director1 and ran the query "insert into director1 select * from
director"; I got only 287041 rows inserted, which was the same as Solr. I
don't know why the same query is producing two different results.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> That's not quite the question I asked. Do a distinct on 'id' only in
> the database itself. If your ids are NOT unique, you need to create a
> composite or a virtual id for Solr. Because whatever your
> solrconfig.xml say is uniqueKey will be used to deduplicate the
> documents. If you have 10 documents with the same id value, only one
> will be in the final Solr.
>
> I am not saying that's where the problem is, DIH is fiddly. But just
> get that out of the way.
>
> If that's not the case, you may need to isolate which documents are
> failing. The easiest way to do so is probably to index a smaller
> subset of records, say 1000. Pick a condition in your SQL to do so
> (e.g. id value range). Then, see how many made it into Solr. If not
> all 1000, export the list of IDs from SQL, then a list of IDs from
> Solr (use CSV format and just fl=id). Sort both, compare, see what ids
> are missing. Look what is strange about those documents as opposed to
> the documents that did make it into Solr. Try to push one of those
> missing documents explicitly into Solr by either modifying SQL query
> in DIH or as CSV or whatever.
>
> Good luck,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:07, Yangrui Guo  > wrote:
> > Hi thanks for the continued support. I'm really worried as my project
> > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> > distinct in the beginning of the query because IMDB doesn't have a table
> > for cast & crew. It puts movie and person and their roles into one huge
> > table 'cast_info'. Hence there are multiple rows for a director, one row
> > per his movie.
> >
> > On Saturday, November 7, 2015, Alexandre Rafalovitch  >
> > wrote:
> >
> >> Just to get the paranoid option out of the way, is 'id' actually the
> >> column that has unique ids in your database? If you do "select
> >> distinct id from imdb.director" - how many items do you get?
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 7 November 2015 at 18:21, Yangrui Guo  
> >> > wrote:
> >> > Hello
> >> >
> >> > I'm being troubled by solr's data import handler. My solr version is
> >> 5.3.1
> >> > and mysql is 5.5. I tried to index imdb data but found solr only
> >> partially
> >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> >> query
> >> > result was 1636549. However DIH only fetched and indexed 287041 rows.
> I
> >> > didn't see any error in the log. Why was this happening?
> >> >
> >> > Here's my data-config.xml
> >> >
> >> > 
> >> >  >> > url="jdbc:mysql://localhost:3306/imdb" user="root"
> password="password" />
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> >
> >> > Yangrui Guo
> >>
>


Re: data import extremely slow

2015-11-07 Thread Yangrui Guo
Thanks for your kind reply. I tried both using SqlEntityProcessor and setting
batchSize to -1, but didn't get any improvement. It'd be helpful if I could see
the data import handler's log.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> LoL. Of course I meant SolrJ. I had to misspell the most important
> word of the hundreds I wrote in this thread :-)
>
> Thank you Erick for the correction.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:18, Erick Erickson  > wrote:
> > Alexandre, did you mean SolrJ?
> >
> > Here's a way to get started
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Best,
> > Erick
> >
> > On Sat, Nov 7, 2015 at 2:22 PM, Alexandre Rafalovitch
> > > wrote:
> >> Have you thought of just using Solr. Might be faster than
> troubleshooting
> >> DIH for complex scenarios.
> >> On 7 Nov 2015 3:39 pm, "Yangrui Guo"  > wrote:
> >>
> >>> I found multiple strange things besides the slowness. I performed
> count(*)
> >>> in MySQL but only one-fifth of the records were imported. Also
> sometimes
> >>> dataimporthandler  either doesn't import at all or only imports a
> portion
> >>> of the table. How can I debug the importer?
> >>>
> >>> On Saturday, November 7, 2015, Yangrui Guo  > wrote:
> >>>
> >>> > I just realized that not everything was ok. Three child entities
> were not
> >>> > imported. Had set batchSize to -1 but again solr was stuck :(
> >>> >
> >>> > On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  
> >>> > ');>>
> wrote:
> >>> >
> >>> >> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey
> and
> >>> >> used WHERE clause instead. Everything works fine now.
> >>> >>
> >>> >> Yangrui
> >>> >>
> >>> >>
> >>> >> On Friday, November 6, 2015, Shawn Heisey  
> >>> >> ');>>
> wrote:
> >>> >>
> >>> >>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
> >>> >>> >  >>> >>>
> >>> >>> There's a good chance that JDBC is trying to read the entire
> result set
> >>> >>> (all three million rows) into memory before sending any of that
> info to
> >>> >>> Solr.
> >>> >>>
> >>> >>> Set the batchSize to -1 for MySQL so that it will stream results to
> >>> Solr
> >>> >>> as soon as they are available, and not wait for all of them.
> Here's
> >>> >>> more info on the situation, which frequently causes OutOfMemory
> >>> problems
> >>> >>> for users:
> >>> >>>
> >>> >>>
> >>> >>>
> >>>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
> >>> >>> <
> >>>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
> >>> >
> >>> >>>
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Shawn
> >>> >>>
> >>> >>>
> >>> >
> >>>
>

