Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Marek Ščevlík


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Erick Erickson


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Marek Ščevlík
Actually, to be honest, I realized that I only need to trigger the Data
Import Handler from a jar file. In earlier versions this was done via the
SolrServer object. Now I am wondering whether this is OK:

String urlString1 = "http://localhost:8983/solr/";
SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("db","/dataimport");
params.set("command", "full-import");
System.out.println(params.toString());
QueryResponse qresponse1 = solr1.query(params);

System.out.println("response = " + qresponse1);

The output I get from this is: response =
{responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}

There is a core named db, which comes with the examples in the Solr 6.3
package. It is loaded, and from the admin web UI I can operate it and run
the DIH reindex process.

Could this work? I am trying to call DIH while Solr is running. This code
is in a separate jar file that runs alongside the Solr instance.

So far this is not working for me, and I wonder why. Should this work at
all? Perhaps someone else could help out.


Thanks to anyone for any help.
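
A note on the snippet above: the Data Import Handler is an ordinary request
handler registered on a core, so one way to trigger it from a separate jar is
a plain HTTP request to that core's handler path, rather than a query sent to
the Solr root with db as a parameter. The sketch below is not from the thread.
It assumes the example core is named db and that the handler is registered at
/dataimport, as in the message above, and it uses only the JDK. The names
DihTrigger, buildUrl and trigger are made up for illustration. With SolrJ the
likely equivalent is to point the client at the core URL
(http://localhost:8983/solr/db) and send the request to the /dataimport path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper: triggers a DIH command over plain HTTP.
final class DihTrigger {

    // Builds <solrBase>/<core><handler>?key=value&... with URL-encoded parameters.
    static String buildUrl(String solrBase, String core, String handler,
                           Map<String, String> params) {
        StringBuilder url = new StringBuilder(solrBase);
        if (!solrBase.endsWith("/")) {
            url.append('/');
        }
        url.append(core).append(handler);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(separator)
               .append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Sends a GET request and returns the response body as text.
    static String trigger(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(30000);
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("command", "full-import");
        // Assumed values: local Solr, core "db", handler "/dataimport".
        String url = buildUrl("http://localhost:8983/solr/", "db", "/dataimport", params);
        System.out.println(url);
        // Only contacts Solr when explicitly asked to.
        if (args.length > 0 && args[0].equals("--run")) {
            System.out.println(trigger(url));
        }
    }
}
```

As far as I know the import itself runs in the background, so the response to
full-import only confirms that it started; a second request with
command=status should report progress.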



Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-25 Thread Marek Ščevlík
I forgot to mention that I am creating a jar file that runs alongside a
running Solr 6.3 instance. I am hoping to attach to that instance from Java
via the SolrDispatchFilter to get at the cores, so that I can then work with
the data in code.




Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-25 Thread Marek Ščevlík
Hi Daniel. Thanks for the reply. I wonder whether it is still possible, as
of the Solr 6.3 release, to get hold of the running instance of the Jetty
server that is part of the solution. I found some code for previous versions
that captured it like this, after which one could obtain the cores of a
running Solr instance:

SolrDispatchFilter solrDispatchFilter =
    (SolrDispatchFilter) jetty.getDispatchFilter().getFilter();


I tried to implement it this way, but it is not working out very well. I
can't seem to get the Jetty server object for the running instance. I tried
several combinations, but none seemed to work.

Can you perhaps point me in the right direction?

Perhaps you may know more than I do at the moment.


Any help would be great.


Thanks a lot
Regards Marek Scevlik





Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Alexandre Rafalovitch
Is your goal to still index into Solr? It was not clear.

If yes, then it has been discussed quite a bit. The challenge is that
DIH is integrated into the Admin UI, which makes it easier to see the
progress and set some flags. Plus, the required jars are loaded via
solrconfig.xml, just like all other extra libraries. So any contribution
back would need to take that into account.

If you are not ready to face that, it may make sense to look at other
libraries first: Apache Camel, Apache NiFi, Cloudera morphline, etc.
All of them can send data into Solr, though their version support
differs. For example, Camel still seems to need Solr 3.5. Somebody
updating its implementation to Solr 6.3 and contributing that back
to that project would do a lot of good.

Regards,
Alex.





RE: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Davis, Daniel (NIH/NLM) [C]
Marek,

I've wanted to do something like this in the past as well.  However, a rewrite 
that supports the same XML syntax might be better.   There are several problems 
with the design of the Data Import Handler that make it not quite suitable:

- Not designed for Multi-threading
- Bad implementation of XPath

Another issue is that one of the big advantages of Data Import Handler goes 
away at this point, which is that it is hosted within Solr, and has a UI for 
testing within the Solr admin.

A better open-source Java solution might be to connect Solr with Apache Camel - 
http://camel.apache.org/solr.html.

If you are not tied absolutely to pure open-source, and freemium products will 
do, then you might look at Pentaho Spoon and Kettle.   Although Talend is much 
more established in the market, I find Pentaho's XML-based ETL a bit easier to 
integrate as a developer, and unit test and such.   Talend does better when you 
have a full infrastructure set up, but then the attention required to unit 
tests and Git integration seems over the top.

Another powerful way to get things done, depending on what you are indexing, is 
to use LogStash and couple that with Document processing chains.   Many of our 
projects benefit from having a single RDBMS view, perhaps a materialized view, 
that is used for the index.   LogStash does just fine here, pulling from the 
RDBMS and posting each row to Solr.  The hierarchical execution of Data Import 
Handler is very nice, but this can often be handled on the RDBMS side by 
creating a view, maybe using functions to provide some rows.   Many RDBMS 
systems also support federation and the import of XML from files, so that this 
brings XML processing into the picture.

Hoping this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH
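
The "one flat view, post each row to Solr" approach described above does not
depend on Logstash in particular. As a rough sketch, with plain JDK Java
swapped in for Logstash and the JDBC part left out, each row can be turned
into a JSON document and posted to a core's /update handler. The core name db,
the field names, and the names RowPoster, quote, toJson and post are
assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns flat rows into a JSON array and posts it to Solr.
final class RowPoster {

    // Wraps a string in double quotes and escapes what JSON requires.
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    // One JSON object per row; assumes column values are non-null strings.
    static String toJson(List<Map<String, String>> rows) {
        StringBuilder json = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                json.append(',');
            }
            json.append('{');
            boolean first = true;
            for (Map.Entry<String, String> col : rows.get(r).entrySet()) {
                if (!first) {
                    json.append(',');
                }
                json.append(quote(col.getKey())).append(':').append(quote(col.getValue()));
                first = false;
            }
            json.append('}');
        }
        return json.append(']').toString();
    }

    // POSTs the JSON body and returns the HTTP status code.
    static int post(String updateUrl, String json) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");                       // assumed field names
        row.put("title", "A \"quoted\" title");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row);
        String json = toJson(rows);
        System.out.println(json);
        // Only contacts Solr when explicitly asked to.
        if (args.length > 0 && args[0].equals("--run")) {
            System.out.println(post("http://localhost:8983/solr/db/update?commit=true", json));
        }
    }
}
```

In a real pipeline the rows would come from a JDBC query against the view and
be sent in batches, with a commit at the end rather than on every request.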






Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Marek Ščevlík
Hello. My name is Marek Scevlik.

I am currently working for a small company that is interested in
implementing your Solr 6.3 search engine.

We are hoping to take the Data Import Request Handler out of the original
source package, move it into its own project, and build a usable .jar file
from it.

It should then serve as a tool that connects to a remote server and returns
data to our other application, which would then use the returned data.

What do you think? Would anything like this be possible, that is, to
isolate the Data Import Request Handler into its own standalone project?

If we can achieve this, we would not mind sharing this new feature with the
community.

I realize this is a first email and may lead to several hundred more, so to
start with my request is very simple and not very detailed, but I am sure
you realize it may turn out to be quite complex.

So I wonder if anyone will reply.

Thanks a lot for any replies and further info or guidance.

Thanks.

Regards Marek Scevlik