Shawn, I'm able to read in a 4 MB file using SEP, so I think that rules out
the POST buffer as the issue. Thanks for suggesting I test this. The
full file is over a gigabyte.

Lance, I'm actually pointing SEP at a static file (I simply named the file
"select" and put it on a web server). SEP thinks it's a large Solr
response, which it was, though now it's just static XML. It works well
until I hit the memory limit of the new Solr instance.

I can't query the old Solr from the new one because they're on two
different networks. I can't copy the index files because I only want a
subset of the data (identified with a query and dumped to XML; all fields
of interest were stored). To further complicate things, the old Solr is
1.4. I was hoping to use the result XML format to back up the old instance
and use DIH's SolrEntityProcessor to import into the new dev Solr 4.x. It's
promising as a simple and repeatable migration process, except that SEP
fails on largish files.

It seems my options are:
1) use XPathEntityProcessor and identify each field (there are many
fields);
2) write a small script to act as a proxy to the XML file, accepting the
start and rows parameters from SEP's iterative calls and returning just a
subset of the docs;
3) write a script to process the XML and push it to Solr, bypassing DIH;
4) use XSLT to transform the result XML into an update message and use
XPathEntityProcessor with useSolrAddSchema=true and streaming.
The last option seems like the most elegant and reusable approach, though
I'm not certain it'll work.
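
For option 4, a minimal XSLT sketch might look something like the
following. It assumes the standard wt=xml response layout (str/int/arr
elements under result/doc); I haven't verified it against 1.4 output, and
the field-type list is likely incomplete:

```xml
<?xml version="1.0"?>
<!-- Sketch: turn a Solr XML query response into an <add> update message. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/response">
    <add>
      <xsl:for-each select="result/doc">
        <doc>
          <!-- single-valued fields -->
          <xsl:for-each select="str|int|long|float|double|bool|date">
            <field name="{@name}"><xsl:value-of select="."/></field>
          </xsl:for-each>
          <!-- multi-valued fields: flatten each <arr> entry -->
          <xsl:for-each select="arr/*">
            <field name="{../@name}"><xsl:value-of select="."/></field>
          </xsl:for-each>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>
```

The entity would then point at the stylesheet via XPathEntityProcessor's
xsl attribute, with useSolrAddSchema="true" and stream="true", if that
combination behaves as documented.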

It'd be great if SolrEntityProcessor could stream static files, or if I
could specify the Solr result format while using XPathEntityProcessor
(i.e., a useSolrResultSchema option).
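
In the meantime, option 3 is easy enough to sketch without DIH. This is
just a sketch, assuming Python; the Solr URL and batch handling are
placeholders for my setup:

```python
# Sketch of option 3: stream a Solr XML response from disk and push the
# docs to a new Solr via its XML update handler, without DIH.
import xml.etree.ElementTree as ET
import urllib.request

def docs_from_response(source):
    """Yield one {field: [values]} dict per <doc> in a Solr XML response,
    streaming so the full file is never held in memory."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            doc = {}
            for field in elem:
                name = field.get("name")
                if field.tag == "arr":  # multi-valued field
                    doc.setdefault(name, []).extend(v.text for v in field)
                else:  # str, int, long, float, double, bool, date
                    doc.setdefault(name, []).append(field.text)
            yield doc
            elem.clear()  # free memory as we go

def to_add_xml(docs):
    """Build an <add> update message for a batch of docs."""
    add = ET.Element("add")
    for doc in docs:
        d = ET.SubElement(add, "doc")
        for name, values in doc.items():
            for v in values:
                f = ET.SubElement(d, "field", name=name)
                f.text = v
    return ET.tostring(add)

def push(docs, solr_url="http://localhost:8983/solr/collection1/update"):
    """POST one batch to Solr (URL is a placeholder)."""
    req = urllib.request.Request(
        solr_url, data=to_add_xml(docs),
        headers={"Content-Type": "text/xml"})
    urllib.request.urlopen(req)
```

Batching every few thousand docs and a final commit=true request would
round it out, and the same script would be reusable across data sets.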

Any other ideas?

On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog <goks...@gmail.com> wrote:

> On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>
>> On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>
>>> I have a large solr response in xml format and would like to import it
>>> into
>>> a new solr collection. I'm able to use DIH with solrEntityProcessor, but
>>> only if I first truncate the file to a small subset of the records. I was
>>> hoping to set stream="true" to handle the full file, but I still get an
>>> out
>>> of memory error, so I believe stream does not work with
>>> solrEntityProcessor
>>> (I know the docs only mention the stream option for the
>>> XPathEntityProcessor, but I was hoping solrEntityProcessor just might
>>> have
>>> the same capability).
>>>
>>> Before I open a jira to request stream support for solrEntityProcessor in
>>> DIH, is there an alternate approach for importing large files that are in
>>> the solr results format?
>>> Maybe a way to use xpath to get the values and a transformer to set the
>>> field names? I'm hoping to not have to declare the field names in
>>> dataConfig so I can reuse the process across data sets.
>>>
>> How big is the XML file?  You might be running into a size limit for
>> HTTP POST.
>>
>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>> regardless of what the container config has.  That size defaults to 2MB
>> but is configurable using the formdataUploadLimitInKB setting that you
>> can find in the example solrconfig.xml file, on the requestParsers tag.
>>
>> In Solr 3.x, if you used the included jetty, it had a configured HTTP
>> POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
>> included Jetty that prevented the configuration element from working, so
>> the actual limit was Jetty's default of 200KB.  With other containers
>> and these older versions, you would need to change your container
>> configuration.
>>
>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>>
>> Thanks,
>> Shawn
>>
> The SEP calls out to another Solr and reads. Are you importing data from
> another Solr and cross-connecting it with your uploaded XML?
>
> If the memory errors are a problem with streaming, you could try "piping"
> your uploaded documents through a processor that supports streaming. This
> would then push one document at a time into your processor that calls out
> to Solr and combines records.
>
>
