Can you get this data into CSV format? There is a CSV reader in the DIH.
The SEP was not intended to read from files, since there are already better tools that do that.

Lance

On 10/14/2013 04:44 PM, Josh Lincoln wrote:
Shawn, I'm able to read in a 4MB file using SEP, so I think that rules out
the POST buffer being the issue. Thanks for suggesting I test this. The
full file is over a gig.

Lance, I'm actually pointing SEP at a static file (I simply named the file
"select" and put it on a web server). SEP thinks it's a large Solr
response, which it was, though now it's just static XML. It works well
until I hit the memory limit of the new Solr instance.
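
For reference, my data-config looks roughly like this (the host and path
are made up for the example; wt="xml" because the dump came from Solr 1.4):

  <dataConfig>
    <document>
      <!-- "url" normally points at a Solr core; here it points at the
           web server directory holding the static dump, which is saved
           under the name "select" so the /select requests SEP generates
           resolve to the file. -->
      <entity name="sep"
              processor="SolrEntityProcessor"
              url="http://oldhost/dump"
              query="*:*"
              wt="xml"
              rows="50"/>
    </document>
  </dataConfig>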

I can't query the old Solr from the new one because they're on two different
networks. I can't copy the index files because I only want a subset of the
data (identified with a query and dumped to XML...all fields of interest
were stored). To further complicate things, the old Solr is 1.4. I was
hoping to use the result XML format to back up the old instance, and DIH
SEP to import into the new dev Solr 4.x. It's promising as a simple and
repeatable migration process, except that SEP fails on largish files.

It seems my options are:
1) use the XPathEntityProcessor and identify each field (there are many
fields);
2) write a small script to act as a proxy to the XML file, accepting the
rows and start parameters from SEP's iterative calls and returning just a
subset of the docs;
3) write a script to process the XML and push it to Solr, not using DIH;
4) use XSLT to transform the result XML into an update message and use
XPathEntityProcessor with useSolrAddSchema=true and streaming (see the
sketch after this list).
The last option seems like the most elegant and reusable approach, though
I'm not certain it'll work.
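
For option 4, I'm picturing something like the following (the paths are
hypothetical, and I haven't verified that the xsl and stream options work
together):

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8"/>
    <document>
      <!-- The stylesheet must rewrite the old /select response into a
           standard <add><doc>...</doc></add> update message;
           useSolrAddSchema then maps the fields without declaring them
           in this config. -->
      <entity name="migrate"
              processor="XPathEntityProcessor"
              url="/data/old-solr-dump.xml"
              xsl="xslt/response-to-add.xsl"
              useSolrAddSchema="true"
              stream="true"/>
    </document>
  </dataConfig>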

It'd be great if SolrEntityProcessor could stream static files, or if I
could specify the Solr result format while using the XPathEntityProcessor
(i.e., a useSolrResultSchema option).

Any other ideas?

On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog <goks...@gmail.com> wrote:

On 10/13/2013 10:02 AM, Shawn Heisey wrote:

On 10/13/2013 10:16 AM, Josh Lincoln wrote:

I have a large Solr response in XML format and would like to import it
into a new Solr collection. I'm able to use DIH with SolrEntityProcessor,
but only if I first truncate the file to a small subset of the records. I
was hoping to set stream="true" to handle the full file, but I still get
an out-of-memory error, so I believe stream does not work with
SolrEntityProcessor (I know the docs only mention the stream option for
the XPathEntityProcessor, but I was hoping SolrEntityProcessor just might
have the same capability).

Before I open a JIRA to request stream support for SolrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the Solr results format? Maybe a way to use XPath to get the values and a
transformer to set the field names? I'm hoping not to have to declare the
field names in dataConfig so I can reuse the process across data sets.

How big is the XML file?  You might be running into a size limit for
HTTP POST.

In newer 4.x versions, Solr itself sets the size of the POST buffer
regardless of what the container config has.  That size defaults to 2MB
but is configurable using the formdataUploadLimitInKB setting that you
can find in the example solrconfig.xml file, on the requestParsers tag.
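
The relevant stanza in the example solrconfig.xml looks roughly like this
(the values shown are the example's defaults):

  <requestDispatcher handleSelect="false">
    <!-- formdataUploadLimitInKB caps form-encoded POST bodies, in KB. -->
    <requestParsers enableRemoteStreaming="true"
                    multipartUploadLimitInKB="2048000"
                    formdataUploadLimitInKB="2048"/>
  </requestDispatcher>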

In Solr 3.x, if you used the included Jetty, it had a configured HTTP
POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
included Jetty that prevented the configuration element from working, so
the actual limit was Jetty's default of 200KB.  With other containers
and these older versions, you would need to change your container
configuration.
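
If memory serves, the container-level knob is a stanza like this in the
bundled etc/jetty.xml (the value is in bytes, and illustrative here); this
is the element the bug below prevented from taking effect:

  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
    <Arg>2097152</Arg>
  </Call>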

https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn

The SEP calls out to another Solr and reads from it. Are you importing data
from another Solr and cross-connecting it with your uploaded XML?

If the memory errors are a problem with streaming, you could try "piping"
your uploaded documents through a processor that supports streaming. This
would then push one document at a time into your processor that calls out
to Solr and combines records.
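
A rough sketch of what I mean, in DIH config terms -- the names, paths,
and the id field are hypothetical, and I haven't tried nesting SEP this
way myself:

  <dataConfig>
    <dataSource type="FileDataSource"/>
    <document>
      <!-- The outer entity streams one <doc> at a time out of the
           uploaded XML; the inner entity looks each one up in another
           Solr and merges in its fields. -->
      <entity name="outer"
              processor="XPathEntityProcessor"
              url="/data/uploaded.xml"
              forEach="/response/result/doc"
              stream="true">
        <field column="id" xpath="/response/result/doc/str[@name='id']"/>
        <entity name="inner"
                processor="SolrEntityProcessor"
                url="http://otherhost:8983/solr/core1"
                query="id:${outer.id}"/>
      </entity>
    </document>
  </dataConfig>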


