Re: DIH - stream file with solrEntityProcessor

2013-10-15 Thread Josh Lincoln
ultimately I just temporarily increased the memory to handle this data set,
but that won't always be practical.

I did try the csv export/import and it worked well in this case. I hadn't
considered it at first. I am wary that the escaping and splitting may be
problematic with some data sets, so I'll look into adding XMLResponseParser
support to XPathEntityProcessor (essentially a useSolrResponseSchema option),
though I have a feeling only a few other people
would be interested in this.
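
For reference, the explicit mapping with XPathEntityProcessor looks roughly
like the sketch below, which is the kind of thing a useSolrResponseSchema
option would make unnecessary. The url and the field names (id, title, cat)
are placeholders, not from a real config:

<dataConfig>
  <!-- URLDataSource fetches the saved response over HTTP; FileDataSource would do for a local file -->
  <dataSource type="URLDataSource" encoding="UTF-8"/>
  <document>
    <!-- url is a stand-in for wherever the static response xml is served -->
    <entity name="docs"
            processor="XPathEntityProcessor"
            url="http://old-host/dump/select"
            stream="true"
            forEach="/response/result/doc">
      <!-- every field has to be declared by hand -->
      <field column="id"    xpath="/response/result/doc/str[@name='id']"/>
      <field column="title" xpath="/response/result/doc/str[@name='title']"/>
      <!-- multi-valued fields sit under arr elements in the response -->
      <field column="cat"   xpath="/response/result/doc/arr[@name='cat']/str"/>
    </entity>
  </document>
</dataConfig>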

Thanks for the replies.


On Mon, Oct 14, 2013 at 11:19 PM, Lance Norskog  wrote:

> Can you do this data in CSV format? There is a CSV reader in the DIH.
> The SEP was not intended to read from files, since there are already
> better tools that do that.
>
> Lance
>
>
> On 10/14/2013 04:44 PM, Josh Lincoln wrote:
>
>> Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
>> the POST buffer being the issue. Thanks for suggesting I test this. The
>> full file is over a gig.
>>
>> Lance, I'm actually pointing SEP at a static file (I simply named the file
>> "select" and put it on a Web server). SEP thinks it's a large solr
>> response, which it was, though now it's just static xml. Works well until
>> I
>> hit the memory limit of the new solr instance.
>>
>> I can't query the old solr from the new one b/c they're on two different
>> networks. I can't copy the index files b/c I only want a subset of the
>> data
>> (identified with a query and dumped to xml...all fields of interest were
>> stored). To further complicate things, the old solr is 1.4. I was hoping
>> to
>> use the result xml format to backup the old, and DIH SEP to import to the
>> new dev solr4.x. It's promising as a simple and repeatable migration
>> process, except that SEP fails on largish files.
>>
>> It seems my options are 1) use the xpathprocessor and identify each field
>> (there are many fields); 2) write a small script to act as a proxy to the
>> xml file and accept the row and start parameters from the SEP iterative
>> calls and return just a subset of the docs; 3) a script to process the xml
>> and push to solr, not using DIH; 4) consider XSLT to transform the result
>> xml to an update message and use XPathEntityProcessor
>> with useSolrAddSchema=true and streaming. The latter seems like the most
>> elegant and reusable approach, though I'm not certain it'll work.
>>
>> It'd be great if solrEntityProcessor could stream static files, or if I
>> could specify the solr result format while using the xpathentityprocessor
>> (i.e. a useSolrResultSchema option)
>>
>> Any other ideas?
>>
>>
>>
>>
>>
>>
>> On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog  wrote:
>>
>>> On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>>>
>>>> On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>>>
>>>>> I have a large solr response in xml format and would like to import it into
>>>>> a new solr collection. I'm able to use DIH with solrEntityProcessor, but
>>>>> only if I first truncate the file to a small subset of the records. I was
>>>>> hoping to set stream="true" to handle the full file, but I still get an out
>>>>> of memory error, so I believe stream does not work with solrEntityProcessor
>>>>> (I know the docs only mention the stream option for the
>>>>> XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
>>>>> the same capability).
>>>>>
>>>>> Before I open a jira to request stream support for solrEntityProcessor in
>>>>> DIH, is there an alternate approach for importing large files that are in
>>>>> the solr results format?
>>>>> Maybe a way to use xpath to get the values and a transformer to set the
>>>>> field names? I'm hoping to not have to declare the field names in
>>>>> dataConfig so I can reuse the process across data sets.
>>>>>
>>>> How big is the XML file?  You might be running into a size limit for
>>>> HTTP POST.
>>>>
>>>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>>>> regardless of what the container config has.  That size defaults to 2MB
>>>> but is configurable using the formdataUploadLimitInKB setting that you
>>>> can find in the example solrconfig.xml file, on the requestParsers tag.
>>>>
>>>> In Solr 3.x, if you used the included jetty, it had a configured HTTP
>>>> POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
>>>> included Jetty that prevented the configuration element from working, so
>>>> the actual limit was Jetty's default of 200KB.  With other containers
>>>> and these older versions, you would need to change your container
>>>> configuration.
>>>>
>>>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>>>>
>>>> Thanks,
>>>> Shawn
>>>>
>>> The SEP calls out to another Solr and reads. Are you importing data from
>>> another Solr and cross-connecting it with your uploaded XML?

Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Lance Norskog

Can you do this data in CSV format? There is a CSV reader in the DIH.
The SEP was not intended to read from files, since there are already 
better tools that do that.


Lance

On 10/14/2013 04:44 PM, Josh Lincoln wrote:

Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
the POST buffer being the issue. Thanks for suggesting I test this. The
full file is over a gig.

Lance, I'm actually pointing SEP at a static file (I simply named the file
"select" and put it on a Web server). SEP thinks it's a large solr
response, which it was, though now it's just static xml. Works well until I
hit the memory limit of the new solr instance.

I can't query the old solr from the new one b/c they're on two different
networks. I can't copy the index files b/c I only want a subset of the data
(identified with a query and dumped to xml...all fields of interest were
stored). To further complicate things, the old solr is 1.4. I was hoping to
use the result xml format to backup the old, and DIH SEP to import to the
new dev solr4.x. It's promising as a simple and repeatable migration
process, except that SEP fails on largish files.

It seems my options are 1) use the xpathprocessor and identify each field
(there are many fields); 2) write a small script to act as a proxy to the
xml file and accept the row and start parameters from the SEP iterative
calls and return just a subset of the docs; 3) a script to process the xml
and push to solr, not using DIH; 4) consider XSLT to transform the result
xml to an update message and use XPathEntityProcessor
with useSolrAddSchema=true and streaming. The latter seems like the most
elegant and reusable approach, though I'm not certain it'll work.

It'd be great if solrEntityProcessor could stream static files, or if I
could specify the solr result format while using the xpathentityprocessor
(i.e. a useSolrResultSchema option)

Any other ideas?






On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog  wrote:


On 10/13/2013 10:02 AM, Shawn Heisey wrote:


On 10/13/2013 10:16 AM, Josh Lincoln wrote:


I have a large solr response in xml format and would like to import it
into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream="true" to handle the full file, but I still get an
out
of memory error, so I believe stream does not work with
solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might
have
the same capability).

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.


How big is the XML file?  You might be running into a size limit for
HTTP POST.

In newer 4.x versions, Solr itself sets the size of the POST buffer
regardless of what the container config has.  That size defaults to 2MB
but is configurable using the formdataUploadLimitInKB setting that you
can find in the example solrconfig.xml file, on the requestParsers tag.

In Solr 3.x, if you used the included jetty, it had a configured HTTP
POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
included Jetty that prevented the configuration element from working, so
the actual limit was Jetty's default of 200KB.  With other containers
and these older versions, you would need to change your container
configuration.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn

The SEP calls out to another Solr and reads. Are you importing data from
another Solr and cross-connecting it with your uploaded XML?

If the memory errors are a problem with streaming, you could try "piping"
your uploaded documents through a processor that supports streaming. This
would then push one document at a time into your processor that calls out
to Solr and combines records.






Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Josh Lincoln
Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
the POST buffer being the issue. Thanks for suggesting I test this. The
full file is over a gig.

Lance, I'm actually pointing SEP at a static file (I simply named the file
"select" and put it on a Web server). SEP thinks it's a large solr
response, which it was, though now it's just static xml. Works well until I
hit the memory limit of the new solr instance.

I can't query the old solr from the new one b/c they're on two different
networks. I can't copy the index files b/c I only want a subset of the data
(identified with a query and dumped to xml...all fields of interest were
stored). To further complicate things, the old solr is 1.4. I was hoping to
use the result xml format to backup the old, and DIH SEP to import to the
new dev solr4.x. It's promising as a simple and repeatable migration
process, except that SEP fails on largish files.

It seems my options are 1) use the xpathprocessor and identify each field
(there are many fields); 2) write a small script to act as a proxy to the
xml file and accept the row and start parameters from the SEP iterative
calls and return just a subset of the docs; 3) a script to process the xml
and push to solr, not using DIH; 4) consider XSLT to transform the result
xml to an update message and use XPathEntityProcessor
with useSolrAddSchema=true and streaming. The latter seems like the most
elegant and reusable approach, though I'm not certain it'll work.
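
For option 4, a rough, untested sketch of the kind of stylesheet involved: it
walks /response/result/doc and emits a plain <add> message, treating arr
children as multi-valued fields (type elements like date and the score
pseudo-field would need checking):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/response">
    <add>
      <xsl:for-each select="result/doc">
        <doc>
          <!-- single-valued fields (str, int, long, float, double, bool, date) -->
          <xsl:for-each select="*[not(self::arr)]">
            <field name="{@name}"><xsl:value-of select="."/></field>
          </xsl:for-each>
          <!-- multi-valued fields: one field element per value inside arr -->
          <xsl:for-each select="arr">
            <xsl:variable name="fname" select="@name"/>
            <xsl:for-each select="*">
              <field name="{$fname}"><xsl:value-of select="."/></field>
            </xsl:for-each>
          </xsl:for-each>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>

The transformed output could then go through XPathEntityProcessor with
useSolrAddSchema="true" and stream="true", or be posted straight to /update.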

It'd be great if solrEntityProcessor could stream static files, or if I
could specify the solr result format while using the xpathentityprocessor
(i.e. a useSolrResultSchema option)

Any other ideas?






On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog  wrote:

> On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>
>> On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>
>>> I have a large solr response in xml format and would like to import it
>>> into
>>> a new solr collection. I'm able to use DIH with solrEntityProcessor, but
>>> only if I first truncate the file to a small subset of the records. I was
>>> hoping to set stream="true" to handle the full file, but I still get an
>>> out
>>> of memory error, so I believe stream does not work with
>>> solrEntityProcessor
>>> (I know the docs only mention the stream option for the
>>> XPathEntityProcessor, but I was hoping solrEntityProcessor just might
>>> have
>>> the same capability).
>>>
>>> Before I open a jira to request stream support for solrEntityProcessor in
>>> DIH, is there an alternate approach for importing large files that are in
>>> the solr results format?
>>> Maybe a way to use xpath to get the values and a transformer to set the
>>> field names? I'm hoping to not have to declare the field names in
>>> dataConfig so I can reuse the process across data sets.
>>>
>> How big is the XML file?  You might be running into a size limit for
>> HTTP POST.
>>
>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>> regardless of what the container config has.  That size defaults to 2MB
>> but is configurable using the formdataUploadLimitInKB setting that you
>> can find in the example solrconfig.xml file, on the requestParsers tag.
>>
>> In Solr 3.x, if you used the included jetty, it had a configured HTTP
>> POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
>> included Jetty that prevented the configuration element from working, so
>> the actual limit was Jetty's default of 200KB.  With other containers
>> and these older versions, you would need to change your container
>> configuration.
>>
>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>>
>> Thanks,
>> Shawn
>>
> The SEP calls out to another Solr and reads. Are you importing data from
> another Solr and cross-connecting it with your uploaded XML?
>
> If the memory errors are a problem with streaming, you could try "piping"
> your uploaded documents through a processor that supports streaming. This
> would then push one document at a time into your processor that calls out
> to Solr and combines records.
>
>


Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Lance Norskog

On 10/13/2013 10:02 AM, Shawn Heisey wrote:

On 10/13/2013 10:16 AM, Josh Lincoln wrote:

I have a large solr response in xml format and would like to import it into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream="true" to handle the full file, but I still get an out
of memory error, so I believe stream does not work with solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
the same capability).

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.

How big is the XML file?  You might be running into a size limit for
HTTP POST.

In newer 4.x versions, Solr itself sets the size of the POST buffer
regardless of what the container config has.  That size defaults to 2MB
but is configurable using the formdataUploadLimitInKB setting that you
can find in the example solrconfig.xml file, on the requestParsers tag.

In Solr 3.x, if you used the included jetty, it had a configured HTTP
POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
included Jetty that prevented the configuration element from working, so
the actual limit was Jetty's default of 200KB.  With other containers
and these older versions, you would need to change your container
configuration.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn

The SEP calls out to another Solr and reads. Are you importing data from 
another Solr and cross-connecting it with your uploaded XML?


If the memory errors are a problem with streaming, you could try 
"piping" your uploaded documents through a processor that supports 
streaming. This would then push one document at a time into your 
processor that calls out to Solr and combines records.
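
If streaming is the constraint, one possible wiring for that idea is nested
DIH entities: an outer XPathEntityProcessor that streams the uploaded file one
doc at a time, and an inner SolrEntityProcessor that queries the other Solr
for each record and merges the fields. Untested sketch; the file path, host,
core name and the id field are all stand-ins:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- outer entity streams the uploaded xml one doc at a time -->
    <entity name="uploaded"
            processor="XPathEntityProcessor"
            url="/tmp/dump.xml"
            stream="true"
            forEach="/response/result/doc">
      <field column="id" xpath="/response/result/doc/str[@name='id']"/>
      <!-- inner entity calls out to the other Solr once per streamed record -->
      <entity name="remote"
              processor="SolrEntityProcessor"
              url="http://other-host:8983/solr/core1"
              query="id:${uploaded.id}"
              rows="1"/>
    </entity>
  </document>
</dataConfig>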




Re: DIH - stream file with solrEntityProcessor

2013-10-13 Thread Shawn Heisey
On 10/13/2013 10:16 AM, Josh Lincoln wrote:
> I have a large solr response in xml format and would like to import it into
> a new solr collection. I'm able to use DIH with solrEntityProcessor, but
> only if I first truncate the file to a small subset of the records. I was
> hoping to set stream="true" to handle the full file, but I still get an out
> of memory error, so I believe stream does not work with solrEntityProcessor
> (I know the docs only mention the stream option for the
> XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
> the same capability).
> 
> Before I open a jira to request stream support for solrEntityProcessor in
> DIH, is there an alternate approach for importing large files that are in
> the solr results format?
> Maybe a way to use xpath to get the values and a transformer to set the
> field names? I'm hoping to not have to declare the field names in
> dataConfig so I can reuse the process across data sets.

How big is the XML file?  You might be running into a size limit for
HTTP POST.

In newer 4.x versions, Solr itself sets the size of the POST buffer
regardless of what the container config has.  That size defaults to 2MB
but is configurable using the formdataUploadLimitInKB setting that you
can find in the example solrconfig.xml file, on the requestParsers tag.
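
In the 4.x example solrconfig.xml that tag looks roughly like this (the values
shown are the shipped defaults); raising formdataUploadLimitInKB raises the
limit for form-encoded POSTs:

<!-- inside the requestDispatcher section of solrconfig.xml -->
<requestParsers enableRemoteStreaming="true"
                multipartUploadLimitInKB="2048000"
                formdataUploadLimitInKB="2048"/>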

In Solr 3.x, if you used the included jetty, it had a configured HTTP
POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
included Jetty that prevented the configuration element from working, so
the actual limit was Jetty's default of 200KB.  With other containers
and these older versions, you would need to change your container
configuration.
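
For the container side, with the Jetty 8 that ships with 4.x the usual
approach is to set the maxFormContentSize server attribute in jetty.xml,
something like the snippet below (the 10MB value is arbitrary, and the older
Jetty 6 in Solr 3.x uses a different attribute prefix). The bug linked below
is exactly this element being ignored in some early 4.x releases:

<!-- jetty.xml: raise Jetty's limit on form POST bodies -->
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
    <Arg>10485760</Arg>
  </Call>
</Configure>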

https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn



DIH - stream file with solrEntityProcessor

2013-10-13 Thread Josh Lincoln
I have a large solr response in xml format and would like to import it into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream="true" to handle the full file, but I still get an out
of memory error, so I believe stream does not work with solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
the same capability).
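
For context, the data-config involved is roughly of this shape (the url is a
placeholder for the Solr being read from; rows just sets how many docs SEP
pulls per request):

<dataConfig>
  <document>
    <entity name="sep"
            processor="SolrEntityProcessor"
            url="http://old-host:8983/solr/core1"
            query="*:*"
            rows="50"/>
  </document>
</dataConfig>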

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.

Anyone have ideas? thanks