Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Thanks, that sounds perfect



On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng wrote:

> You can search for XMLInputFormat on Google. There are some
> implementations that allow you to specify the tag to split on, e.g.:
>
> https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java
>
> On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld wrote:
> > Unfortunately, I don't have a bunch of moderately big xml files; I have one,
> > really big file - big enough that reading it into memory as a single string
> > is not feasible.
> >
> >
> > On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng wrote:
> >>
> >> Try sc.wholeTextFiles(). It reads the entire file into a string
> >> record. -Xiangrui
> >>
> >> On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld wrote:
> >> > We are trying to read some large GraphML files to use in spark.
> >> >
> >> > Is there an easy way to read XML-based files like this that accounts for
> >> > partition boundaries and the like?
> >> >
> >> >  Thanks,
> >> >  Nathan



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
You can search for XMLInputFormat on Google. There are some
implementations that allow you to specify the tag to split on, e.g.:
https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java
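
To illustrate the idea behind a tag-splitting input format, here is a
minimal sketch in plain Python (no Spark or Hadoop involved). It scans a
byte stream in fixed-size chunks and yields every span between a literal
start tag and end tag, so the whole file is never held in memory. The
function name iter_xml_records, its parameters, and the sample GraphML
snippet are hypothetical and are not taken from the linked implementation.
Matching is on literal bytes, so nested elements with the same name and
self-closing elements are not handled.

```python
import io
from typing import BinaryIO, Iterator


def iter_xml_records(stream: BinaryIO, start_tag: bytes, end_tag: bytes,
                     chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield each start_tag ... end_tag span found in the stream.

    Hypothetical helper: only the current, still-unmatched part of the
    input is buffered, so memory use is bounded by the largest record
    plus one chunk.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            start = buf.find(start_tag)
            if start == -1:
                # No start tag yet: keep only a tail that might hold the
                # beginning of a start tag cut off by the chunk boundary.
                keep = len(start_tag) - 1
                buf = buf[-keep:] if keep > 0 else b""
                break
            end = buf.find(end_tag, start + len(start_tag))
            if end == -1:
                # Record is incomplete: drop bytes before it, read more.
                buf = buf[start:]
                break
            end += len(end_tag)
            yield buf[start:end]
            buf = buf[end:]


if __name__ == "__main__":
    sample = (b'<graphml><graph>'
              b'<node id="a"><data>1</data></node>'
              b'<node id="b"><data>2</data></node>'
              b'<edge source="a" target="b"/>'
              b'</graph></graphml>')
    # A tiny chunk size forces tags to straddle chunk boundaries.
    for record in iter_xml_records(io.BytesIO(sample), b"<node ", b"</node>",
                                   chunk_size=7):
        print(record.decode("utf-8"))
```

In Spark, an input format of this kind would typically be passed to
sc.newAPIHadoopFile, with the start and end tags set in the Hadoop
configuration. The exact configuration keys depend on the implementation
you pick, so check its source before relying on them.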

On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld wrote:
> Unfortunately, I don't have a bunch of moderately big xml files; I have one,
> really big file - big enough that reading it into memory as a single string
> is not feasible.
>
>
> On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng wrote:
>>
>> Try sc.wholeTextFiles(). It reads the entire file into a string
>> record. -Xiangrui
>>
>> On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld wrote:
>> > We are trying to read some large GraphML files to use in spark.
>> >
>> > Is there an easy way to read XML-based files like this that accounts for
>> > partition boundaries and the like?
>> >
>> >  Thanks,
>> >  Nathan


Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Unfortunately, I don't have a bunch of moderately big xml files; I have
one, really big file - big enough that reading it into memory as a single
string is not feasible.


On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng wrote:

> Try sc.wholeTextFiles(). It reads the entire file into a string
> record. -Xiangrui
>
> On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld wrote:
> > We are trying to read some large GraphML files to use in spark.
> >
> > Is there an easy way to read XML-based files like this that accounts for
> > partition boundaries and the like?
> >
> >  Thanks,
> >  Nathan



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
Try sc.wholeTextFiles(). It reads the entire file into a string
record. -Xiangrui

On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld wrote:
> We are trying to read some large GraphML files to use in spark.
>
> Is there an easy way to read XML-based files like this that accounts for
> partition boundaries and the like?
>
>  Thanks,
>  Nathan