Re: Splitting files on new line using hadoop fs

2012-02-22 Thread bejoy.hadoop
Hi Mohit
AFAIK there is no default mechanism for that in Hadoop. A file is split into
blocks based only on the configured block size during the HDFS copy. When the
file is processed with MapReduce, the record reader takes care of the newlines,
even if a line spans multiple blocks.
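
To make that concrete: the convention a line-based record reader follows is
roughly that every split except the first skips its first (possibly partial)
line, and every split reads one line past its own end, so a line straddling a
block boundary is still read exactly once, by exactly one mapper. A simplified
sketch of the idea in Java (my illustration, not the actual Hadoop
LineRecordReader source; the class name is made up):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Simplified sketch of the split-reading convention. Assumes 'in' is
// positioned at byte offset 'start' of the file.
class SplitLineSketch {
    static void readSplit(LineReader in, long start, long end) throws IOException {
        Text line = new Text();
        long pos = start;
        if (start != 0) {
            // Not the first split: discard the (possibly partial) first
            // line; the previous split reads past its end and owns it.
            pos += in.readLine(line);
        }
        while (pos <= end) { // deliberately read one line past 'end'
            int bytes = in.readLine(line);
            if (bytes == 0) break; // end of file
            pos += bytes;
            System.out.println(line); // emit the record (stand-in for the mapper hand-off)
        }
    }
}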

Could you explain more about the use case that requires such splitting during
the HDFS copy itself?

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Splitting files on new line using hadoop fs
Sent: Feb 23, 2012 01:45

How can I copy large text files using hadoop fs such that the split occurs
based on blocks plus newlines, instead of blocks alone? Is there a way to do
this?



Regards
Bejoy K S

From handheld, Please excuse typos.


Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote:

 Hi Mohit
 AFAIK there is no default mechanism for that in Hadoop. A file is split into
 blocks based only on the configured block size during the HDFS copy. When the
 file is processed with MapReduce, the record reader takes care of the
 newlines, even if a line spans multiple blocks.

 Could you explain more about the use case that requires such splitting during
 the HDFS copy itself?


 I am using Pig's XMLLoader in Piggybank to read XML documents concatenated in
a text file. But the Pig script doesn't work when the file is big enough that
Hadoop splits it into multiple blocks.

Any suggestions on how I can make it work? Below is the simple script that I
would like to enhance once it starts working. Please note that this works for
small files.


register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

raw = LOAD '/examples/testfile5.txt' using
org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);

dump raw;
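
For context: as I understand it, XMLLoader('abc') emits one record per
<abc>...</abc> element, so the concatenated input file should look roughly
like this (hypothetical contents, one document per line):

<abc>first document ...</abc>
<abc>second document ...</abc>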


 --Original Message--
 From: Mohit Anchlia
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: Splitting files on new line using hadoop fs
 Sent: Feb 23, 2012 01:45

 How can I copy large text files using hadoop fs such that the split occurs
 based on blocks plus newlines, instead of blocks alone? Is there a way to do
 this?



 Regards
 Bejoy K S

 From handheld, Please excuse typos.



Re: Splitting files on new line using hadoop fs

2012-02-22 Thread bejoy.hadoop
Hi Mohit
I'm not an expert in Pig, and it'd be better to use the Pig user group for
Pig-specific queries, but I'll try to help you with some basic troubleshooting.

It sounds strange that Pig's XMLLoader can't load larger XML files that
consist of multiple blocks. Or is it that Pig is not able to load the
concatenated files you are trying with? If that is the case, the trouble could
come from simply appending the contents of multiple XML files into a single
file.

Pig users can also share workarounds for how they deal with loading many small
XML files efficiently.

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Mohit Anchlia mohitanch...@gmail.com
Date: Wed, 22 Feb 2012 12:29:26 
To: common-user@hadoop.apache.org; bejoy.had...@gmail.com
Subject: Re: Splitting files on new line using hadoop fs

On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote:

 Hi Mohit
 AFAIK there is no default mechanism for that in Hadoop. A file is split into
 blocks based only on the configured block size during the HDFS copy. When the
 file is processed with MapReduce, the record reader takes care of the
 newlines, even if a line spans multiple blocks.

 Could you explain more about the use case that requires such splitting during
 the HDFS copy itself?


 I am using Pig's XMLLoader in Piggybank to read XML documents concatenated in
a text file. But the Pig script doesn't work when the file is big enough that
Hadoop splits it into multiple blocks.

Any suggestions on how I can make it work? Below is the simple script that I
would like to enhance once it starts working. Please note that this works for
small files.


register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

raw = LOAD '/examples/testfile5.txt' using
org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);

dump raw;


 --Original Message--
 From: Mohit Anchlia
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: Splitting files on new line using hadoop fs
 Sent: Feb 23, 2012 01:45

 How can I copy large text files using hadoop fs such that the split occurs
 based on blocks plus newlines, instead of blocks alone? Is there a way to do
 this?



 Regards
 Bejoy K S

 From handheld, Please excuse typos.




Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
Thanks, I did post this question to that group. All the XML documents are
separated by a newline, so that shouldn't be the issue, I think.

On Wed, Feb 22, 2012 at 12:44 PM, bejoy.had...@gmail.com wrote:

 Hi Mohit
 I'm not an expert in Pig, and it'd be better to use the Pig user group for
 Pig-specific queries, but I'll try to help you with some basic
 troubleshooting.

 It sounds strange that Pig's XMLLoader can't load larger XML files that
 consist of multiple blocks. Or is it that Pig is not able to load the
 concatenated files you are trying with? If that is the case, the trouble
 could come from simply appending the contents of multiple XML files into a
 single file.

 Pig users can also share workarounds for how they deal with loading many
 small XML files efficiently.

 Regards
 Bejoy K S

 From handheld, Please excuse typos.
 --
 From: Mohit Anchlia mohitanch...@gmail.com
 Date: Wed, 22 Feb 2012 12:29:26 -0800
 To: common-user@hadoop.apache.org; bejoy.had...@gmail.com
 Subject: Re: Splitting files on new line using hadoop fs


 On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote:

 Hi Mohit
 AFAIK there is no default mechanism for that in Hadoop. A file is split into
 blocks based only on the configured block size during the HDFS copy. When the
 file is processed with MapReduce, the record reader takes care of the
 newlines, even if a line spans multiple blocks.

 Could you explain more about the use case that requires such splitting during
 the HDFS copy itself?


  I am using Pig's XMLLoader in Piggybank to read XML documents concatenated
 in a text file. But the Pig script doesn't work when the file is big enough
 that Hadoop splits it into multiple blocks.

 Any suggestions on how I can make it work? Below is the simple script that I
 would like to enhance once it starts working. Please note that this works
 for small files.


 register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

 raw = LOAD '/examples/testfile5.txt' using
 org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);

 dump raw;


 --Original Message--
 From: Mohit Anchlia
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: Splitting files on new line using hadoop fs
 Sent: Feb 23, 2012 01:45

 How can I copy large text files using hadoop fs such that the split occurs
 based on blocks plus newlines, instead of blocks alone? Is there a way to do
 this?



 Regards
 Bejoy K S

 From handheld, Please excuse typos.