RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
January 2016 23:31
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

Mich, if you have a toolpath that you can use to pipeline the required edits to the source file, you can use a chain similar to this:

    hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII |

RE: Loading data containing newlines

2016-01-15 Thread Ryan Harris
execution engine, but setting up Spark strictly to resolve this issue seems like overkill to me.

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Friday, January 15, 2016 4:04 PM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

Ok but I believe there are oth

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
From: Alexander Pivovarov [mailto:apivova...@gmail.com]
Sent: 15 January 2016 23:07
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

Probably Bryan can try both Hive and

Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov

Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *S

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
From: Marcin Tustin [mailto:mtus...@handybook.com]
Sent: 15 January 2016 21:51
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

You can open a file as an RDD of lines, and map whatever custom tokenisation function you want over it; alternatively you can partition down to a reasonable size

Re: Loading data containing newlines

2016-01-15 Thread Gopal Vijayaraghavan
> You can open a file as an RDD of lines, and map whatever custom
> tokenisation function you want over it;

That's what a SerDe does in Hive (like OpenCSVSerDe). Once your record gets split into multiple lines, then the problem becomes more complex since Spark's functional nature demands side-eff
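To make the "record split into multiple lines" problem concrete, here is a minimal Python sketch of stitching physical lines back into logical records by tracking quote parity. It is only an illustration under stated assumptions: the function name `rejoin_records` and the sample lines are hypothetical, fields are assumed to be quoted with double quotes (with `""` as the escape), and the lines are assumed to arrive in order in a single process. A distributed RDD of lines does not give that ordering guarantee across splits for free, which is the harder part of the problem.

```python
def rejoin_records(lines, quote='"'):
    """Yield logical records, merging physical lines until quotes balance."""
    buffer = []
    open_quote = False
    for line in lines:
        # An odd number of quote characters flips the "inside a quoted field"
        # state. Escaped quotes ("") add an even count, so parity is unchanged.
        if line.count(quote) % 2 == 1:
            open_quote = not open_quote
        buffer.append(line)
        if not open_quote:
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Unterminated quote at end of input: emit the remainder, do not drop it.
        yield "\n".join(buffer)


# Hypothetical input: the first record was split across two physical lines.
lines = ['1,"first line', 'second line",x', '2,plain,y']
records = list(rejoin_records(lines))
```

The state carried from one line to the next (`open_quote`, `buffer`) is what a purely per-line `map` cannot express.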

Re: Loading data containing newlines

2016-01-15 Thread Marcin Tustin
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 15 January 2016 21:39
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
From: Marcin Tustin [mailto:mtus...@handybook.com]
Sent: 15 January 2016 21:39
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

I second this. I've generally found anything else to be disappointing when working with data which is at all funky.

On Wed, Jan 13, 2016

Re: Loading data containing newlines

2016-01-15 Thread Marcin Tustin
>> well, until our newest data contains fields
>> with embedded newlines.
>>
>> We are now looking into options further up the pipeline to see if we can
>> condition the data earlier in the process.
>>
>> *From:* Mich Tale

RE: Loading data containing newlines

2016-01-15 Thread Mich Talebzadeh
From: Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
Sent: 14 January 2016 00:13
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/
2. CREATE EXTERNAL TABLE stg_ (cols.) ROW

Re: Loading data containing newlines

2016-01-13 Thread Alexander Pivovarov
> contains fields
> with embedded newlines.
>
> We are now looking into options further up the pipeline to see if we can
> condition the data earlier in the process.
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Wednesday, January 13, 2016 10

RE: Loading data containing newlines

2016-01-13 Thread Gerber, Bryan W
fields with embedded newlines.

We are now looking into options further up the pipeline to see if we can condition the data earlier in the process.

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, January 13, 2016 10:34 AM
To: user@hive.apache.org
Subject: RE: Loading data con
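As one illustration of what "conditioning the data earlier" could look like, the Python sketch below rewrites a CSV stream so that newlines inside fields are replaced before the file reaches a line-oriented loader. This is an assumption about the kind of preprocessing meant, not the approach the poster actually adopted; the function name `flatten_newlines`, the choice of a space as the replacement, and the sample data are all hypothetical.

```python
import csv
import io


def flatten_newlines(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing newlines inside fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [
                field.replace("\r\n", replacement)
                .replace("\n", replacement)
                .replace("\r", replacement)
                for field in row
            ]
        )


# Hypothetical input: the first record has a newline inside a quoted field.
src = io.StringIO('1,"a\nb",x\n2,c,y\n', newline="")
dst = io.StringIO()
flatten_newlines(src, dst)
cleaned = dst.getvalue()
```

After this pass every physical line of `cleaned` is exactly one record, at the cost of losing the original line breaks inside the field. If those must survive, an escape sequence could be used as `replacement` instead of a space.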

Re: Loading data containing newlines

2016-01-13 Thread Gopal Vijayaraghavan
> We are pushing the compressed text files into HDFS directory for Hive
> EXTERNAL table, then using an INSERT on the table using ORC storage. We
> are letting Hive handle the ORC file creation process.

Are the compressed text files small enough to process one by one? I did write something similar

RE: Loading data containing newlines

2016-01-13 Thread Mich Talebzadeh
From: Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
Sent: 13 January 2016 18:12
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

We are pushing the compressed text files into HDFS directory f

RE: Loading data containing newlines

2016-01-13 Thread Gerber, Bryan W
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

Hi Bryan,

As a matter of interest are you loading text files into local directories in encrypted format at all and then push it into HDFS/Hive as ORC?

Thanks

Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id

RE: Loading data containing newlines

2016-01-12 Thread Mich Talebzadeh
Hi Bryan,

As a matter of interest are you loading text files into local directories in encrypted format at all and then push it into HDFS/Hive as ORC?

Thanks

Dr Mich Talebzadeh
LinkedIn https://ww

RE: Loading data containing newlines

2016-01-12 Thread Alexander Pivovarov
> can give it a
> different line delimiter, but Hive 1.2.1 does not support it: "FAILED:
> SemanticException 3:20 LINES TERMINATED BY only supports newline '\n' right
> now."
>
> *From:* Alexander Pivovarov [mailto:apivova...@gmail.com]
> *Sent:* Tuesday,

RE: Loading data containing newlines

2016-01-12 Thread Gerber, Bryan W
support it: "FAILED: SemanticException 3:20 LINES TERMINATED BY only supports newline '\n' right now."

From: Alexander Pivovarov [mailto:apivova...@gmail.com]
Sent: Tuesday, January 12, 2016 9:52 AM
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

Try

Re: Loading data containing newlines

2016-01-12 Thread Alexander Pivovarov
Try CSV serde. It should correctly parse quoted field value having newline inside
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

Hadoop should automatically read bz2 files

On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W wrote:
> We are attempting to load CSV text files (compressed
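For readers unfamiliar with the underlying issue, the small Python sketch below shows why a quoted field containing a newline is a problem for line-oriented readers. It uses Python's standard `csv` module and a made-up sample, so it demonstrates the parsing behaviour only. It does not show what Hive's CSV SerDe does with the same input, which also depends on how the input format splits the file into lines before the SerDe sees them.

```python
import csv
import io

# Hypothetical sample: a header plus two records, the first with a newline
# inside a quoted field.
raw = 'id,comment\n1,"first line\nsecond line"\n2,"plain"\n'

# Splitting on line terminators sees four physical lines.
physical_lines = raw.splitlines()

# A quote-aware CSV parser sees three logical rows.
logical_rows = list(csv.reader(io.StringIO(raw, newline="")))
```

Here `physical_lines` has one more entry than `logical_rows`, because the quoted field spans two physical lines but belongs to a single record.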