Re: Hive splits/adds rows when outputting dataset with new lines

2014-10-07 Thread Navis류승우
Try with set hive.default.fileformat=SequenceFile;
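
A minimal sketch of how the setting might be applied to the CTAS from the steps quoted below. The table name `fixed` is hypothetical, and the explicit STORED AS variant is an alternative assumption rather than something tested in this thread:

```sql
-- Make SequenceFile the default format for tables created in this session.
SET hive.default.fileformat=SequenceFile;

-- Hypothetical table name `fixed`; same query as the failing CTAS.
CREATE TABLE fixed AS
SELECT regexp_replace(wordsmerged, 'of', '\nof\n') AS wordsseparate,
       wordsmerged
FROM singlerow;

-- Alternative (assumption): name the format on the table itself
-- instead of changing the session default.
-- CREATE TABLE fixed STORED AS SEQUENCEFILE AS SELECT ... FROM singlerow;

-- If the newlines are kept inside the value, this should count one row.
SELECT COUNT(*) FROM fixed;
```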

Thanks,
Navis

2014-10-06 20:51 GMT+09:00 Maciek mac...@sonra.io:

 Hello,

 I've encountered a situation where output containing newlines corrupts
 (multiplies) the rows of the returned dataset.
 This seems similar to HIVE-3012
 https://issues.apache.org/jira/browse/HIVE-3012 (fixed in 0.11), but
 I'm on Hive 0.13 and it still happens.
 Here are the steps to illustrate/reproduce:

 1. First, let's create a table with one row and one column by selecting from
 any existing table (substitute ANYTABLE accordingly):

 CREATE TABLE singlerow AS SELECT 'worldofhostels' wordsmerged FROM
 ANYTABLE LIMIT 1;

 and verify:

 SELECT * FROM singlerow;

 OK
 worldofhostels

 Time taken: 0.028 seconds, Fetched: 1 row(s)

 All good so far.
 2. Now let's introduce newlines:

 SELECT regexp_replace(wordsmerged,'of','\nof\n') wordsseparate FROM
 singlerow;

 OK

 world
 of
 hostels

 Time taken: 6.404 seconds, Fetched: 3 row(s)
 and I'm suddenly getting 3 rows.
 3. This is not limited to CLI output: when I submit a CTAS, it
 materializes the same corrupted result set:

 CREATE TABLE corrupted AS
 SELECT regexp_replace(wordsmerged,'of','\nof\n') wordsseparate,
 wordsmerged FROM singlerow;

 hive> select * from corrupted;

 OK

 world NULL
 of NULL
 hostels worldofhostels

 Time taken: 0.029 seconds, Fetched: 3 row(s)
 Apparently, the same thing happens: the new table is split into multiple
 rows, and the columns following the one in question (like wordsmerged)
 become NULL on the extra rows.
 Am I doing something wrong here?
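
 To check whether the extra rows are really part of the stored table rather than an effect of how the CLI prints results, a sketch along these lines may help. It assumes the `corrupted` table from step 3 exists:

```sql
-- Count rows as the table's reader sees them. A result of 3 rather than 1
-- would suggest the embedded newlines are treated as row delimiters on read.
SELECT COUNT(*) FROM corrupted;

-- Show the table's storage details, including its InputFormat and OutputFormat.
DESCRIBE FORMATTED corrupted;
```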

 Regards,
 Maciek



Re: Hive splits/adds rows when outputting dataset with new lines

2014-10-07 Thread Maciek
This works!
I'm quite surprised, though: in the steps I outlined, the issue appeared even
without CTAS (with a regular SELECT).
I still don't see how the two could be related. Or are they two separate issues?

Also, do you happen to know whether there is any way to make it work for TextFile?
Thank you,
Maciek

-- 
Kind Regards
Maciek Kocon