Re: HiveContext unable to recognize the delimiter of Hive table in textfile partitioned by date

2016-04-11 Thread Shiva Achari
Hi All,

In the above scenario, if the field delimiter is Hive's default then Spark
is able to parse the data as expected, hence I believe this is a bug.
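
For reference, a minimal sketch of the default-delimiter case I am referring
to, using the same DataFrame (distData_1) as in the quoted script below. The
table name part_table_default and its location are illustrative only; with no
explicit "fields terminated by" clause, Hive falls back to its default '\001'
(Ctrl-A) field delimiter, and in that setup the read back through HiveContext
comes out parsed as expected:

// Same table shape as part_table, but no explicit field delimiter,
// so Hive's default '\001' (Ctrl-A) delimiter applies.
sqlContext.sql("""
  create external table part_table_default (a string, b int, c bigint)
  partitioned by (event_date date)
  stored as textfile
  location '/user/hdfs/hive/part_table_default'
""")

// Write the same DataFrame and read it back through HiveContext.
distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table_default")
sqlContext.sql("select * from part_table_default").show()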

Regards,
Shiva Achari


On Tue, Apr 5, 2016 at 8:15 PM, Shiva Achari <shiva.ach...@gmail.com> wrote:

> Hi,
>
> I have created a Hive external table stored as textfile, partitioned by
> event_date (Date).
>
> How do we specify a specific CSV format while reading a Hive table from Spark?
>
> The environment is
>
>  1. Spark 1.5.0 - cdh5.5.1, using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
>  2. Hive 1.1, CDH 5.5.1
>
> scala script
>
> sqlContext.setConf("hive.exec.dynamic.partition", "true")
> sqlContext.setConf("hive.exec.dynamic.partition.mode",
> "nonstrict")
>
> val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
> val distData_1 = distData.withColumn("event_date", current_date())
> distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]
>
> scala > distData_1.show
> +----+----+----+------------+
> | _1 | _2 | _3 | event_date |
> +----+----+----+------------+
> |  1 |  1 |  1 | 2016-03-25 |
> |  2 |  2 |  2 | 2016-03-25 |
> |  3 |  3 |  3 | 2016-03-25 |
> +----+----+----+------------+
>
>
> distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")
>
>
> scala > sqlContext.sql("select * from part_table").show
> +-------+------+------+------------+
> |     a |    b |    c | event_date |
> +-------+------+------+------------+
> | 1,1,1 | null | null | 2016-03-25 |
> | 2,2,2 | null | null | 2016-03-25 |
> | 3,3,3 | null | null | 2016-03-25 |
> +-------+------+------+------------+
>
>
>
> Hive table
>
> create external table part_table (a String, b int, c bigint)
> partitioned by (event_date Date)
> row format delimited fields terminated by ','
> stored as textfile  LOCATION "/user/hdfs/hive/part_table";
>
> select * from part_table shows
> | part_table.a | part_table.b | part_table.c | part_table.event_date |
> | 1            | 1            | 1            | 2016-03-25            |
> | 2            | 2            | 2            | 2016-03-25            |
> | 3            | 3            | 3            | 2016-03-25            |
>
>
> Looking at HDFS:
>
>
> The path has 2 part files
> /user/hdfs/hive/part_table/event_date=2016-03-25
> part-0
> part-1
>
>   part-0 content
> 1,1,1
>   part-1 content
> 2,2,2
> 3,3,3
>
>
> P.S. If we store the table as ORC, it writes and reads the data as
> expected.
>
>


HiveContext unable to recognize the delimiter of Hive table in textfile partitioned by date

2016-04-05 Thread Shiva Achari
Hi,

I have created a Hive external table stored as textfile, partitioned by
event_date (Date).

How do we specify a specific CSV format while reading a Hive table from Spark?
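
As a point of reference, a sketch of how one could read the underlying files
with an explicit delimiter using the external spark-csv package. This is only
an assumption (it requires com.databricks:spark-csv on the classpath) and it
bypasses the Hive table definition entirely; the path is the table location
used further below:

// Read the comma-delimited files under the table location directly,
// bypassing the Hive metastore definition (spark-csv external package).
val csvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("/user/hdfs/hive/part_table/event_date=2016-03-25")

csvDF.show()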

The environment is

 1. Spark 1.5.0 - cdh5.5.1, using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
 2. Hive 1.1, CDH 5.5.1

scala script

sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
val distData_1 = distData.withColumn("event_date", current_date())
distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]

scala > distData_1.show
+----+----+----+------------+
| _1 | _2 | _3 | event_date |
+----+----+----+------------+
|  1 |  1 |  1 | 2016-03-25 |
|  2 |  2 |  2 | 2016-03-25 |
|  3 |  3 |  3 | 2016-03-25 |
+----+----+----+------------+


distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")


scala > sqlContext.sql("select * from part_table").show
+-------+------+------+------------+
|     a |    b |    c | event_date |
+-------+------+------+------------+
| 1,1,1 | null | null | 2016-03-25 |
| 2,2,2 | null | null | 2016-03-25 |
| 3,3,3 | null | null | 2016-03-25 |
+-------+------+------+------------+



Hive table

create external table part_table (a String, b int, c bigint)
partitioned by (event_date Date)
row format delimited fields terminated by ','
stored as textfile  LOCATION "/user/hdfs/hive/part_table";

select * from part_table shows
| part_table.a | part_table.b | part_table.c | part_table.event_date |
| 1            | 1            | 1            | 2016-03-25            |
| 2            | 2            | 2            | 2016-03-25            |
| 3            | 3            | 3            | 2016-03-25            |


Looking at HDFS:


The path has 2 part files
/user/hdfs/hive/part_table/event_date=2016-03-25
part-0
part-1

  part-0 content
1,1,1
  part-1 content
2,2,2
3,3,3
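
Since the partition files on HDFS are plain comma-separated text, a possible
workaround (just a sketch, not a fix for the HiveContext behaviour) is to read
the partition directory directly and split on the delimiter ourselves:

import org.apache.spark.sql.functions.lit

// Read the raw partition files listed above and split each line on ','.
// event_date is re-attached as a plain string literal for simplicity.
val raw = sc.textFile("/user/hdfs/hive/part_table/event_date=2016-03-25")
val parsed = raw.map(_.split(","))
  .map(r => (r(0), r(1).toInt, r(2).toLong))
  .toDF("a", "b", "c")
  .withColumn("event_date", lit("2016-03-25"))

parsed.show()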


P.S. If we store the table as ORC, it writes and reads the data as expected.
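
For completeness, a sketch of the ORC variant that behaves as expected, using
the same DataFrame as above; the table name part_table_orc is illustrative only:

// Write the same DataFrame using the ORC file format instead of text;
// reading it back through HiveContext returns all three columns parsed correctly.
distData_1.write
  .format("orc")
  .mode("append")
  .partitionBy("event_date")
  .saveAsTable("part_table_orc")

sqlContext.sql("select * from part_table_orc").show()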