Mich, Yes, it is already partitioned.
In Hive, I can do this: INSERT OVERWRITE amo_bi_events PARTITION (dt) SELECT event_type, timestamp, …, concat(substring(timestamp, 1, 10), ' ', substring(timestamp, 12, 2), ':00:00') AS dt FROM amo_raw_events WHERE to_date(timestamp_iso) BETWEEN ‘2016-06-01’ AND ‘2016-06-02’; I tried the same equivalent using DataFrames but couldn’t get it to work. Thanks, Ben > On Jun 3, 2016, at 11:06 AM, Mich Talebzadeh <mich.talebza...@gmail.com> > wrote: > > OK fine but dt is the column used for partitioning the table. > > This is what I get in Hive itself > > use test; > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nonstrict; > drop table if exists amo_bi_events; > CREATE EXTERNAL TABLE `amo_bi_events`( > `event_type` string COMMENT '', > `timestamp` string COMMENT '', > `event_valid` int COMMENT '', > `event_subtype` string COMMENT '', > `user_ip` string COMMENT '', > `user_id` string COMMENT '', > `cookie_status` string COMMENT '', > `profile_status` string COMMENT '', > `user_status` string COMMENT '', > `previous_timestamp` string COMMENT '', > `user_agent` string COMMENT '', > `referer` string COMMENT '', > `uri` string COMMENT '', > `request_elapsed` bigint COMMENT '', > `browser_languages` string COMMENT '', > `acamp_id` int COMMENT '', > `creative_id` int COMMENT '', > `location_id` int COMMENT '', > `pcamp_id` int COMMENT '', > `pdomain_id` int COMMENT '', > `country` string COMMENT '', > `region` string COMMENT '', > `dma` int COMMENT '', > `city` string COMMENT '', > `zip` string COMMENT '', > `isp` string COMMENT '', > `line_speed` string COMMENT '', > `gender` string COMMENT '', > `year_of_birth` int COMMENT '', > `behaviors_read` string COMMENT '', > `behaviors_written` string COMMENT '', > `key_value_pairs` string COMMENT '', > `acamp_candidates` int COMMENT '', > `tag_format` string COMMENT '', > `optimizer_name` string COMMENT '', > `optimizer_version` string COMMENT '', > `optimizer_ip` string COMMENT '', > `pixel_id` int COMMENT '', > `video_id` string COMMENT '', > `video_network_id` int COMMENT '', > `video_time_watched` bigint COMMENT '', > `video_percentage_watched` int COMMENT '', > `conversion_valid_sale` int COMMENT '', > `conversion_sale_amount` float COMMENT '', > `conversion_commission_amount` float COMMENT '', > `conversion_step` int COMMENT '', > `conversion_currency` string COMMENT '', > `conversion_attribution` int COMMENT '', > `conversion_offer_id` string COMMENT '', > `custom_info` string COMMENT '', > `frequency` int COMMENT '', > `recency_seconds` int COMMENT '', > `cost` float COMMENT '', > `revenue` float COMMENT '', > `optimizer_acamp_id` int COMMENT '', > `optimizer_creative_id` int COMMENT '', > `optimizer_ecpm` float COMMENT '', > `event_id` string COMMENT '', > `impression_id` string COMMENT '', > `diagnostic_data` string COMMENT '', > `user_profile_mapping_source` string COMMENT '', > `latitude` float COMMENT '', > `longitude` float COMMENT '', > `area_code` int COMMENT '', > `gmt_offset` string COMMENT '', > `in_dst` string COMMENT '', > `proxy_type` string COMMENT '', > `mobile_carrier` string COMMENT '', > `pop` string COMMENT '', > `hostname` string COMMENT '', > `profile_ttl` string COMMENT '', > `timestamp_iso` string COMMENT '', > `reference_id` string COMMENT '', > `identity_organization` string COMMENT '', > `identity_method` string COMMENT '', > `mappable_id` string COMMENT '', > `profile_expires` string COMMENT '', > `video_player_iframed` int COMMENT '', > `video_player_in_view` int COMMENT '', > `video_player_width` int COMMENT '', > `video_player_height` int COMMENT '', > `host_domain` string COMMENT '', > `browser_type` string COMMENT '', > `browser_device_cat` string COMMENT '', > `browser_family` string COMMENT '', > `browser_name` string COMMENT '', > `browser_version` string COMMENT '', > `browser_major_version` string COMMENT '', > `browser_minor_version` string COMMENT '', > `os_family` string COMMENT '', > `os_name` string COMMENT '', > `os_version` string COMMENT '', > `os_major_version` string COMMENT '', > `os_minor_version` string COMMENT '') > PARTITIONED BY (`dt` timestamp) > STORED AS PARQUET; > desc formatted amo_bi_events; > > The output in hive is as follows: > > # col_name data_type comment > event_type string > timestamp string > event_valid int > event_subtype string > user_ip string > user_id string > cookie_status string > profile_status string > user_status string > previous_timestamp string > user_agent string > referer string > uri string > request_elapsed bigint > browser_languages string > acamp_id int > creative_id int > location_id int > pcamp_id int > pdomain_id int > country string > region string > dma int > city string > zip string > isp string > line_speed string > gender string > year_of_birth int > behaviors_read string > behaviors_written string > key_value_pairs string > acamp_candidates int > tag_format string > optimizer_name string > optimizer_version string > optimizer_ip string > pixel_id int > video_id string > video_network_id int > video_time_watched bigint > video_percentage_watched int > conversion_valid_sale int > conversion_sale_amount float > conversion_commission_amount float > conversion_step int > conversion_currency string > conversion_attribution int > conversion_offer_id string > custom_info string > frequency int > recency_seconds int > cost float > revenue float > optimizer_acamp_id int > optimizer_creative_id int > optimizer_ecpm float > event_id string > impression_id string > diagnostic_data string > user_profile_mapping_source string > latitude float > longitude float > area_code int > gmt_offset string > in_dst string > proxy_type string > mobile_carrier string > pop string > hostname string > profile_ttl string > timestamp_iso string > reference_id string > identity_organization string > identity_method string > mappable_id string > profile_expires string > video_player_iframed int > video_player_in_view int > video_player_width int > video_player_height int > host_domain string > browser_type string > browser_device_cat string > browser_family string > browser_name string > browser_version string > browser_major_version string > browser_minor_version string > os_family string > os_name string > os_version string > os_major_version string > os_minor_version string > # Partition Information > # col_name data_type comment > dt timestamp > # Detailed Table Information > Database: test > Owner: hduser > CreateTime: Fri Jun 03 19:03:20 BST 2016 > LastAccessTime: UNKNOWN > Retention: 0 > Location: > hdfs://rhes564:9000/user/hive/warehouse/test.db/amo_bi_events > Table Type: EXTERNAL_TABLE > Table Parameters: > EXTERNAL TRUE > transient_lastDdlTime 1464977000 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format 1 > Time taken: 0.397 seconds, Fetched: 124 row(s) > > So effectively that table is partitioned by dt in notime > > Now what I don't understand whether that table is already partitioned as you > said the table already exists! > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 3 June 2016 at 18:48, Benjamin Kim <bbuil...@gmail.com > <mailto:bbuil...@gmail.com>> wrote: > Mich, > > I am using .withColumn to add another column “dt” that is a reformatted > version of an existing column “timestamp”. The partitioned by column is “dt”. > > We are using Spark 1.6.0 in CDH 5.7.0. > > Thanks, > Ben > >> On Jun 3, 2016, at 10:33 AM, Mich Talebzadeh <mich.talebza...@gmail.com >> <mailto:mich.talebza...@gmail.com>> wrote: >> >> what version of spark are you using >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> >> >> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> >> >> >> On 3 June 2016 at 17:51, Mich Talebzadeh <mich.talebza...@gmail.com >> <mailto:mich.talebza...@gmail.com>> wrote: >> ok what is the new column is called? you are basically adding a new column >> to an already existing table >> >> >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> >> >> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> >> >> >> On 3 June 2016 at 17:04, Benjamin Kim <bbuil...@gmail.com >> <mailto:bbuil...@gmail.com>> wrote: >> The table already exists. >> >> CREATE EXTERNAL TABLE `amo_bi_events`( >> `event_type` string COMMENT '', >> >> `timestamp` string COMMENT '', >> >> `event_valid` int COMMENT '', >> >> `event_subtype` string COMMENT '', >> >> `user_ip` string COMMENT '', >> >> `user_id` string COMMENT '', >> >> `cookie_status` string COMMENT '', >> >> `profile_status` string COMMENT '', >> >> `user_status` string COMMENT '', >> >> `previous_timestamp` string COMMENT '', >> >> `user_agent` string COMMENT '', >> >> `referer` string COMMENT '', >> >> `uri` string COMMENT '', >> >> `request_elapsed` bigint COMMENT '', >> >> `browser_languages` string COMMENT '', >> >> `acamp_id` int COMMENT '', >> >> `creative_id` int COMMENT '', >> >> `location_id` int COMMENT '', >> >> `pcamp_id` int COMMENT '', >> >> `pdomain_id` int COMMENT '', >> >> `country` string COMMENT '', >> >> `region` string COMMENT '', >> >> `dma` int COMMENT '', >> >> `city` string COMMENT '', >> >> `zip` string COMMENT '', >> >> `isp` string COMMENT '', >> >> `line_speed` string COMMENT '', >> >> `gender` string COMMENT '', >> >> `year_of_birth` int COMMENT '', >> >> `behaviors_read` string COMMENT '', >> >> `behaviors_written` string COMMENT '', >> >> `key_value_pairs` string COMMENT '', >> >> `acamp_candidates` int COMMENT '', >> >> `tag_format` string COMMENT '', >> >> `optimizer_name` string COMMENT '', >> >> `optimizer_version` string COMMENT '', >> >> `optimizer_ip` string COMMENT '', >> >> `pixel_id` int COMMENT '', >> >> `video_id` string COMMENT '', >> >> `video_network_id` int COMMENT '', >> >> `video_time_watched` bigint COMMENT '', >> >> `video_percentage_watched` int COMMENT '', >> >> `conversion_valid_sale` int COMMENT '', >> >> `conversion_sale_amount` float COMMENT '', >> >> `conversion_commission_amount` float COMMENT '', >> >> `conversion_step` int COMMENT '', >> >> `conversion_currency` string COMMENT '', >> >> `conversion_attribution` int COMMENT '', >> >> `conversion_offer_id` string COMMENT '', >> >> `custom_info` string COMMENT '', >> >> `frequency` int COMMENT '', >> >> `recency_seconds` int COMMENT '', >> >> `cost` float COMMENT '', >> >> `revenue` float COMMENT '', >> >> `optimizer_acamp_id` int COMMENT '', >> >> `optimizer_creative_id` int COMMENT '', >> >> `optimizer_ecpm` float COMMENT '', >> >> `event_id` string COMMENT '', >> >> `impression_id` string COMMENT '', >> >> `diagnostic_data` string COMMENT '', >> >> `user_profile_mapping_source` string COMMENT '', >> >> `latitude` float COMMENT '', >> >> `longitude` float COMMENT '', >> >> `area_code` int COMMENT '', >> >> `gmt_offset` string COMMENT '', >> >> `in_dst` string COMMENT '', >> >> `proxy_type` string COMMENT '', >> >> `mobile_carrier` string COMMENT '', >> >> `pop` string COMMENT '', >> >> `hostname` string COMMENT '', >> >> `profile_ttl` string COMMENT '', >> >> `timestamp_iso` string COMMENT '', >> >> `reference_id` string COMMENT '', >> >> `identity_organization` string COMMENT '', >> >> `identity_method` string COMMENT '', >> >> `mappable_id` string COMMENT '', >> >> `profile_expires` string COMMENT '', >> >> `video_player_iframed` int COMMENT '', >> >> `video_player_in_view` int COMMENT '', >> >> `video_player_width` int COMMENT '', >> >> `video_player_height` int COMMENT '', >> >> `host_domain` string COMMENT '', >> >> `browser_type` string COMMENT '', >> >> `browser_device_cat` string COMMENT '', >> >> `browser_family` string COMMENT '', >> >> `browser_name` string COMMENT '', >> >> `browser_version` string COMMENT '', >> >> `browser_major_version` string COMMENT '', >> >> `browser_minor_version` string COMMENT '', >> >> `os_family` string COMMENT '', >> >> `os_name` string COMMENT '', >> >> `os_version` string COMMENT '', >> >> `os_major_version` string COMMENT '', >> >> `os_minor_version` string COMMENT '') >> >> PARTITIONED BY (`dt` timestamp) >> >> STORED AS PARQUET; >> >> Thanks, >> Ben >> >> >>> On Jun 3, 2016, at 8:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com >>> <mailto:mich.talebza...@gmail.com>> wrote: >>> >>> hang on are you saving this as a new table? >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> >>> >>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> >>> >>> >>> On 3 June 2016 at 14:13, Benjamin Kim <bbuil...@gmail.com >>> <mailto:bbuil...@gmail.com>> wrote: >>> Does anyone know how to save data in a DataFrame to a table partitioned >>> using an existing column reformatted into a derived column? >>> >>> val partitionedDf = df.withColumn("dt", >>> concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", >>> 12, 2), lit(":00"))) >>> >>> sqlContext.setConf("hive.exec.dynamic.partition", "true") >>> sqlContext.setConf("hive.exec.dynamic.partition.mode", >>> "nonstrict") >>> partitionedDf.write >>> .mode(SaveMode.Append) >>> .partitionBy("dt") >>> .saveAsTable("ds.amo_bi_events") >>> >>> I am getting an ArrayOutOfBounds error. There are 83 columns in the >>> destination table. But after adding the derived column, then I get an 84 >>> error. I assumed that the column used for the partition would not be >>> counted. >>> >>> Can someone please help. >>> >>> Thanks, >>> Ben >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> <mailto:user-unsubscr...@spark.apache.org> >>> For additional commands, e-mail: user-h...@spark.apache.org >>> <mailto:user-h...@spark.apache.org> >>> >>> >> >> >> > >