Susan, I did give that a shot -- I'm seeing a number of oddities:
(1) 'partitionBy' appears to accept only lower-case alphanumeric field names: it works for 'machinename', but not for 'machineName' or 'machine_name'.
(2) When partitioning with map columns included in the data, I get odd string conversion issues.
(3) When partitioning without the map columns, I see frequent out-of-memory errors.

(A sketch of the workarounds I'm experimenting with for (1) and (2) is at the bottom of this message.)

I'll follow up on this thread when I have a more concrete example of the problems.

Regards,

Bryan Jeffrey

On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com> wrote:

> Have you tried partitionBy?
>
> Something like:
>
> hiveWindowsEvents.foreachRDD( rdd => {
>   val eventsDataFrame = rdd.toDF()
>   eventsDataFrame.write.mode(SaveMode.Append)
>     .partitionBy("windows_event_time_bin")
>     .saveAsTable("windows_event")
> })
>
> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:
>
>> Hello.
>>
>> I am working to get a simple solution working using Spark SQL. I am
>> writing streaming data to persistent tables using a HiveContext. Writing
>> to a persistent non-partitioned table works well: I update the table
>> using Spark Streaming, and the output is available via Hive Thrift/JDBC.
>>
>> I create a table that looks like the following:
>>
>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>> +--------------------------+---------------------+----------+
>> | col_name                 | data_type           | comment  |
>> +--------------------------+---------------------+----------+
>> | target_entity            | string              | NULL     |
>> | target_entity_type       | string              | NULL     |
>> | date_time_utc            | timestamp           | NULL     |
>> | machine_ip               | string              | NULL     |
>> | event_id                 | string              | NULL     |
>> | event_data               | map<string,string>  | NULL     |
>> | description              | string              | NULL     |
>> | event_record_id          | string              | NULL     |
>> | level                    | string              | NULL     |
>> | machine_name             | string              | NULL     |
>> | sequence_number          | string              | NULL     |
>> | source                   | string              | NULL     |
>> | source_machine_name      | string              | NULL     |
>> | task_category            | string              | NULL     |
>> | user                     | string              | NULL     |
>> | additional_data          | map<string,string>  | NULL     |
>> | windows_event_time_bin   | timestamp           | NULL     |
>> | # Partition Information  |                     |          |
>> | # col_name               | data_type           | comment  |
>> | windows_event_time_bin   | timestamp           | NULL     |
>> +--------------------------+---------------------+----------+
>>
>> However, when I create a partitioned table and write data using the
>> following:
>>
>> hiveWindowsEvents.foreachRDD( rdd => {
>>   val eventsDataFrame = rdd.toDF()
>>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>> })
>>
>> the data is written as though the table were not partitioned (everything
>> is written to /user/hive/warehouse/windows_event/file.gz.parquet).
>> Because the data does not follow the partition scheme, it is not
>> accessible (and not partitioned).
>>
>> Is there a straightforward way to write to partitioned tables using Spark
>> SQL? I understand that read performance for partitioned data is far
>> better - are there other performance improvements that might be better to
>> use instead of partitioning?
>>
>> Regards,
>>
>> Bryan Jeffrey
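
For reference, here is a minimal sketch of the workarounds I'm experimenting with for (1) and (2). The HiveContext setup, the lower-cased column name 'windowseventtimebin', and dropping the map columns are my own assumptions for testing, not confirmed fixes:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Assumed context: 'sc' is the active SparkContext and 'hiveWindowsEvents'
// is the DStream of case-class events from the snippets above.
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

hiveWindowsEvents.foreachRDD { rdd =>
  val eventsDataFrame = rdd.toDF()
    // (1): rename the partition column to lower-case alphanumerics
    // before calling partitionBy.
    .withColumnRenamed("windows_event_time_bin", "windowseventtimebin")
    // (2): drop the map<string,string> columns before the partitioned
    // write to sidestep the string conversion errors.
    .drop("event_data")
    .drop("additional_data")

  eventsDataFrame.write
    .mode(SaveMode.Append)
    .partitionBy("windowseventtimebin")
    .saveAsTable("windows_event")
}

If the rename alone resolves (1), I'll add the map columns back one at a time to isolate (2).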