Re: need workaround around HIVE-11625 / DISTRO-800

2018-08-08 Thread Pranav Agrawal
Any help, please?

On Tue, Aug 7, 2018 at 1:49 PM, Pranav Agrawal wrote:

> I am hitting this issue:
> https://issues.cloudera.org/browse/DISTRO-800 (related to
> https://issues.apache.org/jira/browse/HIVE-11625)
>
> I am unable to write empty arrays (size 0) of type int or string
> to Parquet. Please assist or suggest a workaround.
>
> Spark version: 2.2.1
> AWS EMR: 5.12, 5.13


need workaround around HIVE-11625 / DISTRO-800

2018-08-07 Thread Pranav Agrawal
I am hitting this issue:
https://issues.cloudera.org/browse/DISTRO-800 (related to
https://issues.apache.org/jira/browse/HIVE-11625)

I am unable to write empty arrays (size 0) of type int or string
to Parquet. Please assist or suggest a workaround.

Spark version: 2.2.1
AWS EMR: 5.12, 5.13


Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
Yes, I have confirmed that the issue is with the array type only.
I exploded the array into structs but I am still getting the same error:


Exception in thread "main" org.apache.spark.sql.AnalysisException: Union
can only be performed on tables with the compatible column types.
struct
<>
struct
at the 21th column of the second table;;

On Mon, Jun 4, 2018 at 2:55 PM, Jorge Machado  wrote:

> Have you tried to narrow down the problem so that we can be 100% sure that
> it lies in the array types? Just exclude them for the sake of testing.
> If we know for certain that the arrays are the cause, try to explode those
> columns into simple types.
>
> Jorge Machado
>
> On 4 Jun 2018, at 11:09, Pranav Agrawal  wrote:
>
> I am ordering the columns before doing the union, so I think that should
> not be the issue:
>
> String[] columns_original_order = baseDs.columns();
> // Sort the top-level column names so both datasets select them in the same order.
> String[] columns = baseDs.columns();
> Arrays.sort(columns);
> baseDs = baseDs.selectExpr(columns);
> incDsForPartition = incDsForPartition.selectExpr(columns);
> if (baseDs.count() > 0) {
>     return baseDs.union(incDsForPartition).selectExpr(columns_original_order);
> } else {
>     return incDsForPartition.selectExpr(columns_original_order);
> }
>
>
> On Mon, Jun 4, 2018 at 2:31 PM, Jorge Machado  wrote:
>
>> Try the same union with a dataframe without the array types. It could be
>> something strange there, such as the column ordering.
>>
>> Jorge Machado
>>
>> On 4 Jun 2018, at 10:17, Pranav Agrawal  wrote:
>>
>> The schema is exactly the same; I am not sure why it is failing.
>>
>> root
>>  |-- booking_id: integer (nullable = true)
>>  |-- booking_rooms_room_category_id: integer (nullable = true)
>>  |-- booking_rooms_room_id: integer (nullable = true)
>>  |-- booking_source: integer (nullable = true)
>>  |-- booking_status: integer (nullable = true)
>>  |-- cancellation_reason: integer (nullable = true)
>>  |-- checkin: string (nullable = true)
>>  |-- checkout: string (nullable = true)
>>  |-- city_id: integer (nullable = true)
>>  |-- cluster_id: integer (nullable = true)
>>  |-- company_id: integer (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- discount: integer (nullable = true)
>>  |-- feedback_created_at: string (nullable = true)
>>  |-- feedback_id: integer (nullable = true)
>>  |-- hotel_id: integer (nullable = true)
>>  |-- hub_id: integer (nullable = true)
>>  |-- month: integer (nullable = true)
>>  |-- no_show_reason: integer (nullable = true)
>>  |-- oyo_rooms: integer (nullable = true)
>>  |-- selling_amount: integer (nullable = true)
>>  |-- shifting: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- id: integer (nullable = true)
>>  |    |    |-- booking_id: integer (nullable = true)
>>  |    |    |-- shifting_status: integer (nullable = true)
>>  |    |    |-- shifting_reason: integer (nullable = true)
>>  |    |    |-- shifting_metadata: integer (nullable = true)
>>  |-- suggest_oyo: integer (nullable = true)
>>  |-- tickets: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- ticket_source: integer (nullable = true)
>>  |    |    |-- ticket_status: string (nullable = true)
>>  |    |    |-- ticket_instance_source: integer (nullable = true)
>>  |    |    |-- ticket_category: string (nullable = true)
>>  |-- updated_at: timestamp (nullable = true)
>>  |-- year: integer (nullable = true)
>>  |-- zone_id: integer (nullable = true)
>>
>> root
>>  |-- booking_id: integer (nullable = true)
>>  |-- booking_rooms_room_category_id: integer (nullable = true)
>>  |-- booking_rooms_room_id: integer (nullable = true)
>>  |-- booking_source: integer (nullable = true)
>>  |-- booking_status: integer (nullable = true)
>>  |-- cancellation_reason: integer (nullable = true)
>>  |-- checkin: string (nullable = true)
>>  |-- checkout: string (nullable = true)
>>  |-- city_id: integer (nullable = true)
>>  |-- cluster_id: integer (nullable = true)
>>  |-- company_id: integer (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- discount: integer (nullable = true)
>>  |-- feedback_created_at: string (nullable = true)
>>  |-- feedback_id: integer (nullable = true)
>>  |-- hotel_id: integer (nullable = true)
>>  |-- hub_id: integer (nullable = true)
>>  |-- month: integer (nullable = true)
>>  |-- no_show_reason: integer (nullable 

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
I am ordering the columns before doing the union, so I think that should
not be the issue:

String[] columns_original_order = baseDs.columns();
// Sort the top-level column names so both datasets select them in the same order.
String[] columns = baseDs.columns();
Arrays.sort(columns);
baseDs = baseDs.selectExpr(columns);
incDsForPartition = incDsForPartition.selectExpr(columns);
if (baseDs.count() > 0) {
    return baseDs.union(incDsForPartition).selectExpr(columns_original_order);
} else {
    return incDsForPartition.selectExpr(columns_original_order);
}


On Mon, Jun 4, 2018 at 2:31 PM, Jorge Machado  wrote:

> Try the same union with a dataframe without the array types. It could be
> something strange there, such as the column ordering.
>
> Jorge Machado
>
> On 4 Jun 2018, at 10:17, Pranav Agrawal  wrote:
>
> The schema is exactly the same; I am not sure why it is failing.
>
> root
>  |-- booking_id: integer (nullable = true)
>  |-- booking_rooms_room_category_id: integer (nullable = true)
>  |-- booking_rooms_room_id: integer (nullable = true)
>  |-- booking_source: integer (nullable = true)
>  |-- booking_status: integer (nullable = true)
>  |-- cancellation_reason: integer (nullable = true)
>  |-- checkin: string (nullable = true)
>  |-- checkout: string (nullable = true)
>  |-- city_id: integer (nullable = true)
>  |-- cluster_id: integer (nullable = true)
>  |-- company_id: integer (nullable = true)
>  |-- created_at: string (nullable = true)
>  |-- discount: integer (nullable = true)
>  |-- feedback_created_at: string (nullable = true)
>  |-- feedback_id: integer (nullable = true)
>  |-- hotel_id: integer (nullable = true)
>  |-- hub_id: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- no_show_reason: integer (nullable = true)
>  |-- oyo_rooms: integer (nullable = true)
>  |-- selling_amount: integer (nullable = true)
>  |-- shifting: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- id: integer (nullable = true)
>  |    |    |-- booking_id: integer (nullable = true)
>  |    |    |-- shifting_status: integer (nullable = true)
>  |    |    |-- shifting_reason: integer (nullable = true)
>  |    |    |-- shifting_metadata: integer (nullable = true)
>  |-- suggest_oyo: integer (nullable = true)
>  |-- tickets: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- ticket_source: integer (nullable = true)
>  |    |    |-- ticket_status: string (nullable = true)
>  |    |    |-- ticket_instance_source: integer (nullable = true)
>  |    |    |-- ticket_category: string (nullable = true)
>  |-- updated_at: timestamp (nullable = true)
>  |-- year: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
>
> root
>  |-- booking_id: integer (nullable = true)
>  |-- booking_rooms_room_category_id: integer (nullable = true)
>  |-- booking_rooms_room_id: integer (nullable = true)
>  |-- booking_source: integer (nullable = true)
>  |-- booking_status: integer (nullable = true)
>  |-- cancellation_reason: integer (nullable = true)
>  |-- checkin: string (nullable = true)
>  |-- checkout: string (nullable = true)
>  |-- city_id: integer (nullable = true)
>  |-- cluster_id: integer (nullable = true)
>  |-- company_id: integer (nullable = true)
>  |-- created_at: string (nullable = true)
>  |-- discount: integer (nullable = true)
>  |-- feedback_created_at: string (nullable = true)
>  |-- feedback_id: integer (nullable = true)
>  |-- hotel_id: integer (nullable = true)
>  |-- hub_id: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- no_show_reason: integer (nullable = true)
>  |-- oyo_rooms: integer (nullable = true)
>  |-- selling_amount: integer (nullable = true)
>  |-- shifting: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- id: integer (nullable = true)
>  |    |    |-- booking_id: integer (nullable = true)
>  |    |    |-- shifting_status: integer (nullable = true)
>  |    |    |-- shifting_reason: integer (nullable = true)
>  |    |    |-- shifting_metadata: integer (nullable = true)
>  |-- suggest_oyo: integer (nullable = true)
>  |-- tickets: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- ticket_source: integer (nullable = true)
>  |    |    |-- ticket_status: string (nullable = true)
>  |    |    |-- ticket_instance_source: integer (nullable = true)
>  |    |    |-- ticket_category: string (nullable = true)
>  |-- updated_at: timestamp (nullable = false)
>  |-- year: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
>
> On Sun, Jun 3, 2018 at 8:05 PM, Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> H

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
The schema is exactly the same; I am not sure why it is failing.

root
 |-- booking_id: integer (nullable = true)
 |-- booking_rooms_room_category_id: integer (nullable = true)
 |-- booking_rooms_room_id: integer (nullable = true)
 |-- booking_source: integer (nullable = true)
 |-- booking_status: integer (nullable = true)
 |-- cancellation_reason: integer (nullable = true)
 |-- checkin: string (nullable = true)
 |-- checkout: string (nullable = true)
 |-- city_id: integer (nullable = true)
 |-- cluster_id: integer (nullable = true)
 |-- company_id: integer (nullable = true)
 |-- created_at: string (nullable = true)
 |-- discount: integer (nullable = true)
 |-- feedback_created_at: string (nullable = true)
 |-- feedback_id: integer (nullable = true)
 |-- hotel_id: integer (nullable = true)
 |-- hub_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- no_show_reason: integer (nullable = true)
 |-- oyo_rooms: integer (nullable = true)
 |-- selling_amount: integer (nullable = true)
 |-- shifting: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- booking_id: integer (nullable = true)
 |    |    |-- shifting_status: integer (nullable = true)
 |    |    |-- shifting_reason: integer (nullable = true)
 |    |    |-- shifting_metadata: integer (nullable = true)
 |-- suggest_oyo: integer (nullable = true)
 |-- tickets: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ticket_source: integer (nullable = true)
 |    |    |-- ticket_status: string (nullable = true)
 |    |    |-- ticket_instance_source: integer (nullable = true)
 |    |    |-- ticket_category: string (nullable = true)
 |-- updated_at: timestamp (nullable = true)
 |-- year: integer (nullable = true)
 |-- zone_id: integer (nullable = true)

root
 |-- booking_id: integer (nullable = true)
 |-- booking_rooms_room_category_id: integer (nullable = true)
 |-- booking_rooms_room_id: integer (nullable = true)
 |-- booking_source: integer (nullable = true)
 |-- booking_status: integer (nullable = true)
 |-- cancellation_reason: integer (nullable = true)
 |-- checkin: string (nullable = true)
 |-- checkout: string (nullable = true)
 |-- city_id: integer (nullable = true)
 |-- cluster_id: integer (nullable = true)
 |-- company_id: integer (nullable = true)
 |-- created_at: string (nullable = true)
 |-- discount: integer (nullable = true)
 |-- feedback_created_at: string (nullable = true)
 |-- feedback_id: integer (nullable = true)
 |-- hotel_id: integer (nullable = true)
 |-- hub_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- no_show_reason: integer (nullable = true)
 |-- oyo_rooms: integer (nullable = true)
 |-- selling_amount: integer (nullable = true)
 |-- shifting: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- booking_id: integer (nullable = true)
 |    |    |-- shifting_status: integer (nullable = true)
 |    |    |-- shifting_reason: integer (nullable = true)
 |    |    |-- shifting_metadata: integer (nullable = true)
 |-- suggest_oyo: integer (nullable = true)
 |-- tickets: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ticket_source: integer (nullable = true)
 |    |    |-- ticket_status: string (nullable = true)
 |    |    |-- ticket_instance_source: integer (nullable = true)
 |    |    |-- ticket_category: string (nullable = true)
 |-- updated_at: timestamp (nullable = false)
 |-- year: integer (nullable = true)
 |-- zone_id: integer (nullable = true)
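
Two schema dumps of this length are hard to compare by eye, so a mechanical line-by-line comparison may help. Below is a minimal sketch in plain Java, assuming each dump is available as a string (for example, the text that printSchema() prints). The class name SchemaDiff and the shortened sample dumps are hypothetical and not from this thread. In the two dumps above, the only visible difference is the nullable flag of updated_at (true in the first, false in the second); whether that is related to the reported error is not established in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: compares two printSchema()-style dumps line by line. */
final class SchemaDiff {
    private SchemaDiff() {}

    /** Returns one entry for every line position at which the two dumps differ. */
    static List<String> diff(String left, String right) {
        String[] l = left.split("\\R");   // \R matches any line break
        String[] r = right.split("\\R");
        List<String> out = new ArrayList<>();
        int n = Math.max(l.length, r.length);
        for (int i = 0; i < n; i++) {
            // A dump that ends early is reported as a missing line.
            String a = i < l.length ? l[i].trim() : "<missing>";
            String b = i < r.length ? r[i].trim() : "<missing>";
            if (!a.equals(b)) {
                out.add("line " + (i + 1) + ": " + a + " <> " + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened sample dumps, for illustration only.
        String base = "root\n"
                + " |-- updated_at: timestamp (nullable = true)\n"
                + " |-- year: integer (nullable = true)";
        String inc = "root\n"
                + " |-- updated_at: timestamp (nullable = false)\n"
                + " |-- year: integer (nullable = true)";
        diff(base, inc).forEach(System.out::println);
    }
}
```

An empty result means the two dumps are textually identical, including the order and names of nested struct fields.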

On Sun, Jun 3, 2018 at 8:05 PM, Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hi Pranav,
> I don't have an answer to your issue, but what I generally do in these
> cases is to first simplify the problem to a point where it is easier to
> check what's going on, and then add back "pieces" one by one until I spot
> the error.
>
> In your case I suggest that you:
>
> 1) project the dataset to the problematic column only (column 21 from your
> log)
> 2) use the explode function to have one element of the array per line
> 3) flatten the struct
>
> At each step use printSchema() to double-check that the types are as you
> expect them to be, and that they are the same for both datasets.
>
> Best regards,
> Alessandro
>
> On 2 June 2018 at 19:48, Pranav Agrawal  wrote:
>
>> I can't get around this error when performing a union of two datasets
>> (ds1.union(ds2)) that have complex data types (struct, list):
>>
>>
>> 18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED,
>> exitCode: 15, (reason: User class threw exception:
>> org.apache.spark.sql.AnalysisException: Union can only be performed on
>> tables with the compatible column types.
>> array>
>> <>
>> array>

[Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-02 Thread Pranav Agrawal
I can't get around this error when performing a union of two datasets
(ds1.union(ds2)) that have complex data types (struct, list):


18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
array>
<>
array>
at the 21th column of the second table;;
As far as I can tell, they are the same. What am I doing wrong? Any help /
workaround appreciated!

spark version: 2.2.1

Thanks,
Pranav
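
The AnalysisException prints the two column types on either side of "<>". When these are long nested types, the point where they diverge is easy to miss. Below is a minimal sketch in plain Java that reports the first index at which two type strings differ. The class name TypeStringDiff is hypothetical, and the example strings are made up, because the real type strings are cut off in this archive.

```java
/** Hypothetical helper: locates the first position at which two type strings differ. */
final class TypeStringDiff {
    private TypeStringDiff() {}

    /** Returns the index of the first differing character, or -1 if the strings are equal. */
    static int firstDifference(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // Either the strings are identical or one is a prefix of the other.
        return a.length() == b.length() ? -1 : shared;
    }

    /** Returns the part of s starting at index, or "" if index is out of range. */
    static String tailFrom(String s, int index) {
        return index < 0 || index >= s.length() ? "" : s.substring(index);
    }

    public static void main(String[] args) {
        // Made-up example types; only the order of the struct fields differs.
        String first = "array<struct<id:int,booking_id:int>>";
        String second = "array<struct<booking_id:int,id:int>>";
        int at = firstDifference(first, second);
        System.out.println("first difference at index " + at);
        System.out.println("  first : " + tailFrom(first, at));
        System.out.println("  second: " + tailFrom(second, at));
    }
}
```

A difference in field order or field name inside a nested struct shows up here even when the top-level column names of both datasets match.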


[Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-02 Thread Pranav Agrawal
I can't get around this error when performing a union of two datasets that
have complex data types (struct, list):


18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
array>
<>
array>
at the 21th column of the second table;;
As far as I can tell, they are the same. What am I doing wrong? Any help /
workaround appreciated!

spark version: 2.2.1

Thanks,
Pranav