Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
yes, the issue is with the array type only; I have confirmed that.
I exploded the array to a struct but am still getting the same error:


Exception in thread "main" org.apache.spark.sql.AnalysisException: Union
can only be performed on tables with the compatible column types.
struct
<>
struct
at the 21th column of the second table;;
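
A minimal diagnostic sketch (not from the thread) that can pinpoint which column the analyzer is complaining about: compare the two schemas field by field, including nullability. The names baseDs and incDsForPartition are taken from the code quoted later in this thread, and both datasets are assumed to be column-aligned already:

import org.apache.spark.sql.types.StructField;

StructField[] a = baseDs.schema().fields();
StructField[] b = incDsForPartition.schema().fields();
for (int i = 0; i < Math.min(a.length, b.length); i++) {
    // flag any column whose data type (including nested fields) or nullability differs
    if (!a[i].dataType().equals(b[i].dataType()) || a[i].nullable() != b[i].nullable()) {
        System.out.println("column " + i + ": " + a[i] + " vs " + b[i]);
    }
}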

On Mon, Jun 4, 2018 at 2:55 PM, Jorge Machado  wrote:

> Have you tried to narrow down the problem so that we can be 100% sure that
> it lies in the array types? Just exclude them for the sake of testing.
> If we know 100% that it is this array stuff, try to explode those columns
> into simple types.
>
> Jorge Machado
>
>
>
>
>
>
> On 4 Jun 2018, at 11:09, Pranav Agrawal  wrote:
>
> I am ordering the columns before doing union, so I think that should not
> be an issue,
>
>
>
>
>
>
>
>
>
>
> String[] columns_original_order = baseDs.columns();
> String[] columns = baseDs.columns();
> Arrays.sort(columns);
> baseDs = baseDs.selectExpr(columns);
> incDsForPartition = incDsForPartition.selectExpr(columns);
> if (baseDs.count() > 0) {
>     return baseDs.union(incDsForPartition).selectExpr(columns_original_order);
> } else {
>     return incDsForPartition.selectExpr(columns_original_order);
> }
>
>
> On Mon, Jun 4, 2018 at 2:31 PM, Jorge Machado  wrote:
>
>> Try the same union with a dataframe without the array types. Could be
>> something strange there, like the ordering or something similar.
>>
>> Jorge Machado
>>
>>
>>
>>
>>
>> On 4 Jun 2018, at 10:17, Pranav Agrawal  wrote:
>>
>> schema is exactly the same, not sure why it is failing though.
>>
>> root
>>  |-- booking_id: integer (nullable = true)
>>  |-- booking_rooms_room_category_id: integer (nullable = true)
>>  |-- booking_rooms_room_id: integer (nullable = true)
>>  |-- booking_source: integer (nullable = true)
>>  |-- booking_status: integer (nullable = true)
>>  |-- cancellation_reason: integer (nullable = true)
>>  |-- checkin: string (nullable = true)
>>  |-- checkout: string (nullable = true)
>>  |-- city_id: integer (nullable = true)
>>  |-- cluster_id: integer (nullable = true)
>>  |-- company_id: integer (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- discount: integer (nullable = true)
>>  |-- feedback_created_at: string (nullable = true)
>>  |-- feedback_id: integer (nullable = true)
>>  |-- hotel_id: integer (nullable = true)
>>  |-- hub_id: integer (nullable = true)
>>  |-- month: integer (nullable = true)
>>  |-- no_show_reason: integer (nullable = true)
>>  |-- oyo_rooms: integer (nullable = true)
>>  |-- selling_amount: integer (nullable = true)
>>  |-- shifting: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- id: integer (nullable = true)
>>  |||-- booking_id: integer (nullable = true)
>>  |||-- shifting_status: integer (nullable = true)
>>  |||-- shifting_reason: integer (nullable = true)
>>  |||-- shifting_metadata: integer (nullable = true)
>>  |-- suggest_oyo: integer (nullable = true)
>>  |-- tickets: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- ticket_source: integer (nullable = true)
>>  |||-- ticket_status: string (nullable = true)
>>  |||-- ticket_instance_source: integer (nullable = true)
>>  |||-- ticket_category: string (nullable = true)
>>  |-- updated_at: timestamp (nullable = true)
>>  |-- year: integer (nullable = true)
>>  |-- zone_id: integer (nullable = true)
>>
>> root
>>  |-- booking_id: integer (nullable = true)
>>  |-- booking_rooms_room_category_id: integer (nullable = true)
>>  |-- booking_rooms_room_id: integer (nullable = true)
>>  |-- booking_source: integer (nullable = true)
>>  |-- booking_status: integer (nullable = true)
>>  |-- cancellation_reason: integer (nullable = true)
>>  |-- checkin: string (nullable = true)
>>  |-- checkout: string (nullable = true)
>>  |-- city_id: integer (nullable = true)
>>  |-- cluster_id: integer (nullable = true)
>>  |-- company_id: integer (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- discount: integer (nullable = true)
>>  |-- feedback_created_at: string (nullable = true)
>>  |-- feedback_id: integer (nullable = true)
>>  |-- hotel_id: integer (nullable = true)
>>  |-- hub_id: integer (nullable = true)
>>  |-- month: integer (nullable = true)
>>  |-- no_show_reason: integer (nullable = true)
>>  |-- oyo_rooms: integer (nullable = true)
>>  |-- selling_amount: integer (nullable = true)
>>  |-- shifting: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- id: integer (nullable = true)
>>  |||-- booking_id: integer (nullable = true)
>>  |||-- shifting_status: integer (nullable = true)
>>  |||-- shifting_reason: integer (nullable = true)
>>  |||-- shifting_metadata: integer (nullable = true)
>>  |-- suggest_oyo: integer (nullable = true)
>>  |-- tickets: array (nullable = true)
>>  |   

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Jorge Machado
Have you tried to narrow down the problem so that we can be 100% sure that it
lies in the array types? Just exclude them for the sake of testing.
If we know 100% that it is this array stuff, try to explode those columns into
simple types.

Jorge Machado






> On 4 Jun 2018, at 11:09, Pranav Agrawal  wrote:
> 
> I am ordering the columns before doing union, so I think that should not be 
> an issue,
> 
> String[] columns_original_order = baseDs.columns();
> String[] columns = baseDs.columns();
> Arrays.sort(columns);
> baseDs = baseDs.selectExpr(columns);
> incDsForPartition = incDsForPartition.selectExpr(columns);
> 
> if (baseDs.count() > 0) {
>     return baseDs.union(incDsForPartition).selectExpr(columns_original_order);
> } else {
>     return incDsForPartition.selectExpr(columns_original_order);
> }
> 
> 
> On Mon, Jun 4, 2018 at 2:31 PM, Jorge Machado  > wrote:
> Try the same union with a dataframe without the array types. Could be 
> something strange there, like the ordering or something similar.
> 
> Jorge Machado
> 
> 
> 
> 
> 
>> On 4 Jun 2018, at 10:17, Pranav Agrawal > > wrote:
>> 
>> schema is exactly the same, not sure why it is failing though.
>> 
>> root
>>  |-- booking_id: integer (nullable = true)
>>  |-- booking_rooms_room_category_id: integer (nullable = true)
>>  |-- booking_rooms_room_id: integer (nullable = true)
>>  |-- booking_source: integer (nullable = true)
>>  |-- booking_status: integer (nullable = true)
>>  |-- cancellation_reason: integer (nullable = true)
>>  |-- checkin: string (nullable = true)
>>  |-- checkout: string (nullable = true)
>>  |-- city_id: integer (nullable = true)
>>  |-- cluster_id: integer (nullable = true)
>>  |-- company_id: integer (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- discount: integer (nullable = true)
>>  |-- feedback_created_at: string (nullable = true)
>>  |-- feedback_id: integer (nullable = true)
>>  |-- hotel_id: integer (nullable = true)
>>  |-- hub_id: integer (nullable = true)
>>  |-- month: integer (nullable = true)
>>  |-- no_show_reason: integer (nullable = true)
>>  |-- oyo_rooms: integer (nullable = true)
>>  |-- selling_amount: integer (nullable = true)
>>  |-- shifting: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- id: integer (nullable = true)
>>  |||-- booking_id: integer (nullable = true)
>>  |||-- shifting_status: integer (nullable = true)
>>  |||-- shifting_reason: integer (nullable = true)
>>  |||-- shifting_metadata: integer (nullable = true)
>>  |-- suggest_oyo: integer (nullable = true)
>>  |-- tickets: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- ticket_source: integer (nullable = true)
>>  |||-- ticket_status: string (nullable = true)
>>  |||-- ticket_instance_source: integer (nullable = true)
>>  |||-- ticket_category: string (nullable = true)
>>  |-- updated_at: timestamp (nullable = true)
>>  |-- year: integer (nullable = true)
>>  |-- zone_id: integer (nullable = true)
>> 
>> root
>>  |-- booking_id: integer (nullable = true)
>>  |-- booking_rooms_room_category_id: integer (nullable = true)
>>  |-- booking_rooms_room_id: integer (nullable = true)
>>  |-- booking_source: integer (nullable = true)
>>  |-- booking_status: integer (nullable = true)
>>  |-- cancellation_reason: integer (nullable = true)
>>  |-- checkin: string (nullable = true)
>>  |-- checkout: string (nullable = true)
>>  |-- city_id: integer (nullable = true)
>>  |-- cluster_id: integer (nullable = true)
>>  |-- company_id: integer (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- discount: integer (nullable = true)
>>  |-- feedback_created_at: string (nullable = true)
>>  |-- feedback_id: integer (nullable = true)
>>  |-- hotel_id: integer (nullable = true)
>>  |-- hub_id: integer (nullable = true)
>>  |-- month: integer (nullable = true)
>>  |-- no_show_reason: integer (nullable = true)
>>  |-- oyo_rooms: integer (nullable = true)
>>  |-- selling_amount: integer (nullable = true)
>>  |-- shifting: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- id: integer (nullable = true)
>>  |||-- booking_id: integer (nullable = true)
>>  |||-- shifting_status: integer (nullable = true)
>>  |||-- shifting_reason: integer (nullable = true)
>>  |||-- shifting_metadata: integer (nullable = true)
>>  |-- suggest_oyo: integer (nullable = true)
>>  |-- tickets: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- ticket_source: integer (nullable = true)
>>  |||-- ticket_status: string (nullable = true)
>>  |||-- ticket_instance_source: integer (nullable = true)
>>  |||-- ticket_category: string (nullable = true)
>>  |-- up

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
I am ordering the columns before doing union, so I think that should not be
an issue,










String[] columns_original_order = baseDs.columns();
String[] columns = baseDs.columns();
Arrays.sort(columns);
baseDs = baseDs.selectExpr(columns);
incDsForPartition = incDsForPartition.selectExpr(columns);
if (baseDs.count() > 0) {
    return baseDs.union(incDsForPartition).selectExpr(columns_original_order);
} else {
    return incDsForPartition.selectExpr(columns_original_order);
}
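
(A possible workaround, not something tried in this thread: if the mismatch is only in nested nullability or nested field order inside the array-of-struct column, casting that column in one dataset to the other dataset's exact type before the union may reconcile the two schemas. Whether cast accepts a particular array-of-struct conversion depends on the Spark version; the column name "shifting" below is only an example taken from the schemas posted in this thread.)

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.types.DataType;

// take the target type from the base dataset and cast the incremental dataset's column to it
DataType target = baseDs.schema().apply("shifting").dataType();
incDsForPartition = incDsForPartition.withColumn("shifting", col("shifting").cast(target));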


On Mon, Jun 4, 2018 at 2:31 PM, Jorge Machado  wrote:

> Try the same union with a dataframe without the array types. Could be
> something strange there, like the ordering or something similar.
>
> Jorge Machado
>
>
>
>
>
> On 4 Jun 2018, at 10:17, Pranav Agrawal  wrote:
>
> schema is exactly the same, not sure why it is failing though.
>
> root
>  |-- booking_id: integer (nullable = true)
>  |-- booking_rooms_room_category_id: integer (nullable = true)
>  |-- booking_rooms_room_id: integer (nullable = true)
>  |-- booking_source: integer (nullable = true)
>  |-- booking_status: integer (nullable = true)
>  |-- cancellation_reason: integer (nullable = true)
>  |-- checkin: string (nullable = true)
>  |-- checkout: string (nullable = true)
>  |-- city_id: integer (nullable = true)
>  |-- cluster_id: integer (nullable = true)
>  |-- company_id: integer (nullable = true)
>  |-- created_at: string (nullable = true)
>  |-- discount: integer (nullable = true)
>  |-- feedback_created_at: string (nullable = true)
>  |-- feedback_id: integer (nullable = true)
>  |-- hotel_id: integer (nullable = true)
>  |-- hub_id: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- no_show_reason: integer (nullable = true)
>  |-- oyo_rooms: integer (nullable = true)
>  |-- selling_amount: integer (nullable = true)
>  |-- shifting: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: integer (nullable = true)
>  |||-- booking_id: integer (nullable = true)
>  |||-- shifting_status: integer (nullable = true)
>  |||-- shifting_reason: integer (nullable = true)
>  |||-- shifting_metadata: integer (nullable = true)
>  |-- suggest_oyo: integer (nullable = true)
>  |-- tickets: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- ticket_source: integer (nullable = true)
>  |||-- ticket_status: string (nullable = true)
>  |||-- ticket_instance_source: integer (nullable = true)
>  |||-- ticket_category: string (nullable = true)
>  |-- updated_at: timestamp (nullable = true)
>  |-- year: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
>
> root
>  |-- booking_id: integer (nullable = true)
>  |-- booking_rooms_room_category_id: integer (nullable = true)
>  |-- booking_rooms_room_id: integer (nullable = true)
>  |-- booking_source: integer (nullable = true)
>  |-- booking_status: integer (nullable = true)
>  |-- cancellation_reason: integer (nullable = true)
>  |-- checkin: string (nullable = true)
>  |-- checkout: string (nullable = true)
>  |-- city_id: integer (nullable = true)
>  |-- cluster_id: integer (nullable = true)
>  |-- company_id: integer (nullable = true)
>  |-- created_at: string (nullable = true)
>  |-- discount: integer (nullable = true)
>  |-- feedback_created_at: string (nullable = true)
>  |-- feedback_id: integer (nullable = true)
>  |-- hotel_id: integer (nullable = true)
>  |-- hub_id: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- no_show_reason: integer (nullable = true)
>  |-- oyo_rooms: integer (nullable = true)
>  |-- selling_amount: integer (nullable = true)
>  |-- shifting: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: integer (nullable = true)
>  |||-- booking_id: integer (nullable = true)
>  |||-- shifting_status: integer (nullable = true)
>  |||-- shifting_reason: integer (nullable = true)
>  |||-- shifting_metadata: integer (nullable = true)
>  |-- suggest_oyo: integer (nullable = true)
>  |-- tickets: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- ticket_source: integer (nullable = true)
>  |||-- ticket_status: string (nullable = true)
>  |||-- ticket_instance_source: integer (nullable = true)
>  |||-- ticket_category: string (nullable = true)
>  |-- updated_at: timestamp (nullable = false)
>  |-- year: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
>
> On Sun, Jun 3, 2018 at 8:05 PM, Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> Hi Pranav,
>> I don't have an answer to your issue, but what I generally do in these
>> cases is to first try to simplify it to a point where it is easier to check
>> what's going on, and then add back "pieces" one by one until I spot the
>> error.
>>
>> In your case I can suggest to:
>>
>> 1) project t

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Jorge Machado
Try the same union with a dataframe without the array types. Could be 
something strange there, like the ordering or something similar.

Jorge Machado





> On 4 Jun 2018, at 10:17, Pranav Agrawal  wrote:
> 
> schema is exactly the same, not sure why it is failing though.
> 
> root
>  |-- booking_id: integer (nullable = true)
>  |-- booking_rooms_room_category_id: integer (nullable = true)
>  |-- booking_rooms_room_id: integer (nullable = true)
>  |-- booking_source: integer (nullable = true)
>  |-- booking_status: integer (nullable = true)
>  |-- cancellation_reason: integer (nullable = true)
>  |-- checkin: string (nullable = true)
>  |-- checkout: string (nullable = true)
>  |-- city_id: integer (nullable = true)
>  |-- cluster_id: integer (nullable = true)
>  |-- company_id: integer (nullable = true)
>  |-- created_at: string (nullable = true)
>  |-- discount: integer (nullable = true)
>  |-- feedback_created_at: string (nullable = true)
>  |-- feedback_id: integer (nullable = true)
>  |-- hotel_id: integer (nullable = true)
>  |-- hub_id: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- no_show_reason: integer (nullable = true)
>  |-- oyo_rooms: integer (nullable = true)
>  |-- selling_amount: integer (nullable = true)
>  |-- shifting: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: integer (nullable = true)
>  |||-- booking_id: integer (nullable = true)
>  |||-- shifting_status: integer (nullable = true)
>  |||-- shifting_reason: integer (nullable = true)
>  |||-- shifting_metadata: integer (nullable = true)
>  |-- suggest_oyo: integer (nullable = true)
>  |-- tickets: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- ticket_source: integer (nullable = true)
>  |||-- ticket_status: string (nullable = true)
>  |||-- ticket_instance_source: integer (nullable = true)
>  |||-- ticket_category: string (nullable = true)
>  |-- updated_at: timestamp (nullable = true)
>  |-- year: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
> 
> root
>  |-- booking_id: integer (nullable = true)
>  |-- booking_rooms_room_category_id: integer (nullable = true)
>  |-- booking_rooms_room_id: integer (nullable = true)
>  |-- booking_source: integer (nullable = true)
>  |-- booking_status: integer (nullable = true)
>  |-- cancellation_reason: integer (nullable = true)
>  |-- checkin: string (nullable = true)
>  |-- checkout: string (nullable = true)
>  |-- city_id: integer (nullable = true)
>  |-- cluster_id: integer (nullable = true)
>  |-- company_id: integer (nullable = true)
>  |-- created_at: string (nullable = true)
>  |-- discount: integer (nullable = true)
>  |-- feedback_created_at: string (nullable = true)
>  |-- feedback_id: integer (nullable = true)
>  |-- hotel_id: integer (nullable = true)
>  |-- hub_id: integer (nullable = true)
>  |-- month: integer (nullable = true)
>  |-- no_show_reason: integer (nullable = true)
>  |-- oyo_rooms: integer (nullable = true)
>  |-- selling_amount: integer (nullable = true)
>  |-- shifting: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: integer (nullable = true)
>  |||-- booking_id: integer (nullable = true)
>  |||-- shifting_status: integer (nullable = true)
>  |||-- shifting_reason: integer (nullable = true)
>  |||-- shifting_metadata: integer (nullable = true)
>  |-- suggest_oyo: integer (nullable = true)
>  |-- tickets: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- ticket_source: integer (nullable = true)
>  |||-- ticket_status: string (nullable = true)
>  |||-- ticket_instance_source: integer (nullable = true)
>  |||-- ticket_category: string (nullable = true)
>  |-- updated_at: timestamp (nullable = false)
>  |-- year: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
> 
> On Sun, Jun 3, 2018 at 8:05 PM, Alessandro Solimando
> <alessandro.solima...@gmail.com> wrote:
> Hi Pranav,
> I don't have an answer to your issue, but what I generally do in these cases
> is to first try to simplify it to a point where it is easier to check what's
> going on, and then add back "pieces" one by one until I spot the error.
> 
> In your case I can suggest to: 
> 
> 1) project the dataset to the problematic column only (column 21 from your 
> log)
> 2) use explode function to have one element of the array per line
> 3) flatten the struct 
> 
> At each step use printSchema() to double check if the types are as you expect 
> them to be, and if they are the same for both datasets.
> 
> Best regards,
> Alessandro 
> 
> On 2 June 2018 at 19:48, Pranav Agrawal  > wrote:
> can't get around this error when performing union of two datasets 
> (ds1.union(ds2)) having complex data type (struct, list),
> 
> 18/06/02 15:12:00 IN

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
schema is exactly the same, not sure why it is failing though.

root
 |-- booking_id: integer (nullable = true)
 |-- booking_rooms_room_category_id: integer (nullable = true)
 |-- booking_rooms_room_id: integer (nullable = true)
 |-- booking_source: integer (nullable = true)
 |-- booking_status: integer (nullable = true)
 |-- cancellation_reason: integer (nullable = true)
 |-- checkin: string (nullable = true)
 |-- checkout: string (nullable = true)
 |-- city_id: integer (nullable = true)
 |-- cluster_id: integer (nullable = true)
 |-- company_id: integer (nullable = true)
 |-- created_at: string (nullable = true)
 |-- discount: integer (nullable = true)
 |-- feedback_created_at: string (nullable = true)
 |-- feedback_id: integer (nullable = true)
 |-- hotel_id: integer (nullable = true)
 |-- hub_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- no_show_reason: integer (nullable = true)
 |-- oyo_rooms: integer (nullable = true)
 |-- selling_amount: integer (nullable = true)
 |-- shifting: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: integer (nullable = true)
 |||-- booking_id: integer (nullable = true)
 |||-- shifting_status: integer (nullable = true)
 |||-- shifting_reason: integer (nullable = true)
 |||-- shifting_metadata: integer (nullable = true)
 |-- suggest_oyo: integer (nullable = true)
 |-- tickets: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- ticket_source: integer (nullable = true)
 |||-- ticket_status: string (nullable = true)
 |||-- ticket_instance_source: integer (nullable = true)
 |||-- ticket_category: string (nullable = true)
 |-- updated_at: timestamp (nullable = true)
 |-- year: integer (nullable = true)
 |-- zone_id: integer (nullable = true)

root
 |-- booking_id: integer (nullable = true)
 |-- booking_rooms_room_category_id: integer (nullable = true)
 |-- booking_rooms_room_id: integer (nullable = true)
 |-- booking_source: integer (nullable = true)
 |-- booking_status: integer (nullable = true)
 |-- cancellation_reason: integer (nullable = true)
 |-- checkin: string (nullable = true)
 |-- checkout: string (nullable = true)
 |-- city_id: integer (nullable = true)
 |-- cluster_id: integer (nullable = true)
 |-- company_id: integer (nullable = true)
 |-- created_at: string (nullable = true)
 |-- discount: integer (nullable = true)
 |-- feedback_created_at: string (nullable = true)
 |-- feedback_id: integer (nullable = true)
 |-- hotel_id: integer (nullable = true)
 |-- hub_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- no_show_reason: integer (nullable = true)
 |-- oyo_rooms: integer (nullable = true)
 |-- selling_amount: integer (nullable = true)
 |-- shifting: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: integer (nullable = true)
 |||-- booking_id: integer (nullable = true)
 |||-- shifting_status: integer (nullable = true)
 |||-- shifting_reason: integer (nullable = true)
 |||-- shifting_metadata: integer (nullable = true)
 |-- suggest_oyo: integer (nullable = true)
 |-- tickets: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- ticket_source: integer (nullable = true)
 |||-- ticket_status: string (nullable = true)
 |||-- ticket_instance_source: integer (nullable = true)
 |||-- ticket_category: string (nullable = true)
 |-- updated_at: timestamp (nullable = false)
 |-- year: integer (nullable = true)
 |-- zone_id: integer (nullable = true)

On Sun, Jun 3, 2018 at 8:05 PM, Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hi Pranav,
> I don't have an answer to your issue, but what I generally do in these
> cases is to first try to simplify it to a point where it is easier to check
> what's going on, and then add back "pieces" one by one until I spot the
> error.
>
> In your case I can suggest to:
>
> 1) project the dataset to the problematic column only (column 21 from your
> log)
> 2) use explode function to have one element of the array per line
> 3) flatten the struct
>
> At each step use printSchema() to double check if the types are as you
> expect them to be, and if they are the same for both datasets.
>
> Best regards,
> Alessandro
>
> On 2 June 2018 at 19:48, Pranav Agrawal  wrote:
>
>> can't get around this error when performing union of two datasets
>> (ds1.union(ds2)) having complex data type (struct, list),
>>
>>
>> 18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED,
>> exitCode: 15, (reason: User class threw exception:
>> org.apache.spark.sql.AnalysisException: Union can only be performed on
>> tables with the compatible column types.
>> array>
>> <>
>> array>
>> at the 21th column of the second table;;
>> As far as I can tell, they are the same. What am I doing wrong? Any help
>> / workaround appreciated!
>>
>> 

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-03 Thread Alessandro Solimando
Hi Pranav,
I don't have an answer to your issue, but what I generally do in these cases
is to first try to simplify it to a point where it is easier to check
what's going on, and then add back "pieces" one by one until I spot the
error.

In your case I can suggest to:

1) project the dataset to the problematic column only (column 21 from your
log)
2) use explode function to have one element of the array per line
3) flatten the struct

At each step use printSchema() to double check if the types are as you
expect them to be, and if they are the same for both datasets.
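
A rough sketch of those three steps in Java, assuming the problematic column is the "shifting" array from the schemas posted earlier in the thread (ds stands for either of the two datasets being unioned; repeat the same for the other one):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> projected = ds.select(col("shifting"));                         // 1) project the problematic column
Dataset<Row> exploded = projected.select(explode(col("shifting")).as("s"));  // 2) one array element per line
Dataset<Row> flattened = exploded.select("s.*");                             // 3) flatten the struct
flattened.printSchema();                                                     // compare with the other dataset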

Best regards,
Alessandro

On 2 June 2018 at 19:48, Pranav Agrawal  wrote:

> can't get around this error when performing union of two datasets
> (ds1.union(ds2)) having complex data type (struct, list),
>
>
> 18/06/02 15:12:00 INFO ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception:
> org.apache.spark.sql.AnalysisException: Union can only be performed on
> tables with the compatible column types.
> array>
> <>
> array>
> at the 21th column of the second table;;
> As far as I can tell, they are the same. What am I doing wrong? Any help /
> workaround appreciated!
>
> spark version: 2.2.1
>
> Thanks,
> Pranav
>


Re: SPARK SQL Error

2015-10-15 Thread pnpritchard
Going back to your code, I see that you instantiate the spark context as:
  val sc = new SparkContext(args(0), "Csv loading example")
which will set the master url to "args(0)" and app name to "Csv loading
example". In your case, args(0) is
"hdfs://quickstart.cloudera:8020/people_csv", which obviously is not the
master url, so that is why you are getting the error.

There are two ways to fix this:
1. Add master url to the command line args:
spark-submit --master yarn --class org.spark.apache.CsvDataSource
/home/cloudera/Desktop/TestMain.jar yarn
hdfs://quickstart.cloudera:8020/people_csv

2. Use the no arg SparkContext constructor
I would recommend this since you are using spark-submit, which can set the
master url and app name properties. You would have to change your code to
"val sc = new SparkContext()" and use the "--name" option for spark-submit.
Also, you would have to change your code to read the csv file path from
"args(0)" (since there is only one command line argument, indexed from 0).
spark-submit --master yarn --name "Csv loading example" --class
org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar
hdfs://quickstart.cloudera:8020/people_csv

Lastly, if you look at this documentation:
http://spark.apache.org/docs/latest/submitting-applications.html#master-urls,
"yarn" is not a valid master url. It looks like you need to use
"yarn-client" or "yarn-cluster". Unfortunately, I do not have experience
using yarn, so can't help you there. Here is more documentation for yarn you
can read: http://spark.apache.org/docs/latest/running-on-yarn.html.

-Nick



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25078.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK SQL Error

2015-10-15 Thread Giridhar Maddukuri
Hi Dilip,

I also tried this option: spark-submit --master yarn --class
org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar  --files
hdfs://quickstart.cloudera:8020/people_csv and am getting a similar error:

Exception in thread "main" org.apache.spark.SparkException: Could not parse
Master URL: '--files'
 at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2244)
 at org.apache.spark.SparkContext.(SparkContext.scala:361)
 at org.apache.spark.SparkContext.(SparkContext.scala:154)
 at org.spark.apache.CsvDataSource$.main(CsvDataSource.scala:10)
 at org.spark.apache.CsvDataSource.main(CsvDataSource.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thanks & Regards,
Giri.


On Thu, Oct 15, 2015 at 5:43 PM, Dilip Biswal  wrote:

> Hi Giri,
>
> You are perhaps  missing the "--files" option before the supplied hdfs
> file name ?
>
> spark-submit --master yarn --class org.spark.apache.CsvDataSource
> /home/cloudera/Desktop/TestMain.jar  *--files*
> hdfs://quickstart.cloudera:8020/people_csv
>
> Please refer to Ritchard's comments on why the --files option may be
> redundant in
> your case.
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> From:    Giri 
> To: user@spark.apache.org
> Date: 10/15/2015 02:44 AM
> Subject: Re: SPARK SQL Error
> --
>
>
>
> Hi Ritchard,
>
> Thank you so much again for your input. This time I ran the command in the
> below way:
> spark-submit --master yarn --class org.spark.apache.CsvDataSource
> /home/cloudera/Desktop/TestMain.jar
> hdfs://quickstart.cloudera:8020/people_csv
> But I am facing the new error "Could not parse Master URL:
> 'hdfs://quickstart.cloudera:8020/people_csv'"
> file path is correct
>
> hadoop fs -ls hdfs://quickstart.cloudera:8020/people_csv
> -rw-r--r--   1 cloudera supergroup 29 2015-10-10 00:02
> hdfs://quickstart.cloudera:8020/people_csv
>
> Can you help me to fix this new error
>
> 15/10/15 02:24:39 INFO spark.SparkContext: Added JAR
> file:/home/cloudera/Desktop/TestMain.jar at
> http://10.0.2.15:40084/jars/TestMain.jar with timestamp 1444901079484
> Exception in thread "main" org.apache.spark.SparkException: Could not parse
> Master URL: 'hdfs://quickstart.cloudera:8020/people_csv'
> at
>
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2244)
> at
> org.apache.spark.SparkContext.(SparkContext.scala:361)
> at
> org.apache.spark.SparkContext.(SparkContext.scala:154)
> at
> org.spark.apache.CsvDataSource$.main(CsvDataSource.scala:10)
> at org.spark.apache.CsvDataSource.main(CsvDataSource.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> Thanks & Regards,
> Giri.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25075.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


Re: SPARK SQL Error

2015-10-15 Thread Dilip Biswal
Hi Giri,

You are perhaps  missing the "--files" option before the supplied hdfs 
file name ?

spark-submit --master yarn --class org.spark.apache.CsvDataSource
/home/cloudera/Desktop/TestMain.jar  --files 
hdfs://quickstart.cloudera:8020/people_csv

Please refer to Ritchard's comments on why the --files option may be 
redundant in 
your case. 

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Giri 
To: user@spark.apache.org
Date:   10/15/2015 02:44 AM
Subject:        Re: SPARK SQL Error



Hi Ritchard,

Thank you so much again for your input. This time I ran the command in the
below way:
spark-submit --master yarn --class org.spark.apache.CsvDataSource
/home/cloudera/Desktop/TestMain.jar 
hdfs://quickstart.cloudera:8020/people_csv
But I am facing the new error "Could not parse Master URL:
'hdfs://quickstart.cloudera:8020/people_csv'"
file path is correct
 
hadoop fs -ls hdfs://quickstart.cloudera:8020/people_csv
-rw-r--r--   1 cloudera supergroup 29 2015-10-10 00:02
hdfs://quickstart.cloudera:8020/people_csv

Can you help me to fix this new error

15/10/15 02:24:39 INFO spark.SparkContext: Added JAR
file:/home/cloudera/Desktop/TestMain.jar at
http://10.0.2.15:40084/jars/TestMain.jar with timestamp 1444901079484
Exception in thread "main" org.apache.spark.SparkException: Could not 
parse
Master URL: 'hdfs://quickstart.cloudera:8020/people_csv'
 at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2244)
 at 
org.apache.spark.SparkContext.(SparkContext.scala:361)
 at 
org.apache.spark.SparkContext.(SparkContext.scala:154)
 at 
org.spark.apache.CsvDataSource$.main(CsvDataSource.scala:10)
 at 
org.spark.apache.CsvDataSource.main(CsvDataSource.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks & Regards,
Giri.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25075.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: SPARK SQL Error

2015-10-15 Thread Giri
Hi Ritchard,

Thank you so much again for your input. This time I ran the command in the
below way:
spark-submit --master yarn --class org.spark.apache.CsvDataSource
/home/cloudera/Desktop/TestMain.jar  
hdfs://quickstart.cloudera:8020/people_csv
But I am facing the new error "Could not parse Master URL:
'hdfs://quickstart.cloudera:8020/people_csv'"
file path is correct
 
hadoop fs -ls hdfs://quickstart.cloudera:8020/people_csv
-rw-r--r--   1 cloudera supergroup 29 2015-10-10 00:02
hdfs://quickstart.cloudera:8020/people_csv

Can you help me to fix this new error

15/10/15 02:24:39 INFO spark.SparkContext: Added JAR
file:/home/cloudera/Desktop/TestMain.jar at
http://10.0.2.15:40084/jars/TestMain.jar with timestamp 1444901079484
Exception in thread "main" org.apache.spark.SparkException: Could not parse
Master URL: 'hdfs://quickstart.cloudera:8020/people_csv'
at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2244)
at org.apache.spark.SparkContext.(SparkContext.scala:361)
at org.apache.spark.SparkContext.(SparkContext.scala:154)
at org.spark.apache.CsvDataSource$.main(CsvDataSource.scala:10)
at org.spark.apache.CsvDataSource.main(CsvDataSource.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks & Regards,
Giri.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25075.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK SQL Error

2015-10-14 Thread pnpritchard
I think the stack trace is quite informative.

Assuming line 10 of CsvDataSource is "val df =
sqlContext.load("com.databricks.spark.csv", Map("path" ->
args(1),"header"->"true"))", then the "args(1)" call is throwing an
ArrayIndexOutOfBoundsException. The reason is that you aren't
passing any command line arguments to your application. When using
spark-submit, you should put all of your app command line arguments at the
end, after the jar. In your example, I think you'd want:

spark-submit --master yarn --class org.spark.apache.CsvDataSource --files
hdfs:///people_csv /home/cloudera/Desktop/TestMain.jar hdfs:///people_csv

Also, I don't think it is necessary for you to have "--files
hdfs:///people_csv". The documentation for this option says "Comma-separated
list of files to be placed in the working directory of each executor." Since
you are going to read the "people_csv" file from hdfs, rather than the local
file system, it seems unnecessary.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25064.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK SQL Error

2015-10-13 Thread pnpritchard
Your app jar should be at the end of the command, without the --jars prefix.
That option is only necessary if you have more than one jar to put on the
classpath (i.e. dependency jars that aren't packaged inside your app jar).

spark-submit --master yarn --class org.spark.apache.CsvDataSource --files
hdfs:///people_csv /home/cloudera/Desktop/TestMain.jar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Error

2015-07-30 Thread Akhil Das
It seems to be an issue with the ES connector:
https://github.com/elastic/elasticsearch-hadoop/issues/482

Thanks
Best Regards

On Tue, Jul 28, 2015 at 6:14 AM, An Tran  wrote:

> Hello all,
>
> I am currently having an error with Spark SQL accessing Elasticsearch using
> the Elasticsearch Spark integration.  Below is the series of commands I issued
> along with the stacktrace.  I am unclear what the error could mean.  I can
> print the schema correctly but error out if I try to display a few
> results.  Can you guys point me in the right direction?
>
> scala>
> sqlContext.read.format("org.elasticsearch.spark.sql").options(esOptions).load("reddit_comment_public-201507-v3/default").registerTempTable("reddit_comment")
>
>
> scala> reddit_comment_df.printSchema
>
> root
>
>  |-- data: struct (nullable = true)
>
>  ||-- archived: boolean (nullable = true)
>
>  ||-- author: string (nullable = true)
>
>  ||-- author_flair_css_class: string (nullable = true)
>
>  ||-- author_flair_text: string (nullable = true)
>
>  ||-- body: string (nullable = true)
>
>  ||-- body_html: string (nullable = true)
>
>  ||-- controversiality: long (nullable = true)
>
>  ||-- created: long (nullable = true)
>
>  ||-- created_utc: long (nullable = true)
>
>  ||-- distinguished: string (nullable = true)
>
>  ||-- downs: long (nullable = true)
>
>  ||-- edited: long (nullable = true)
>
>  ||-- gilded: long (nullable = true)
>
>  ||-- id: string (nullable = true)
>
>  ||-- link_author: string (nullable = true)
>
>  ||-- link_id: string (nullable = true)
>
>  ||-- link_title: string (nullable = true)
>
>  ||-- link_url: string (nullable = true)
>
>  ||-- name: string (nullable = true)
>
>  ||-- parent_id: string (nullable = true)
>
>  ||-- replies: string (nullable = true)
>
>  ||-- saved: boolean (nullable = true)
>
>  ||-- score: long (nullable = true)
>
>  ||-- score_hidden: boolean (nullable = true)
>
>  ||-- subreddit: string (nullable = true)
>
>  ||-- subreddit_id: string (nullable = true)
>
>  ||-- ups: long (nullable = true)
>
>
>
> scala> reddit_comment_df.show
>
> 15/07/27 20:38:31 INFO ScalaEsRowRDD: Reading from
> [reddit_comment_public-201507-v3/default]
>
> 15/07/27 20:38:31 INFO ScalaEsRowRDD: Discovered mapping
> {reddit_comment_public-201507-v3=[mappings=[default=[acquire_date=DATE,
> elasticsearch_date_partition_index=STRING,
> elasticsearch_language_partition_index=STRING, elasticsearch_type=STRING,
> source=[data=[archived=BOOLEAN, author=STRING,
> author_flair_css_class=STRING, author_flair_text=STRING, body=STRING,
> body_html=STRING, controversiality=LONG, created=LONG, created_utc=LONG,
> distinguished=STRING, downs=LONG, edited=LONG, gilded=LONG, id=STRING,
> link_author=STRING, link_id=STRING, link_title=STRING, link_url=STRING,
> name=STRING, parent_id=STRING, replies=STRING, saved=BOOLEAN, score=LONG,
> score_hidden=BOOLEAN, subreddit=STRING, subreddit_id=STRING, ups=LONG],
> kind=STRING], source_geo_location=GEO_POINT, source_id=STRING,
> source_language=STRING, source_time=DATE]]]} for
> [reddit_comment_public-201507-v3/default]
>
> 15/07/27 20:38:31 INFO SparkContext: Starting job: show at :26
>
> 15/07/27 20:38:31 INFO DAGScheduler: Got job 13 (show at :26)
> with 1 output partitions (allowLocal=false)
>
> 15/07/27 20:38:31 INFO DAGScheduler: Final stage: ResultStage 16(show at
> :26)
>
> 15/07/27 20:38:31 INFO DAGScheduler: Parents of final stage: List()
>
> 15/07/27 20:38:31 INFO DAGScheduler: Missing parents: List()
>
> 15/07/27 20:38:31 INFO DAGScheduler: Submitting ResultStage 16
> (MapPartitionsRDD[65] at show at :26), which has no missing parents
>
> 15/07/27 20:38:31 INFO MemoryStore: ensureFreeSpace(7520) called with
> curMem=71364, maxMem=2778778828
>
> 15/07/27 20:38:31 INFO MemoryStore: Block broadcast_13 stored as values in
> memory (estimated size 7.3 KB, free 2.6 GB)
>
> 15/07/27 20:38:31 INFO MemoryStore: ensureFreeSpace(3804) called with
> curMem=78884, maxMem=2778778828
>
> 15/07/27 20:38:31 INFO MemoryStore: Block broadcast_13_piece0 stored as
> bytes in memory (estimated size 3.7 KB, free 2.6 GB)
>
> 15/07/27 20:38:31 INFO BlockManagerInfo: Added broadcast_13_piece0 in
> memory on 172.25.185.239:58296 (size: 3.7 KB, free: 2.6 GB)
>
> 15/07/27 20:38:31 INFO SparkContext: Created broadcast 13 from broadcast
> at DAGScheduler.scala:874
>
> 15/07/27 20:38:31 INFO DAGScheduler: Submitting 1 missing tasks from
> ResultStage 16 (MapPartitionsRDD[65] at show at :26)
>
> 15/07/27 20:38:31 INFO TaskSchedulerImpl: Adding task set 16.0 with 1 tasks
>
> 15/07/27 20:38:31 INFO FairSchedulableBuilder: Added task set TaskSet_16
> tasks to pool default
>
> 15/07/27 20:38:31 INFO TaskSetManager: Starting task 0.0 in stage 16.0
> (TID 172, 172.25.185.164, ANY, 5085 bytes)
>
> 15/07/27 20:38:31 INFO BlockManagerInfo: Added broadcast_13_piece0 in
> memory on 172.25.185.164:50275 (size: 3.7 KB, free: 3.6 GB)
>
> 15/

RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

Thanks for the response. I was looking for a Java solution. I will check the
Scala and Python ones.
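
For reference, a hedged Java sketch of the same idea (assuming Spark SQL 1.3.1, where DataFrame exposes na() returning DataFrameNaFunctions; note this only replaces nulls that already exist in the DataFrame, it does not add missing fields to short rows):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.DataFrame;

// default values per column; "a", "b", "c" are the column names from the schema in this thread
Map<String, Object> defaults = new HashMap<String, Object>();
defaults.put("a", 0.0);
defaults.put("b", 0.0);
defaults.put("c", 0.0);
DataFrame filled = darDataFrame.na().fill(defaults);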

Regards,
Anand.C

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Tuesday, May 19, 2015 6:17 PM
To: Chandra Mohan, Ananda Vel Murugan
Cc: ayan guha; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more 
fields than contained in row

I believe you're looking for df.na.fill in Scala; in the pySpark module it is fillna
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

from the docs:

df4.fillna({'age': 50, 'name': 'unknown'}).show()

age height name

10  80 Alice

5   null   Bob

50  null   Tom

50  null   unknown

On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan
<ananda.muru...@honeywell.com> wrote:
Hi,

Thanks for the response. But I could not see a fillna function in the DataFrame class.



Is it available in some specific version of Spark SQL? This is what I have in
my pom.xml:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.3.1</version>
</dependency>

Regards,
Anand.C

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, May 18, 2015 5:19 PM
To: Chandra Mohan, Ananda Vel Murugan; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more 
fields than contained in row

Hi

Give a try with the dataFrame.fillna function to fill up the missing columns

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan
<ananda.muru...@honeywell.com> wrote:
Hi,

I am using spark-sql to read a CSV file and write it as parquet file. I am 
building the schema using the following code.

String schemaString = "a b c";
List<StructField> fields = new ArrayList<StructField>();
MetadataBuilder mb = new MetadataBuilder();
mb.putBoolean("nullable", true);
Metadata m = mb.build();
for (String fieldName : schemaString.split(" ")) {
    fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
}
StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input csv does not contain three columns. After building 
my JavaRDD, I create data frame as shown below using the RDD and schema.

DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);

Finally I try to save it as Parquet file

darDataFrame.saveAsParquetFile("/home/anand/output.parquet")

I get this error when saving it as Parquet file

java.lang.IndexOutOfBoundsException: Trying to write more fields than contained 
in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not
contain three elements, as some rows in my input csv do not contain three
columns. But while building the schema, I am specifying every field as
nullable. So I believe it should not throw this error. Can anyone help me fix
this error? Thank you.

Regards,
Anand.C





--
Best Regards,
Ayan Guha



Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Todd Nist
I believe you're looking for df.na.fill in Scala; in the pySpark module it is
fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

from the docs:

df4.fillna({'age': 50, 'name': 'unknown'}).show()
age height name
10  80     Alice
5   null   Bob
50  null   Tom
50  null   unknown
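
Since the write error actually comes from rows that are shorter than the schema (rather than from nulls), a hedged sketch of the complementary fix is to pad each parsed CSV line to the schema width before building the Row. The "lines" RDD, the comma-split logic, and the Java 8 lambda syntax are assumptions, not code from this thread:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

JavaRDD<Row> rowRDD = lines.map(line -> {
    String[] parts = line.split(",", -1);
    Object[] values = new Object[schema.fields().length];
    for (int i = 0; i < values.length; i++) {
        // pad missing or empty trailing columns with null so every Row matches the schema width
        values[i] = (i < parts.length && !parts[i].isEmpty()) ? Double.valueOf(parts[i]) : null;
    }
    return RowFactory.create(values);
});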


On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:

>  Hi,
>
>
>
> Thanks for the response. But I could not see fillna function in DataFrame
> class.
>
>
>
>
>
>
>
> Is it available in some specific version of Spark sql. This is what I have
> in my pom.xml
>
>
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql_2.10</artifactId>
>   <version>1.3.1</version>
> </dependency>
>
>
>
>
> Regards,
>
> Anand.C
>
>
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Monday, May 18, 2015 5:19 PM
> *To:* Chandra Mohan, Ananda Vel Murugan; user
> *Subject:* Re: Spark sql error while writing Parquet file- Trying to
> write more fields than contained in row
>
>
>
> Hi
>
>
>
> Give a try with the dataFrame.fillna function to fill up the missing columns
>
>
>
> Best
>
> Ayan
>
>
>
> On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <
> ananda.muru...@honeywell.com> wrote:
>
> Hi,
>
>
>
> I am using spark-sql to read a CSV file and write it as parquet file. I am
> building the schema using the following code.
>
>
>
> String schemaString = "a b c";
> List<StructField> fields = new ArrayList<StructField>();
> MetadataBuilder mb = new MetadataBuilder();
> mb.putBoolean("nullable", true);
> Metadata m = mb.build();
> for (String fieldName : schemaString.split(" ")) {
>     fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
> }
> StructType schema = DataTypes.createStructType(fields);
>
>
>
> Some of the rows in my input csv does not contain three columns. After
> building my JavaRDD, I create data frame as shown below using the
> RDD and schema.
>
>
>
> DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
>
>
> Finally I try to save it as Parquet file
>
>
>
> darDataFrame.saveAsParquetFile("/home/anand/output.parquet")
>
>
>
> I get this error when saving it as Parquet file
>
>
>
> java.lang.IndexOutOfBoundsException: Trying to write more fields than
> contained in row (3 > 2)
>
>
>
> I understand the reason behind this error. Some of the rows in my Row RDD do
> not contain three elements, as some rows in my input csv do not contain
> three columns. But while building the schema, I am specifying every field
> as nullable. So I believe it should not throw this error. Can anyone help
> me fix this error? Thank you.
>
>
>
> Regards,
>
> Anand.C
>
>
>
>
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>


RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

Thanks for the response. But I could not see a fillna function in the DataFrame class.



Is it available in some specific version of Spark SQL? This is what I have in
my pom.xml:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.3.1</version>
</dependency>

Regards,
Anand.C

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, May 18, 2015 5:19 PM
To: Chandra Mohan, Ananda Vel Murugan; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more 
fields than contained in row

Hi

Give a try with the dataFrame.fillna function to fill up the missing columns

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan
<ananda.muru...@honeywell.com> wrote:
Hi,

I am using spark-sql to read a CSV file and write it as parquet file. I am 
building the schema using the following code.

String schemaString = "a b c";
List<StructField> fields = new ArrayList<StructField>();
MetadataBuilder mb = new MetadataBuilder();
mb.putBoolean("nullable", true);
Metadata m = mb.build();
for (String fieldName : schemaString.split(" ")) {
    fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
}
StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input csv does not contain three columns. After building 
my JavaRDD, I create data frame as shown below using the RDD and schema.

DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);

Finally I try to save it as Parquet file

darDataFrame.saveAsParquetFile("/home/anand/output.parquet")

I get this error when saving it as Parquet file

java.lang.IndexOutOfBoundsException: Trying to write more fields than contained 
in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not
contain three elements, as some rows in my input csv do not contain three
columns. But while building the schema, I am specifying every field as
nullable. So I believe it should not throw this error. Can anyone help me fix
this error? Thank you.

Regards,
Anand.C





--
Best Regards,
Ayan Guha


Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread ayan guha
Hi

Give a try with the dataFrame.fillna function to fill up the missing columns

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:

>  Hi,
>
>
>
> I am using spark-sql to read a CSV file and write it as parquet file. I am
> building the schema using the following code.
>
>
>
> String schemaString = "a b c";
> List<StructField> fields = new ArrayList<StructField>();
> MetadataBuilder mb = new MetadataBuilder();
> mb.putBoolean("nullable", true);
> Metadata m = mb.build();
> for (String fieldName : schemaString.split(" ")) {
>     fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
> }
> StructType schema = DataTypes.createStructType(fields);
>
>
>
> Some of the rows in my input csv does not contain three columns. After
> building my JavaRDD, I create data frame as shown below using the
> RDD and schema.
>
>
>
> DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
>
>
> Finally I try to save it as Parquet file
>
>
>
> darDataFrame.saveAsParquetFile("/home/anand/output.parquet")
>
>
>
> I get this error when saving it as Parquet file
>
>
>
> java.lang.IndexOutOfBoundsException: Trying to write more fields than
> contained in row (3 > 2)
>
>
>
> I understand the reason behind this error. Some of the rows in my Row RDD do
> not contain three elements, as some rows in my input csv do not contain
> three columns. But while building the schema, I am specifying every field
> as nullable. So I believe it should not throw this error. Can anyone help
> me fix this error? Thank you.
>
>
>
> Regards,
>
> Anand.C
>
>
>
>
>



-- 
Best Regards,
Ayan Guha


Re: spark sql error with proto/parquet

2015-04-20 Thread Michael Armbrust
You are probably using an encoding that we don't support.  I think this PR
may be adding that support: https://github.com/apache/spark/pull/5422

On Sat, Apr 18, 2015 at 5:40 PM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:

> I have created a bunch of protobuf based parquet files that I want to
> read/inspect using Spark SQL. However, I am running into exceptions and not
> able to proceed much further:
>
> This succeeds (probably because there is no action yet). I
> can also printSchema() and count() without any issues:
>
> scala> val df = sqlContext.load("my_root_dir/201504101000",
> "parquet")
>
> scala> df.select(df("summary")).first
>
> 15/04/18 17:03:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 5.0 (TID 27, xxx.yyy.com): parquet.io.ParquetDecodingException: Can not
> read value at 0 in block -1 in file
> hdfs://xxx.yyy.com:8020/my_root_dir/201504101000/0.parquet
> at
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> at
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.io.ParquetDecodingException: The requested schema is
> not compatible with the file schema. incompatible types: optional group …
>
>
> I could convert my protos into json and then back to parquet, but that
> seems wasteful !
>
> Also, I will be happy to contribute and make protobuf work with Spark SQL
> if I can get some guidance/help/pointers. Help appreciated.
>
> -Abhishek-
>