pyspark(sparksql-v 2.4) cannot read hive table which is created

2020-03-16 Thread dominic kim
I set the related Spark config values as below, but it does not work (it
succeeded in Spark 2.1.1):
spark.hive.mapred.supports.subdirectories=true
spark.hive.supports.subdirectories=true
spark.mapred.input.dir.recursive=true

When I query, I also set the related Hive config values, but it still does not work:
mapred.input.dir.recursive=true
hive.mapred.supports.subdirectories=true

I already know that if I load the path
'/user/test/warehouse/somedb.db/dt=20200312/*/' as a DataFrame in PySpark, it
works. But for complex business logic, I need to use spark.sql().

Please give me advice.
Thanks!

* Code
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Sub-Directory Test") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sql("select * from somedb.table where dt = '20200301' limit 10").show()

* Hive table directory path
/user/test/warehouse/somedb.db/dt=20200312/1/00_0
/user/test/warehouse/somedb.db/dt=20200312/1/00_1
.
.
/user/test/warehouse/somedb.db/dt=20200312/2/00_0
/user/test/warehouse/somedb.db/dt=20200312/3/00_0
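
For reference, the direct-path load mentioned above might look like this; it assumes
the underlying files are ORC (the fix later in this thread points that way) and reuses
the `spark` session from the code above:

    # Read the whole partition, including its subdirectories, directly as a DataFrame,
    # bypassing the Hive table definition.
    df = spark.read.orc("/user/test/warehouse/somedb.db/dt=20200312/*/")
    df.show(10)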






Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value
silently? Then I'm wondering why we reverted the `TRIM` functions.

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta 
wrote:

> Hi,
>
> 100% agree with Reynold.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin  wrote:
>
>> Are we sure "not padding" is "incorrect"?
>>
>> I don't know whether ANSI SQL actually requires padding, but plenty of
>> databases don't actually pad.
>>
>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>> "Snowflake currently deviates from common CHAR semantics in that strings
>> shorter than the maximum length are not space-padded at the end."
>>
>> MySQL:
>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>
>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Reynold.
>>>
>>> Please see the following for the context.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31136
>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax"
>>>
>>> I raised the above issue according to the new rubric, and the banning
>>> was the proposed alternative to reduce the potential issue.
>>>
>>> Please give us your opinion since it's still a PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin  wrote:
>>>
 I don’t understand this change. Wouldn’t this “ban” confuse the hell
 out of both new and old users?

 For old users, their old code that was working for char(3) would now
 stop working.

 For new users, depending on whether the underlying metastore char(3) is
 either supported but different from ansi Sql (which is not that big of a
 deal if we explain it) or not supported.

 On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Apache Spark has suffered from a known consistency issue on
> `CHAR` type behavior among its usages and configurations. However, the
> evolution direction has been gradually moving toward consistency
> inside Apache Spark because we don't have `CHAR` officially. The following
> is the summary.
>
> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` has the following different
> result.
> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
> Hive behavior.)
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 2.4.0, `STORED AS ORC` became consistent.
> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to
> Hive behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
> became consistent.
> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
> fallback to Hive behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a 2
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
> the following syntax to be safe.
>
> CREATE TABLE t(a CHAR(3));
> https://github.com/apache/spark/pull/27902
>
> This email is sent out to inform you, based on the new policy we voted on.
> The recommendation is to always use Apache Spark's native type `String`.
>
> Bests,
> Dongjoon.
>
> References:
> 1. "CHAR implementation?", 2017/09/15
>
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
> TABLE syntax", 2019/12/06
>
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>

>>
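
For readers who want to reproduce the comparison, a minimal PySpark sketch of the same
length check (the DDL and data are taken from the quoted summary; whether CHAR(3) is
padded, or even accepted, depends on the Spark version and the convertMetastore* settings):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("char-padding-check") \
        .enableHiveSupport() \
        .getOrCreate()

    spark.sql("CREATE TABLE t1(a CHAR(3))")
    spark.sql("CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET")
    spark.sql("INSERT INTO TABLE t1 SELECT 'a '")
    spark.sql("INSERT INTO TABLE t3 SELECT 'a '")

    # length(a) comes back as 3 (padded) or 2 (not padded) depending on the
    # version ranges summarized above.
    spark.sql("SELECT a, length(a) FROM t1").show()
    spark.sql("SELECT a, length(a) FROM t3").show()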


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I haven't spent enough time thinking about it to give a strong opinion, but 
this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently
swapped the two arguments. And trim has also been quite commonly used for a
long time.

CHAR is an undocumented data type without clearly defined semantics. It's not
great that we are changing the value here, but the value is already fucked up.
It depends on the underlying data source, and random, seemingly unrelated
configs (orc) impact the value.

On Mon, Mar 16, 2020 at 4:01 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there,

I’m kind of new around here, but I have had experience with all of the so-called
“big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server, as well
as PostgreSQL.

They all support the notion of “ANSI padding” for CHAR columns - which means 
that such columns are always space padded, and they default to having this 
enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical 
reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and 
then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to
VARCHAR - I believe that was its purpose in the first place, and it does not
dictate any sort of “padding”.

I can see why you might “ban” the use of CHAR columns where they cannot be 
consistently supported, but VARCHAR is a different animal and I would expect it 
to work consistently everywhere.


Cheers,

Steve C

On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Thank you, Stephen and Reynold.

To Reynold.

The way I see the following is a little different.

  > CHAR is an undocumented data type without clearly defined semantics.

Let me describe it from an Apache Spark user's point of view.

Apache Spark started to claim `HiveContext` (and the `hql/hiveql` function) at
Apache Spark 1.x without much documentation. In addition, there still exists an
effort trying to keep it alive in the 3.0.0 era.

   https://issues.apache.org/jira/browse/SPARK-31088
   Add back HiveContext and createExternalTable

Historically, we tried to make many SQL-based customers migrate their
workloads from Apache Hive into Apache Spark through `HiveContext`.

Although Apache Spark didn't have good documentation about the inconsistent
behavior among its data sources, Apache Hive has been providing its
documentation, and many customers rely on that behavior.

  -
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

At that time, especially in on-prem Hadoop clusters from well-known vendors,
many existing huge tables were created by Apache Hive, not Apache Spark,
and Apache Spark was used for boosting SQL performance with its *caching*.
This was true because Apache Spark was added into the Hadoop vendors'
products later than Apache Hive.

Until the turning point at Apache Spark 2.0, we tried to catch up on more
features to be consistent at least for Hive tables between Apache Hive and
Apache Spark, because the two SQL engines share the same tables.

For the following, technically, while Apache Hive hasn't changed its
existing behavior in this area, Apache Spark has inevitably evolved by moving
away from its original old behaviors one by one.

  >  the value is already fucked up

The following is the change log.

  - When we switched the default value of `convertMetastoreParquet`.
(at Apache Spark 1.2)
  - When we switched the default value of `convertMetastoreOrc` (at
Apache Spark 2.4)
  - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
`PARQUET` table at Apache Spark 3.0)

To sum up, this has been a well-known issue in the community and among the
customers.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy  wrote:


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I looked up our usage logs (sorry I can't share this publicly) and trim has at 
least four orders of magnitude higher usage than char.

On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was 
merely pointing out that if we deviate away from SQL standard in any way we are 
considered "wrong" or "incorrect". That argument itself is flawed when plenty 
of other popular database systems also deviate away from the standard on this 
specific behavior.

On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < r...@databricks.com > wrote:


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Ur, are you comparing the number of SELECT statements with TRIM and CREATE
statements with `CHAR`?

> I looked up our usage logs (sorry I can't share this publicly) and trim
has at least four orders of magnitude higher usage than char.

We need to discuss more about what to do. This thread is exactly what I
expected. :)

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
I was merely pointing out that if we deviate away from SQL standard in any
way we are considered "wrong" or "incorrect". That argument itself is
flawed when plenty of other popular database systems also deviate away from
the standard on this specific behavior.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin  wrote:


Re: [PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-16 Thread Zhang Victor
Maybe set spark.hadoop.validateOutputSpecs=false?
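
A sketch of how that setting could be applied when building the session; the app name
is a placeholder, and note that this only disables Hadoop's existing-directory check
rather than providing true append semantics:

    from pyspark.sql import SparkSession

    # Turn off the "output directory already exists" validation used by
    # saveAsHadoopFile / saveAsNewAPIHadoopFile.
    spark = SparkSession.builder \
        .appName("hfile-append-job") \
        .config("spark.hadoop.validateOutputSpecs", "false") \
        .getOrCreate()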


From: Gautham Acharya
Sent: March 15, 2020, 3:23
To: user@spark.apache.org
Subject: [PySpark] How to write HFiles as an 'append' to the same directory?


I have a process in Apache Spark that attempts to write HFiles to S3 in a
batched process. I want the resulting HFiles in the same directory, as they are
in the same column family. However, I’m getting a ‘directory already exists’
error when I try to run this on AWS EMR. How can I write HFiles via Spark as
an ‘append’, like I can do via a CSV?



The batch writing function looks like this:

for col_group in split_cols:
    processed_chunk = batch_write_pandas_udf_for_col_aggregation(
        joined_dataframe, col_group, pandas_udf_func, group_by_args)

    hfile_writer.write_hfiles(processed_chunk, output_path,
                              zookeeper_ip, table_name,
                              constants.DEFAULT_COL_FAMILY)



The actual function to write the Hfiles is this:

rdd.saveAsNewAPIHadoopFile(output_path,
                           constants.OUTPUT_FORMAT_CLASS,
                           keyClass=constants.KEY_CLASS,
                           valueClass=constants.VALUE_CLASS,
                           keyConverter=constants.KEY_CONVERTER,
                           valueConverter=constants.VALUE_CONVERTER,
                           conf=conf)



The exception I’m getting:



Called with arguments: Namespace(job_args=['matrix_path=/tmp/matrix.csv', 
'metadata_path=/tmp/metadata.csv', 
'output_path=s3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles',
 'group_by_args=cluster_id', 'zookeeper_ip=ip-172-30-5-36.ec2.internal', 
'table_name=test_wide_matrix__8c6b2f09-2ba0-42fc-af8e-656188c0186a'], 
job_name='matrix_transformations')

job_args_tuples: [['matrix_path', '/tmp/matrix.csv'], ['metadata_path', 
'/tmp/metadata.csv'], ['output_path', 
's3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles'],
 ['group_by_args', 'cluster_id'], ['zookeeper_ip', 
'ip-172-30-5-36.ec2.internal'], ['table_name', 
'test_wide_matrix__8c6b2f09-2ba0-42fc-af8e-656188c0186a']]

Traceback (most recent call last):

  File "/mnt/var/lib/hadoop/steps/s-2ZIOR335HH9TR/main.py", line 56, in 

job_module.transform(spark, **job_args)

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/jobs/matrix_transformations/job.py",
 line 93, in transform

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/jobs/matrix_transformations/job.py",
 line 73, in write_split_columnwise_transform

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/output_handler/hfile_writer.py",
 line 44, in write_hfiles

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1438, in 
saveAsNewAPIHadoopFile

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 1257, in __call__

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
in deco

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.

: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
s3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles/median
 already exists

at 
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)

at 
org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:393)

at 
org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)

at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1000)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:991)

   

Re: pyspark(sparksql-v 2.4) cannot read hive table which is created

2020-03-16 Thread dominic kim
I solved the problem with the options below:
spark.sql("SET spark.hadoop.metastore.catalog.default=hive")
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")









Re: Optimising multiple hive table join and query in spark

2020-03-16 Thread Manjunath Shetty H
Thanks Georg,

The batch import job frequency is different from the read job: the import job will run
every 15 minutes to 1 hour, and the read/transform job will run once a day.

In this case I think writing with sortWithinPartitions doesn't make any difference,
as the combined data stored in HDFS will not be sorted at the end of the day.

Does partitioning/sorting while reading help? I tried this out, but it still results
in a shuffle during the join of multiple tables and generates a very complex DAG.

-
Manjunath

From: Georg Heiler 
Sent: Monday, March 16, 2020 12:06 PM
To: Manjunath Shetty H 
Subject: Re: Optimising multiple hive table join and query in spark

Hi,

If you only have 1.6, forget bucketing.
https://databricks.com/session/bucketing-in-spark-sql-2-3 - that only works well
with Hive from 2.3 onwards.
The other thing, in your (daily?) batch job:

val myData = spark.read.<>(/path/to/file).transform(<>)

Now when writing the data:
myData.write.repartition(xxx)
where xxx resembles the number of files you want to have for each period
(day?). When writing ORC / Parquet, make sure to have files of HDFS block size
or more, i.e. usually 128 MB up to a maximum of 1 GB.
myData.write.repartition(xxx).sortWithinPartitions(join_col, other_join_col)

Apply a secondary sort to get ORC indices.

If the cardinality of the join_cols is high enough:
myData.write.repartition(xxx, col(join_col), col(other_join_col)).sortWithinPartitions(join_col, other_join_col)

Best,
Georg
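
A minimal PySpark sketch of the write pattern described above (the input path, output
path, column names and file count are placeholders; repartition-by-expression and
sortWithinPartitions are available from Spark 1.6 onward):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("daily-batch-write").getOrCreate()

    # Placeholder source and transformation.
    my_data = spark.read.orc("/path/to/file")

    # Number of files per period; size them so each is >= the HDFS block size.
    num_files = 16

    (my_data
        .repartition(num_files, col("join_col"), col("other_join_col"))
        .sortWithinPartitions("join_col", "other_join_col")  # secondary sort for ORC indices
        .write
        .mode("overwrite")
        .orc("/path/to/output"))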

On Mon, 16 Mar 2020 at 04:27, Manjunath Shetty H <manjunathshe...@live.com> wrote:
Hi Georg,

Thanks for the suggestion. Can you please explain a bit more about what you meant
exactly?

BTW, I am on Spark 1.6.


-
Manjunath

From: Georg Heiler <georg.kf.hei...@gmail.com>
Sent: Monday, March 16, 2020 12:35 AM
To: Manjunath Shetty H <manjunathshe...@live.com>
Subject: Re: Optimising multiple hive table join and query in spark

To speed things up:
- sortWithinPartitions (i.e. for each day) & potentially repartition
- pre-shuffle the data with bucketing
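
A rough sketch of the bucketing idea in the second bullet (this needs the
DataFrameWriter bucketBy/saveAsTable API, roughly Spark 2.3+ in PySpark, so it does
not apply to the 1.6 setup mentioned elsewhere in this thread; the table, bucket count
and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("bucketing-sketch") \
        .enableHiveSupport() \
        .getOrCreate()

    df_a = spark.table("somedb.table_a")  # placeholder source table

    # Bucket and sort by the join key at write time so that a later join on that
    # key can avoid a full shuffle; bucketBy only works with saveAsTable.
    (df_a.write
        .bucketBy(64, "join_col")
        .sortBy("join_col")
        .mode("overwrite")
        .saveAsTable("somedb.table_a_bucketed"))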

On Sun, 15 Mar 2020 at 17:07, Manjunath Shetty H <manjunathshe...@live.com> wrote:
Only partitioned, and the join keys are not sorted because those are written
incrementally with batch jobs.

From: Georg Heiler <georg.kf.hei...@gmail.com>
Sent: Sunday, March 15, 2020 8:30:53 PM
To: Manjunath Shetty H <manjunathshe...@live.com>
Cc: ayan guha <guha.a...@gmail.com>; Magnus Nilsson <ma...@kth.se>; user <user@spark.apache.org>
Subject: Re: Optimising multiple hive table join and query in spark

Did you only partition, or did you also bucket by the join column? Are ORC indices
active, i.e. are the JOIN keys sorted when writing the files?

Best,
Georg

On Sun, 15 Mar 2020 at 15:52, Manjunath Shetty H <manjunathshe...@live.com> wrote:
Mostly the concern is the reshuffle. Even though all the DFs are partitioned by the
same column, during the join it does reshuffle; that is the bottleneck as of now in
our POC implementation.

Is there any way to tell Spark to keep all partitions with the same partition key in
the same place, so that during the join it won't shuffle again?


-
Manjunath

From: ayan guha <guha.a...@gmail.com>
Sent: Sunday, March 15, 2020 5:46 PM
To: Magnus Nilsson <ma...@kth.se>
Cc: user <user@spark.apache.org>
Subject: Re: Optimising multiple hive table join and query in spark

Hi

I would first and foremost try to identify where most of the time is spent during
the query. One possibility is that it just takes ramp-up time for executors to become
available; if that's the case then maybe a dedicated YARN queue may help, or using
the Spark Thrift server may help.

On Sun, Mar 15, 2020 at 11:02 PM Magnus Nilsson <ma...@kth.se> wrote:
Been a while, but I remember reading on Stack Overflow that you can use a UDF as a
join condition to trick Catalyst into not reshuffling the partitions, i.e. use
regular equality on the column you partitioned or bucketed by and your custom
comparer for the other columns. Never got around to trying it out though. I really
would like a native way to tell Catalyst not to reshuffle just because you use
more columns in the join condition.
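
A rough, untested PySpark sketch of the trick Magnus describes (placeholder table and
column names; as he notes, whether Catalyst actually avoids the reshuffle has not been
verified):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("udf-join-sketch").getOrCreate()

    df_a = spark.table("somedb.table_a")  # placeholder tables, both partitioned by dt
    df_b = spark.table("somedb.table_b")

    # Plain equality on the partition column, plus a UDF ("custom comparer")
    # for the remaining join column.
    same_key = udf(lambda a, b: a == b, BooleanType())

    joined = df_a.join(
        df_b,
        (df_a["dt"] == df_b["dt"]) & same_key(df_a["other_col"], df_b["other_col"]),
        "inner",
    )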

On Sun, Mar 15, 2020 at 6:04 AM Manjunath Shetty H <manjunathshe...@live.com> wrote:
Hi All,

We have 10 tables in a data warehouse (HDFS/Hive) written using the ORC format.
We are serving a use case on top of that by joining 4-5 tables using Hive as of
now. But it is not as fast as we wanted it to be, so we are thinking of using
Spark for this use case.

Any suggestions on this? Is it a good idea to use Spark for this use case? Can
we get better performance by using Spark?

Any pointers would be helpful.

Notes:

  *   Data is partitioned by date (MMdd) as an integer.
  *   The query will fetch data for the last 7 days from some tables while joining with
other tables.

Approach w

Re: [PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-16 Thread Stephen Coy
I encountered a similar problem when trying to:

ds.write().save(“s3a://some-bucket/some/path/table”);

which writes the content as a bunch of parquet files in the “folder” named 
“table”.

I am using a Flintrock cluster with the Spark 3.0 preview FWIW.

Anyway, I just used the AWS SDK to remove it (and any “subdirectories”) before
kicking off the Spark machinery.

I can show you how to do this in Java, but I think the Python SDK may be
significantly different.

Steve C
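
For reference, a boto3 sketch of the same idea in Python (delete everything under the
target prefix before the Spark write); the bucket and prefix names are placeholders,
and whether deleting is acceptable depends on the pipeline:

    import boto3

    def delete_s3_prefix(bucket_name, prefix):
        """Remove all objects under the given prefix so the Spark job can recreate it."""
        s3 = boto3.resource("s3")
        s3.Bucket(bucket_name).objects.filter(Prefix=prefix).delete()

    # Hypothetical usage before ds.write().save(...) / saveAsNewAPIHadoopFile(...)
    delete_s3_prefix("some-bucket", "some/path/table/")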


On 15 Mar 2020, at 6:23 am, Gautham Acharya <gauth...@alleninstitute.org> wrote:
