Re: [SPARK-29898][SQL] Support Avro Custom Logical Types

2019-11-22 Thread Gengliang Wang
Hi Carlos,

To write Avro files with a schema different from the default mapping, you
can use the option "avroSchema":
 df.write.format("avro").option("avroSchema",
avroSchemaAsJSONStringFormat)...
See
https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion
for more details.
The function `to_avro` also supports customizing the output schema via its
last parameter, "jsonFormatSchema".
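
For example (a minimal sketch; the record schema and field names here are just
for illustration, and I'm assuming the Spark 3.0 location of `to_avro` in
`org.apache.spark.sql.avro.functions`):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.avro.functions.to_avro
    import org.apache.spark.sql.functions.struct

    val spark = SparkSession.builder().appName("avro-write").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")

    // An illustrative Avro schema overriding the default Catalyst -> Avro mapping.
    val avroSchema =
      """{
        |  "type": "record", "name": "User",
        |  "fields": [
        |    {"name": "id", "type": "long"},
        |    {"name": "name", "type": ["string", "null"]}
        |  ]
        |}""".stripMargin

    // Write Avro files using the customized output schema.
    df.write.format("avro").option("avroSchema", avroSchema).save("/tmp/users_avro")

    // to_avro takes the same JSON-format schema as its last parameter.
    val encoded = df.select(to_avro(struct($"id", $"name"), avroSchema).as("value"))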

To read Avro files with a customized Avro schema, you can also use the option
"avroSchema". To specify a customized DataFrame schema, you can use the general
data source method "spark.read.schema(..)..".
If a mapping from an Avro logical type to the DataFrame schema is missing (
https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion),
please update it in `SchemaConverters`.
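
Continuing the sketch above (the Catalyst field types are assumptions that
should match your data):

    import org.apache.spark.sql.types._

    // Option 1: read with a customized Avro schema (JSON string).
    val df1 = spark.read.format("avro")
      .option("avroSchema", avroSchema)  // the JSON string from the write sketch
      .load("/tmp/users_avro")

    // Option 2: read with a customized DataFrame schema via the generic API.
    val catalystSchema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val df2 = spark.read.schema(catalystSchema).format("avro").load("/tmp/users_avro")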

Hope this helps.

Thank you
Gengliang

On Fri, Nov 22, 2019 at 5:17 AM Carlos del Prado Mota wrote:

> Hi there,
>
> I recently proposed a change to add support for custom logical types for
> Avro in Spark. The change makes it possible to build custom type conversions
> between StructType and Avro, and it is fully compatible with the current
> solution. The link to the change is below, and I would really appreciate it
> if you could review it. I think that @Michael Armbrust is the best candidate
> to review this, but please redirect it to the appropriate developer if
> necessary.
>
> https://github.com/apache/spark/pull/26524
>
> Many thanks & regards,
> Carlos.
>


Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Sean Owen
I haven't been following this closely, but I'm aware that there are
some tricky compatibility problems between Avro and Parquet, both of
which are used in Spark. That has made it pretty hard to update in 2.x.
master/3.0 is on Parquet 1.10.1 and Avro 1.8.2. Just a general
question: is that the best combination going forward? Because the time to
update would be right about now for Spark 3. Backporting to 2.x is
pretty unlikely, though.

On Fri, Nov 22, 2019 at 12:45 PM Michael Heuer  wrote:
>
> Hello,
>
> I am sorry for asking a somewhat inappropriate question.
>
> For context, our projects depend on a fix that is in Parquet master but not yet
> released. Parquet 1.11.0 is in the release-candidate phase. It looks like we
> can't build against the Parquet 1.11.0 RC to include the fix and run successfully
> on Spark 2.4.x, which includes 1.10.1, without various classpath workarounds.
>
> I see now that Spark policy requires the Avro upgrade to wait until Spark
> 3.0, and since the Parquet 1.11.0 RC currently depends on Avro 1.9.1, it may
> also have to wait. I'll continue to think about this within the scope of the
> Parquet community.
>
> Thank you for the clarification,
>
>michael
>
>
> On Nov 22, 2019, at 12:07 PM, Dongjoon Hyun  wrote:
>
> Hi, Michael.
>
> I'm not sure Apache Spark is in a state close to what you want.
>
> First, both Apache Spark 3.0.0-preview and Apache Spark 2.4 are using Avro
> 1.8.2, as do the `master` and `branch-2.4` branches. Cutting new releases
> would not provide what you want.
>
> Do we have a PR on the master branch? Otherwise, before starting to discuss 
> the releases, could you make a PR first on the master branch? For Parquet, 
> it's the same.
>
> Second, we want to make Apache Spark 3.0.0 as compatible as possible. An
> incompatible change could be a reason for rejection, even on the `master`
> branch, for Apache Spark 3.0.0.
>
> Lastly, we may consider backporting once it lands on the `master` branch for 3.0.
> However, as Nan Zhu said, a dependency-upgrade backport PR is -1 by
> default. Usually, it's allowed only for serious cases such as security fixes
> or production outages.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Nov 22, 2019 at 9:00 AM Ryan Blue  wrote:
>>
>> Just to clarify, I don't think that Parquet 1.10.1 to 1.11.0 is a 
>> runtime-incompatible change. The example mixed 1.11.0 and 1.10.1 in the same 
>> execution.
>>
>> Michael, please be more careful about announcing compatibility problems in 
>> other communities. If you've observed problems, let's find out the root 
>> cause first.
>>
>> rb
>>
>> On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:
>>>
>>> Hello,
>>>
>>> Avro 1.8.2 to 1.9.1 is a binary-incompatible update, and it appears that
>>> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see the thread
>>> on dev@parquet).
>>>
>>> Might there be any desire to cut a Spark 2.4.5 release so that users can 
>>> pick up these changes independently of all the other changes in Spark 3.0?
>>>
>>> Thank you in advance,
>>>
>>>michael
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>




Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Michael Heuer
Hello,

I am sorry for asking a somewhat inappropriate question.

For context, our projects depend on a fix that is in Parquet master but not yet
released. Parquet 1.11.0 is in the release-candidate phase. It looks like we
can't build against the Parquet 1.11.0 RC to include the fix and run successfully
on Spark 2.4.x, which includes 1.10.1, without various classpath workarounds.

I see now that Spark policy requires the Avro upgrade to wait until Spark 3.0,
and since the Parquet 1.11.0 RC currently depends on Avro 1.9.1, it may also
have to wait. I'll continue to think about this within the scope of the Parquet
community.

Thank you for the clarification,

   michael


> On Nov 22, 2019, at 12:07 PM, Dongjoon Hyun  wrote:
> 
> Hi, Michael.
> 
> I'm not sure Apache Spark is in a state close to what you want.
>
> First, both Apache Spark 3.0.0-preview and Apache Spark 2.4 are using Avro
> 1.8.2, as do the `master` and `branch-2.4` branches. Cutting new releases
> would not provide what you want.
> 
> Do we have a PR on the master branch? Otherwise, before starting to discuss 
> the releases, could you make a PR first on the master branch? For Parquet, 
> it's the same.
> 
> Second, we want to make Apache Spark 3.0.0 as compatible as possible. An
> incompatible change could be a reason for rejection, even on the `master`
> branch, for Apache Spark 3.0.0.
> 
> Lastly, we may consider backporting once it lands on the `master` branch for 3.0.
> However, as Nan Zhu said, a dependency-upgrade backport PR is -1 by
> default. Usually, it's allowed only for serious cases such as security fixes
> or production outages.
> 
> Bests,
> Dongjoon.
> 
> 
> On Fri, Nov 22, 2019 at 9:00 AM Ryan Blue  wrote:
> Just to clarify, I don't think that Parquet 1.10.1 to 1.11.0 is a 
> runtime-incompatible change. The example mixed 1.11.0 and 1.10.1 in the same 
> execution.
> 
> Michael, please be more careful about announcing compatibility problems in 
> other communities. If you've observed problems, let's find out the root cause 
> first.
> 
> rb
> 
> On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer wrote:
> Hello,
> 
> Avro 1.8.2 to 1.9.1 is a binary-incompatible update, and it appears that
> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see the thread
> on dev@parquet).
> 
> Might there be any desire to cut a Spark 2.4.5 release so that users can pick 
> up these changes independently of all the other changes in Spark 3.0?
> 
> Thank you in advance,
> 
>michael
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-22 Thread Dongjoon Hyun
Thank you, Steve and all.

As a conclusion of this thread, we will merge the following PR and move
forward.

[SPARK-29981][BUILD] Add hive-1.2/2.3 profiles
https://github.com/apache/spark/pull/26619

Please leave your comments if you have any concern.
The following PRs, and more, will follow soon.

SPARK-29988 Adjust Jenkins jobs for hive-1.2/2.3 combination
SPARK-29989 Update release-script for hive-1.2/2.3 combination
SPARK-29991 Support hive-1.2/2.3 in PR Builder
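
Once merged, selecting the Hive dependency at build time would presumably look
something like this (a sketch only; the exact profile names and flags are
assumptions based on the PR title):

    # Build against the forked Hive 1.2 (legacy behavior):
    ./build/mvn -Phive -Phive-1.2 -DskipTests clean package

    # Build against Hive 2.3:
    ./build/mvn -Phive -Phive-2.3 -DskipTests clean package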

In this thread, we have been focusing only on the Hive dependency.
These changes become effective in Apache Spark 3.0.0 (or the next preview).
For Hadoop 3 and JDK 11, please follow up in the other threads.

Bests,
Dongjoon.


Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Dongjoon Hyun
Hi, Michael.

I'm not sure Apache Spark is in a state close to what you want.

First, both Apache Spark 3.0.0-preview and Apache Spark 2.4 are using Avro
1.8.2, as do the `master` and `branch-2.4` branches. Cutting new releases would
not provide what you want.

Do we have a PR on the master branch? Otherwise, before starting to discuss
the releases, could you make a PR first on the master branch? For Parquet,
it's the same.

Second, we want to make Apache Spark 3.0.0 as compatible as possible.
An incompatible change could be a reason for rejection, even on the `master`
branch, for Apache Spark 3.0.0.

Lastly, we may consider backporting once it lands on the `master` branch for 3.0.
However, as Nan Zhu said, a dependency-upgrade backport PR is -1 by
default. Usually, it's allowed only for serious cases such as security fixes
or production outages.

Bests,
Dongjoon.


On Fri, Nov 22, 2019 at 9:00 AM Ryan Blue  wrote:

> Just to clarify, I don't think that Parquet 1.10.1 to 1.11.0 is a
> runtime-incompatible change. The example mixed 1.11.0 and 1.10.1 in the
> same execution.
>
> Michael, please be more careful about announcing compatibility problems in
> other communities. If you've observed problems, let's find out the root
> cause first.
>
> rb
>
> On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:
>
>> Hello,
>>
>> Avro 1.8.2 to 1.9.1 is a binary-incompatible update, and it appears that
>> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see the thread
>> on dev@parquet).
>>
>> Might there be any desire to cut a Spark 2.4.5 release so that users can
>> pick up these changes independently of all the other changes in Spark 3.0?
>>
>> Thank you in advance,
>>
>>michael
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Ryan Blue
Just to clarify, I don't think that Parquet 1.10.1 to 1.11.0 is a
runtime-incompatible change. The example mixed 1.11.0 and 1.10.1 in the
same execution.

Michael, please be more careful about announcing compatibility problems in
other communities. If you've observed problems, let's find out the root
cause first.

rb

On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:

> Hello,
>
> Avro 1.8.2 to 1.9.1 is a binary-incompatible update, and it appears that
> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see the thread
> on dev@parquet).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
> Thank you in advance,
>
>michael
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Nan Zhu
I am not sure it is good practice to have breaking changes in
dependencies for maintenance releases.

On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:

> Hello,
>
> Avro 1.8.2 to 1.9.1 is a binary-incompatible update, and it appears that
> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see the thread
> on dev@parquet).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
> Thank you in advance,
>
>michael
>


Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Michael Heuer
Hello,

Avro 1.8.2 to 1.9.1 is a binary-incompatible update, and it appears that
Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see the thread on
dev@parquet).

Might there be any desire to cut a Spark 2.4.5 release so that users can pick 
up these changes independently of all the other changes in Spark 3.0?

Thank you in advance,

   michael

[SPARK-29898][SQL] Support Avro Custom Logical Types

2019-11-22 Thread Carlos del Prado Mota
Hi there, 

I recently proposed a change to add support for custom logical types for Avro
in Spark. The change makes it possible to build custom type conversions between
StructType and Avro, and it is fully compatible with the current solution. The
link to the change is below, and I would really appreciate it if you could
review it. I think that @Michael Armbrust is the best candidate to review this,
but please redirect it to the appropriate developer if necessary.

https://github.com/apache/spark/pull/26524 


Many thanks & regards,
Carlos.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-22 Thread Steve Loughran
On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian  wrote:

> Hey Steve,
>
> In terms of Maven artifacts, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All the other spark-* artifacts published to
> Maven Central are Hadoop-version-neutral.
>

It's more that everyone using it has to play the game of excluding all the
old artifacts and requesting the new dependencies, including working out
what the Spark POMs excluded from their imports of later versions of
things.
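
For illustration, here is the kind of exclusion game this means for a
downstream build, sketched in sbt (the artifact names and versions are
assumptions, not a recommendation):

    // build.sbt -- exclude the Hadoop artifacts that spark-core pulls in
    // transitively, then request the desired Hadoop version explicitly.
    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "2.4.4")
        .exclude("org.apache.hadoop", "hadoop-client"),
      "org.apache.hadoop" % "hadoop-client" % "3.2.1"
    )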


>
> Another issue with switching the default Hadoop version to 3.2 is the PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference between Hadoop
> 2.7 and 3.2 as the default Hadoop version. But just to minimize the release
> management work, in case we decide to publish the other spark-* Maven artifacts
> from a Hadoop 2.7 build, we can still special-case spark-hadoop-cloud and
> publish it using a hadoop-3.2 build.
>

That would really complicate life on Maven. Consistently publishing a version
on Maven Central with the 3.2 dependencies would be better.





Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-22 Thread Steve Loughran
On Thu, Nov 21, 2019 at 12:53 AM Dongjoon Hyun wrote:

> Thank you for the thoughtful clarification. I agree with all of your options.
>
> Especially for the Hive Metastore connection, the `Hive isolated client loader`
> is also important with Hive 2.3, because the Hive 2.3 client cannot talk to
> Hive 2.1 and lower. The `Hive isolated client loader` is one of the good
> designs in Apache Spark.
>
> One of the reasons I started this thread focusing on the fork is that we
> actually *don't use* that fork.
>
> https://mvnrepository.com/artifact/org.spark-project.hive/
>
> Big companies (and vendors) already maintain their own forks of that fork, or
> have upgraded their Hive dependency. So when we say it's battle-tested, that
> isn't really true. It's not tested.
>
>
I'm not up to date with the Cloudera fork. The last time I went near the
then-Hortonworks fork was for this: https://github.com/pwendell/hive/pull/2 ;
I think there were a couple of security patches too.

I don't think anyone would have added new features to the branch, but bug
fixes and security patches are inevitable.

> The above repository has become something like a stranded phantom. We point to
> that repo as a legacy interface, but we don't really use the code in large
> production environments. Since there is no way to contribute back to that repo,
> we also have a fragmentation problem in the experience with Hive 1.2.1: someone
> may say it's good, while others still struggle without any community support.
>
> Anyway, thank you so much for the conclusion.
> I'll try to make a JIRA and PR for the `hive-1.2` profile first, as a
> conclusion.
>

+1

