RE: read image or binary files / spark 2.3

2019-09-05 Thread Ilya Matiach
Hi Peter,
You can use the spark.readImages API in Spark 2.3 for reading images:

https://databricks.com/blog/2018/12/10/introducing-built-in-image-data-source-in-apache-spark-2-4.html
https://blogs.technet.microsoft.com/machinelearning/2018/03/05/image-data-support-in-apache-spark/

https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.image.ImageSchema$

There’s also a Spark package for Spark versions older than 2.3:
https://github.com/Microsoft/spark-images
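
A minimal sketch of how this could look with the built-in 2.3 API (the path
below is just a placeholder; this assumes an active SparkSession, e.g. in
spark-shell):

  import org.apache.spark.ml.image.ImageSchema

  // Load a directory of images into a DataFrame with the standard image schema
  // ("image" struct column with origin, height, width, nChannels, mode, data).
  val images = ImageSchema.readImages("hdfs:///path/to/images")  // placeholder path
  images.printSchema()
  images.select("image.origin", "image.width", "image.height").show(truncate = false)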

Thank you, Ilya




From: Peter Liu 
Sent: Thursday, September 5, 2019 2:13 PM
To: dev ; User 
Subject: Re: read image or binary files / spark 2.3

Hello experts,

I have a quick question: which API allows me to read image files or binary files
(for SparkSession.readStream) from a local/Hadoop file system in Spark 2.3?

I have been browsing the following documentation and googling for it, but
didn't find a good example:

https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.package

any hint/help would be very much appreciated!

thanks!

Peter


Re: read image or binary files / spark 2.3

2019-09-05 Thread Peter Liu
Hello experts,

I have a quick question: which API allows me to read image files or binary
files (for SparkSession.readStream) from a local/Hadoop file system in
Spark 2.3?

I have been browsing the following documentation and googling for it, but
didn't find a good example:

https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.package

any hint/help would be very much appreciated!

thanks!

Peter


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Reynold Xin
Having three modes is a lot. Why not just use ANSI mode as the default, and legacy
for backward compatibility? Then over time there's only the ANSI mode, which is
standard-compliant and easy to understand. We also don't need to invent a
standard just for Spark.

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> +1
> 
> 
> To be honest I don't like the legacy policy. It's too loose and easy for
> users to make mistakes, especially when Spark returns null if a function
> hits errors like overflow.
> 
> 
> The strict policy is not good either. It's too strict and stops valid use
> cases like writing timestamp values to a date type column. Users do expect
> truncation to happen without adding a cast manually in this case. It's also
> weird to use a Spark-specific policy that no other database is using.
> 
> 
> The ANSI policy is better. It stops invalid use cases like writing string
> values to an int type column, while keeping valid use cases like timestamp
> -> date.
> 
> 
> I think there's no doubt that we should use the ANSI policy instead of the
> legacy policy for v1 tables. Except for backward compatibility, the ANSI
> policy is simply better than the legacy policy.
> 
> 
> The v2 table case is debatable. Although the ANSI policy is better than the
> strict policy to me, this is just the store assignment policy, which only
> partially controls the table insertion behavior. With Spark's "return null
> on error" behavior, the table insertion is more likely to insert invalid
> null values with the ANSI policy compared to the strict policy.
> 
> 
> I think we should use ANSI policy by default for both v1 and v2 tables,
> because
> 1. End-users don't care how the table is implemented. Spark should provide
> consistent table insertion behavior between v1 and v2 tables.
> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
> compatibility issue. That said, the baseline to judge which policy is
> better should be the table insertion behavior in Spark 2.x, which is the
> legacy policy + "return null on error". ANSI policy is better than the
> baseline.
> 3. We expect more and more users to migrate their data sources to the V2
> API. The strict policy can be a stopper, as it is too big a breaking change,
> which may break many existing queries.
> 
> 
> Thanks,
> Wenchen 
> 
> 
> 
> 
> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang < gengliang.w...@databricks.com >
> wrote:
> 
> 
>> Hi everyone,
>> 
>> I'd like to call for a vote on SPARK-28885 (
>> https://issues.apache.org/jira/browse/SPARK-28885 ) "Follow ANSI store
>> assignment rules in table insertion by default".  
>> When inserting a value into a column with a different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the type
>> coercion rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice,
>> the behavior is mostly the same as PostgreSQL. It disallows certain
>> unreasonable type conversions such as converting `string` to `int` and
>> `double` to `boolean`.
>> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`,
>> which is very loose. E.g., converting either `string` to `int` or `double`
>> to `boolean` is allowed. It is the current behavior in Spark 2.x for
>> compatibility with Hive.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in type coercion, e.g., converting either `double` to `int` or
>> `decimal` to `double` is not allowed. The rules were originally for the
>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>> by default.
>> 
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both
>> V1 and V2 in Spark 3.0.
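
A minimal sketch of how the proposed option could be exercised (assuming a
build where SPARK-28885 has landed; the table name below is made up):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("store-assignment-sketch").getOrCreate()
  spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")  // or "LEGACY" / "STRICT"

  spark.sql("CREATE TABLE t (i INT) USING parquet")

  // ANSI:   rejected at analysis time, since a string cannot be stored into an int column.
  // LEGACY: allowed; a non-numeric string silently becomes null at runtime.
  // STRICT: rejected as well, because the cast could lose information.
  spark.sql("INSERT INTO t VALUES ('abc')")
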
>> 
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in
>> the dev mailing list.
>> 
>> This vote is open until next Thursday (Sept. 12th).
>> 
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>> 
>> Thank you!
>> 
>> Gengliang
>> 
>> 
> 
>

Re: Why two netty libs?

2019-09-05 Thread Jacek Laskowski
Hi,

Thanks much for the answers. Learning Spark every day!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Wed, Sep 4, 2019 at 3:15 PM Sean Owen  wrote:

> Yes that's right. I don't think Spark's usage of ZK needs any ZK
> server, so it's safe to exclude in Spark (at least, so far so good!)
>
> On Wed, Sep 4, 2019 at 8:06 AM Steve Loughran
>  wrote:
> >
> > Zookeeper client is/was netty 3, AFAIK, so if you want to use it for
> anything, it ends up on the CP
> >
> > On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
> >>
> >> Yep, historical reasons. And Netty 4 is under another namespace, so we
> can use Netty 3 and Netty 4 in the same JVM.
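
A tiny illustration of the namespace point (not Spark code; just the two
packages side by side in one JVM):

  // Netty 3.x classes live under org.jboss.netty, Netty 4.x under io.netty,
  // so both jars can sit on the same classpath without clashing.
  import org.jboss.netty.buffer.{ChannelBuffers => Netty3Buffers}
  import io.netty.buffer.{Unpooled => Netty4Unpooled}

  val b3 = Netty3Buffers.wrappedBuffer("netty 3".getBytes("UTF-8"))
  val b4 = Netty4Unpooled.wrappedBuffer("netty 4".getBytes("UTF-8"))
  println(s"readable bytes: netty3=${b3.readableBytes()}, netty4=${b4.readableBytes()}")
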
> >>
> >> On Tue, Sep 3, 2019 at 6:15 AM Sean Owen  wrote:
> >>>
> >>> It was for historical reasons; some other transitive dependencies
> needed it.
> >>> I actually was just able to exclude Netty 3 last week from master.
> >>> Spark uses Netty 4.
> >>>
> >>> On Tue, Sep 3, 2019 at 6:59 AM Jacek Laskowski 
> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > Just noticed that Spark 2.4.x uses two netty deps of different
> versions. Why?
> >>> >
> >>> > jars/netty-all-4.1.17.Final.jar
> >>> > jars/netty-3.9.9.Final.jar
> >>> >
> >>> > Shouldn't one be excluded or perhaps shaded?
> >>> >
> >>> > Pozdrawiam,
> >>> > Jacek Laskowski
> >>> > 
> >>> > https://about.me/JacekLaskowski
> >>> > The Internals of Spark SQL https://bit.ly/spark-sql-internals
> >>> > The Internals of Spark Structured Streaming
> https://bit.ly/spark-structured-streaming
> >>> > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> >>> > Follow me at https://twitter.com/jaceklaskowski
> >>> >
> >>>
> >>>
> >> --
> >>
> >> Best Regards,
> >>
> >> Ryan
>
>
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Wenchen Fan
+1

To be honest I don't like the legacy policy. It's too loose and easy for
users to make mistakes, especially when Spark returns null if a function
hits errors like overflow.

The strict policy is not good either. It's too strict and stops valid use
cases like writing timestamp values to a date type column. Users do expect
truncation to happen without adding a cast manually in this case. It's also
weird to use a Spark-specific policy that no other database is using.

The ANSI policy is better. It stops invalid use cases like writing string
values to an int type column, while keeping valid use cases like timestamp
-> date.
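
For instance, a rough sketch of the timestamp -> date case (the table name is
made up; assumes the proposed spark.sql.storeAssignmentPolicy option and an
existing SparkSession named spark):

  spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
  spark.sql("CREATE TABLE events (d DATE) USING parquet")

  // ANSI:   allowed; the timestamp value is truncated to its date part.
  // STRICT: rejected, because the conversion can drop the time-of-day information.
  spark.sql("INSERT INTO events VALUES (TIMESTAMP '2019-09-05 14:13:00')")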

I think there's no doubt that we should use the ANSI policy instead of the
legacy policy for v1 tables. Except for backward compatibility, the ANSI
policy is simply better than the legacy policy.

The v2 table case is debatable. Although the ANSI policy is better than the
strict policy to me, this is just the store assignment policy, which only
partially controls the table insertion behavior. With Spark's "return null
on error" behavior, the table insertion is more likely to insert invalid
null values with the ANSI policy compared to the strict policy.

I think we should use ANSI policy by default for both v1 and v2 tables,
because
1. End-users don't care how the table is implemented. Spark should provide
consistent table insertion behavior between v1 and v2 tables.
2. Data Source V2 is unstable in Spark 2.x so there is no backward
compatibility issue. That said, the baseline to judge which policy is
better should be the table insertion behavior in Spark 2.x, which is the
legacy policy + "return null on error". ANSI policy is better than the
baseline.
3. We expect more and more users to migrate their data sources to the V2
API. The strict policy can be a stopper, as it is too big a breaking change,
which may break many existing queries.

Thanks,
Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a vote on SPARK-28885
> ( https://issues.apache.org/jira/browse/SPARK-28885 ) "Follow ANSI store
> assignment rules in table insertion by default".
> When inserting a value into a column with a different data type, Spark 
> performs type coercion. Currently, we support 3 policies for the type 
> coercion rules: ANSI, legacy and strict, which can be set via the option 
> "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the 
> behavior is mostly the same as PostgreSQL. It disallows certain unreasonable 
> type conversions such as converting `string` to `int` and `double` to 
> `boolean`.
> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, 
> which is very loose. E.g., converting either `string` to `int` or `double` to 
> `boolean` is allowed. It is the current behavior in Spark 2.x for 
> compatibility with Hive.
> 3. Strict: Spark doesn't allow any possible precision loss or data truncation 
> in type coercion, e.g., converting either `double` to `int` or `decimal` to 
> `double` is not allowed. The rules were originally for the Dataset encoder. As 
> far as I know, no mainstream DBMS is using this policy by default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2 uses 
> "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 
> in Spark 3.0.
>
> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the 
> dev mailing list.
>
> This vote is open until next Thursday (Sept. 12th).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>
>