Re: Honor ParseMode in AvroFileFormat

2019-03-07 Thread Gengliang
Hi Tim,

I think you can try setting the option `spark.sql.files.ignoreCorruptFiles` to
`true`. With the option enabled, Spark jobs will continue to run when
encountering corrupted files, and the contents that have already been read will
still be returned.
The CSV/JSON data sources support the PERMISSIVE mode when reading files
because users may still want partial row results.
When reading corrupted Avro files, I think skipping the rest of the files is
enough if users want to ignore them.
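Something along these lines should do it (untested sketch; it assumes a
SparkSession named `spark`, the spark-avro package on the classpath, and a
made-up input path):

```
# Untested sketch: skip corrupted Avro files instead of failing the whole job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# format("avro") needs the spark-avro package; the path below is made up.
df = spark.read.format("avro").load("/path/to/avro/dir")
df.count()  # completes even if some files are corrupt; their contents are skipped
```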
For processing data with the function `from_avro`, I have created a PR to
support PERMISSIVE/FAILFAST modes:
https://github.com/apache/spark/pull/22814

Gengliang


On Fri, Mar 8, 2019 at 6:25 AM tim  wrote:

> /facepalm
>
> Here we go: https://issues.apache.org/jira/browse/SPARK-27093
>
> Tim
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SQL] hash: 64-bits and seeding

2019-03-07 Thread Huon.Wilson
Thanks for the guidance. That was my initial inclination, but I decided that 
consistency with the existing ‘hash’ was better. However, like you, I also 
prefer the specific form.

I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the 
patch (using ‘xxhash64’) at https://github.com/apache/spark/pull/24019.

- Huon

From: Reynold Xin 
Date: Thursday, 7 March 2019 at 6:33 pm
To: "Wilson, Huon (Data61, Eveleigh ATP)" 
Cc: "dev@spark.apache.org" 
Subject: Re: [SQL] hash: 64-bits and seeding


Rather than calling it hash64, it'd be better to just call it xxhash64. The 
reason being that, ten years from now, we would probably look back and laugh at a 
specific hash implementation. It'd be better to just name the expression what 
it is.


On Wed, Mar 06, 2019 at 7:59 PM, huon.wil...@data61.csiro.au wrote:

Hi,

I’m working on something that requires deterministic randomness, i.e. a row 
gets the same “random” value no matter the order of the DataFrame. A seeded 
hash seems to be the perfect way to do this, but the existing hashes have 
various limitations:

- hash: 32-bit output (only 4 billion possibilities will result in a lot of 
collisions for many tables: the birthday paradox implies >50% chance of at 
least one for tables larger than 77000 rows, and likely ~1.6 billion collisions 
in a table of size 4 billion)
- sha1/sha2/md5: single binary column input, string output

It seems there’s already support for a 64-bit hash function that can work with 
an arbitrary number of arbitrary-typed columns (XxHash64), and exposing this 
for DataFrames seems like it’s essentially one line in sql/functions.scala to 
match `hash` (plus docs, tests, function registry etc.):

def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like

val seed = rng.nextLong()
hash64(lit(seed), col1, col2)
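
(To illustrate the seed-via-`lit` pattern with what exists today, here is a rough
PySpark sketch using the existing 32-bit `hash`; the column names are made up, and
the proposed `xxhash64` would slot in the same way:)

```
# Rough sketch: deterministic "randomness" via a seeded hash of a row's columns.
from pyspark.sql import SparkSession, functions as F
import random

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

seed = random.getrandbits(31)
df = df.withColumn("pseudo_rand", F.hash(F.lit(seed), F.col("id"), F.col("label")))
df.show()
```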

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at 
https://github.com/apache/spark/compare/master...huonw:hash64; should I open a 
JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being 
seeded, but this isn’t exposed directly and instead requires something like the 
`lit` above. Would it make sense to add overloads like the following?

def hash(seed: Int, cols: Column*) = …
def hash64(seed: Long, cols: Column*) = …

Though, it does seem a bit unfortunate to be forced to pass the seed first.

(I sent this email to u...@spark.apache.org a few 
days ago, but didn't get any discussion about the Spark aspects of this, so I'm 
resending it here; I apologise in advance if I'm breaking protocol!)

- Huon Wilson

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-07 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until March 11 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc6 (commit 
201ec8c9b46f9d037cc2e3a5d9c896b9840ca1bc):
https://github.com/apache/spark/tree/v2.4.1-rc6

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1308/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc6-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC to see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Honor ParseMode in AvroFileFormat

2019-03-07 Thread tim
/facepalm

Here we go: https://issues.apache.org/jira/browse/SPARK-27093

Tim



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Honor ParseMode in AvroFileFormat

2019-03-07 Thread tim
Thanks Xiao, it's good to have that validated.

I've created a ticket here: https://issues.apache.org/jira/browse/AVRO-2342



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
right now, i'm using the columns-at-a-time mapping
https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
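
i.e. roughly this shape (sketch; the column names and the "model" are made up):

```
# Sketch of columns-at-a-time mapping: a SCALAR pandas_udf receives each column
# as a pandas Series, one Arrow batch at a time. Names are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["x", "y"])

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def score(x, y):
    # each argument is a whole column batch (pd.Series), not a single row
    return 0.5 * x + 0.5 * y

df.withColumn("score", score("x", "y")).show()
```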



On Thu, Mar 7, 2019 at 4:00 PM Sean Owen  wrote:

> Maybe, it depends on what you're doing. It sounds like you are trying
> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
> you're doing vectorized? may not help much.
> Just make the pandas Series into a DataFrame if you want? and a single
> col back to Series?
>
> On Thu, Mar 7, 2019 at 2:45 PM peng yu  wrote:
> >
> > pandas/arrow is for the memory efficiency, and mapPartitions is only
> available to rdds, for sure i can do everything in rdd.
> >
> > But i thought that's the whole point of having pandas_udf, so my program
> run faster and consumes less memory ?
> >
> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen  wrote:
> >>
> >> Are you just applying a function to every row in the DataFrame? you
> >> don't need pandas at all. Just get the RDD of Row from it and map a
> >> UDF that makes another Row, and go back to DataFrame. Or make a UDF
> >> that operates on all columns and returns a new value. mapPartitions is
> >> also available if you want to transform an iterator of Row to another
> >> iterator of Row.
> >>
> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
> >> >
> >> > it is very similar to SCALAR, but for SCALAR the output can't be
> struct/row and the input has to be pd.Series, which doesn't support a row.
> >> >
> >> > I'm doing tensorflow batch inference in spark,
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
> >> >
> >> > Which i have to do the groupBy in order to use the apply function,
> i'm wondering why not just enable apply to df ?
> >> >
> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:
> >> >>
> >> >> Are you looking for SCALAR? that lets you map one row to one row, but
> >> >> do it more efficiently in batch. What are you trying to do?
> >> >>
> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
> >> >> >
> >> >> > I'm looking for a mapPartition(pandas_udf) for  a
> pyspark.Dataframe.
> >> >> >
> >> >> > ```
> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
> >> >> > def do_nothing(pandas_df):
> >> >> > return pandas_df
> >> >> >
> >> >> >
> >> >> > new_df = df.mapPartition(do_nothing)
> >> >> > ```
> >> >> > pandas_udf only support scala or GROUPED_MAP.  Why not support
> just Map?
> >> >> >
> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
> >> >> >>
> >> >> >> Are you looking for @pandas_udf in Python? Or just mapPartition?
> Those exist already
> >> >> >>
> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
> >> >> >>>
> >> >> >>> There is a nice map_partition function in R `dapply`.  so that
> user can pass a row to udf.
> >> >> >>>
> >> >> >>> I'm wondering why we don't have that in python?
> >> >> >>>
> >> >> >>> I'm trying to have a map_partition function with pandas_udf
> supported
> >> >> >>>
> >> >> >>> thanks!
>


Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Maybe, it depends on what you're doing. It sounds like you are trying
to do row-at-a-time mapping, even on a pandas DataFrame. Is what
you're doing vectorized? may not help much.
Just make the pandas Series into a DataFrame if you want? and a single
col back to Series?

On Thu, Mar 7, 2019 at 2:45 PM peng yu  wrote:
>
> pandas/arrow is for the memory efficiency, and mapPartitions is only 
> available to rdds, for sure i can do everything in rdd.
>
> But i thought that's the whole point of having pandas_udf, so my program run 
> faster and consumes less memory ?
>
> On Thu, Mar 7, 2019 at 3:40 PM Sean Owen  wrote:
>>
>> Are you just applying a function to every row in the DataFrame? you
>> don't need pandas at all. Just get the RDD of Row from it and map a
>> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>> that operates on all columns and returns a new value. mapPartitions is
>> also available if you want to transform an iterator of Row to another
>> iterator of Row.
>>
>> On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
>> >
>> > it is very similar to SCALAR, but for SCALAR the output can't be 
>> > struct/row and the input has to be pd.Series, which doesn't support a row.
>> >
>> > I'm doing tensorflow batch inference in spark, 
>> > https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>> >
>> > Which i have to do the groupBy in order to use the apply function, i'm 
>> > wondering why not just enable apply to df ?
>> >
>> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:
>> >>
>> >> Are you looking for SCALAR? that lets you map one row to one row, but
>> >> do it more efficiently in batch. What are you trying to do?
>> >>
>> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
>> >> >
>> >> > I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.
>> >> >
>> >> > ```
>> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>> >> > def do_nothing(pandas_df):
>> >> > return pandas_df
>> >> >
>> >> >
>> >> > new_df = df.mapPartition(do_nothing)
>> >> > ```
>> >> > pandas_udf only support scala or GROUPED_MAP.  Why not support just Map?
>> >> >
>> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
>> >> >>
>> >> >> Are you looking for @pandas_udf in Python? Or just mapPartition? Those 
>> >> >> exist already
>> >> >>
>> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
>> >> >>>
>> >> >>> There is a nice map_partition function in R `dapply`.  so that user 
>> >> >>> can pass a row to udf.
>> >> >>>
>> >> >>> I'm wondering why we don't have that in python?
>> >> >>>
>> >> >>> I'm trying to have a map_partition function with pandas_udf supported
>> >> >>>
>> >> >>> thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Honor ParseMode in AvroFileFormat

2019-03-07 Thread tim
Hi Spark Devs,

We're processing a large number of Avro files with Spark and found that the
Avro reader is missing the ability to handle malformed or truncated files
like the JSON reader. Currently the Avro reader throws exceptions when it
encounters any bad or truncated record in an Avro file, causing the entire
Spark job to fail from a single dodgy file.

Ideally the AvroFileFormat would accept a Permissive or DropMalformed
ParseMode like Spark's JSON format. This would enable the Avro reader to
drop bad records and continue processing the good records rather than abort
the entire job.
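
Concretely, something like the following is what I have in mind; the JSON half
works today, while the Avro `mode` option is the part being proposed (paths are
made up):

```
# The JSON reader already honors ParseMode; the Avro option below is the
# proposed (currently hypothetical) behavior. Paths are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Works today: malformed JSON records are dropped instead of failing the job.
json_df = spark.read.option("mode", "DROPMALFORMED").json("/data/events.json")

# Proposed: the same ParseMode for Avro (spark-avro package on the classpath).
avro_df = (spark.read.format("avro")
           .option("mode", "DROPMALFORMED")  # hypothetical option
           .load("/data/events/*.avro"))
```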

I've searched through Jira and haven’t found any related issues, but it’s a
relatively straight-forward change that brings consistency across the
readers. Obviously the default could remain as FailFastMode, which is the
current effective behavior, so this wouldn’t break any existing users.

Is there any reason why this behavior doesn't exist or obvious workaround
that I missed?

If not, are there any further details needed to consider adding this
capability to Spark's Avro reader? I’m happy to propose a solution and
contribute this update if somebody isn't already working on it.

Thanks,
Tim



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
it is very similar to SCALAR, but for SCALAR the output can't be struct/row
and the input has to be pd.Series, which doesn't support a row.

I'm doing tensorflow batch inference in spark,
https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108

So I have to do a groupBy just to use the apply function; I'm wondering why not
just enable apply on the DataFrame directly?
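
To be concrete, the current workaround looks roughly like this (sketch; the
grouping column is fabricated just to drive GROUPED_MAP, and names are made up):

```
# Sketch: fabricate a grouping column so a GROUPED_MAP pandas_udf can act like
# a per-partition map over whole pandas DataFrames.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("part", F.spark_partition_id())

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def run_batch(pdf):
    # pdf is one whole group as a pandas DataFrame, e.g. fed to TensorFlow here
    return pdf

result = df.groupBy("part").apply(run_batch)
result.count()
```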

On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:

> Are you looking for SCALAR? that lets you map one row to one row, but
> do it more efficiently in batch. What are you trying to do?
>
> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
> >
> > I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.
> >
> > ```
> > @pandas_udf(df.schema, PandasUDFType.MAP)
> > def do_nothing(pandas_df):
> > return pandas_df
> >
> >
> > new_df = df.mapPartition(do_nothing)
> > ```
> > pandas_udf only support scala or GROUPED_MAP.  Why not support just Map?
> >
> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
> >>
> >> Are you looking for @pandas_udf in Python? Or just mapPartition? Those
> exist already
> >>
> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
> >>>
> >>> There is a nice map_partition function in R `dapply`.  so that user
> can pass a row to udf.
> >>>
> >>> I'm wondering why we don't have that in python?
> >>>
> >>> I'm trying to have a map_partition function with pandas_udf supported
> >>>
> >>> thanks!
>


RE: Hive Hash in Spark

2019-03-07 Thread tcondie
Thanks Ryan and Reynold for the information!

 

Cheers,

Tyson

 

From: Ryan Blue  
Sent: Wednesday, March 6, 2019 3:47 PM
To: Reynold Xin 
Cc: tcon...@gmail.com; Spark Dev List 
Subject: Re: Hive Hash in Spark

 

I think this was needed to add support for bucketed Hive tables. Like Tyson 
noted, if the other side of a join can be bucketed the same way, then Spark can 
use a bucketed join. I have long-term plans to support this in the DataSourceV2 
API, but I don't think we are very close to implementing it yet.
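
For context, a rough sketch of the bucketed-table support Spark has today (with
its own Murmur3-based bucketing); table and column names are made up:

```
# Rough sketch of Spark's existing (Murmur3-based) bucketing, for contrast with
# Hive-hash bucketing. Table and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
events = spark.range(1000).withColumnRenamed("id", "user_id")

(events.write
    .bucketBy(64, "user_id")
    .sortBy("user_id")
    .saveAsTable("events_bucketed"))

# Joining two tables bucketed the same way on user_id can avoid a full shuffle.
```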

 

rb

 

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin  wrote:

  

 

I think they might be used in bucketing? Not 100% sure.

 

 

On Wed, Mar 06, 2019 at 1:40 PM, tcon...@gmail.com wrote:

Hi,

 

I noticed the existence of a Hive Hash partitioning implementation in Spark, 
but also noticed that it's not being used, and that the Spark hash partitioning 
function is presently hardcoded to Murmur3. My question is whether Hive Hash is 
dead code, or whether there are future plans to support reading and understanding 
data that has been partitioned using Hive Hash? By understanding, I mean being 
able to avoid a full shuffle join on Table A (partitioned by Hive Hash) when 
joining with a Table B that I can shuffle via Hive Hash to match Table A. 

 

Thank you,

Tyson

 




 

-- 

Ryan Blue

Software Engineer

Netflix



Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
And in this case I'm actually benefiting from Arrow's columnar support, so I
can pass the whole data block to TensorFlow and obtain the block of predictions
all at once.


On Thu, Mar 7, 2019 at 3:45 PM peng yu  wrote:

> pandas/arrow is for the memory efficiency, and mapPartitions is only
> available to rdds, for sure i can do everything in rdd.
>
> But i thought that's the whole point of having pandas_udf, so my program
> run faster and consumes less memory ?
>
> On Thu, Mar 7, 2019 at 3:40 PM Sean Owen  wrote:
>
>> Are you just applying a function to every row in the DataFrame? you
>> don't need pandas at all. Just get the RDD of Row from it and map a
>> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>> that operates on all columns and returns a new value. mapPartitions is
>> also available if you want to transform an iterator of Row to another
>> iterator of Row.
>>
>> On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
>> >
>> > it is very similar to SCALAR, but for SCALAR the output can't be
>> struct/row and the input has to be pd.Series, which doesn't support a row.
>> >
>> > I'm doing tensorflow batch inference in spark,
>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>> >
>> > Which i have to do the groupBy in order to use the apply function, i'm
>> wondering why not just enable apply to df ?
>> >
>> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:
>> >>
>> >> Are you looking for SCALAR? that lets you map one row to one row, but
>> >> do it more efficiently in batch. What are you trying to do?
>> >>
>> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
>> >> >
>> >> > I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.
>> >> >
>> >> > ```
>> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>> >> > def do_nothing(pandas_df):
>> >> > return pandas_df
>> >> >
>> >> >
>> >> > new_df = df.mapPartition(do_nothing)
>> >> > ```
>> >> > pandas_udf only support scala or GROUPED_MAP.  Why not support just
>> Map?
>> >> >
>> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
>> >> >>
>> >> >> Are you looking for @pandas_udf in Python? Or just mapPartition?
>> Those exist already
>> >> >>
>> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
>> >> >>>
>> >> >>> There is a nice map_partition function in R `dapply`.  so that
>> user can pass a row to udf.
>> >> >>>
>> >> >>> I'm wondering why we don't have that in python?
>> >> >>>
>> >> >>> I'm trying to have a map_partition function with pandas_udf
>> supported
>> >> >>>
>> >> >>> thanks!
>>
>


Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
Pandas/Arrow is for memory efficiency, and mapPartitions is only
available on RDDs; sure, I can do everything with RDDs.

But I thought that's the whole point of having pandas_udf, so my program
runs faster and consumes less memory?

On Thu, Mar 7, 2019 at 3:40 PM Sean Owen  wrote:

> Are you just applying a function to every row in the DataFrame? you
> don't need pandas at all. Just get the RDD of Row from it and map a
> UDF that makes another Row, and go back to DataFrame. Or make a UDF
> that operates on all columns and returns a new value. mapPartitions is
> also available if you want to transform an iterator of Row to another
> iterator of Row.
>
> On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
> >
> > it is very similar to SCALAR, but for SCALAR the output can't be
> struct/row and the input has to be pd.Series, which doesn't support a row.
> >
> > I'm doing tensorflow batch inference in spark,
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
> >
> > Which i have to do the groupBy in order to use the apply function, i'm
> wondering why not just enable apply to df ?
> >
> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:
> >>
> >> Are you looking for SCALAR? that lets you map one row to one row, but
> >> do it more efficiently in batch. What are you trying to do?
> >>
> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
> >> >
> >> > I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.
> >> >
> >> > ```
> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
> >> > def do_nothing(pandas_df):
> >> > return pandas_df
> >> >
> >> >
> >> > new_df = df.mapPartition(do_nothing)
> >> > ```
> >> > pandas_udf only support scala or GROUPED_MAP.  Why not support just
> Map?
> >> >
> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
> >> >>
> >> >> Are you looking for @pandas_udf in Python? Or just mapPartition?
> Those exist already
> >> >>
> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
> >> >>>
> >> >>> There is a nice map_partition function in R `dapply`.  so that user
> can pass a row to udf.
> >> >>>
> >> >>> I'm wondering why we don't have that in python?
> >> >>>
> >> >>> I'm trying to have a map_partition function with pandas_udf
> supported
> >> >>>
> >> >>> thanks!
>


Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you just applying a function to every row in the DataFrame? you
don't need pandas at all. Just get the RDD of Row from it and map a
UDF that makes another Row, and go back to DataFrame. Or make a UDF
that operates on all columns and returns a new value. mapPartitions is
also available if you want to transform an iterator of Row to another
iterator of Row.

On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
>
> it is very similar to SCALAR, but for SCALAR the output can't be struct/row 
> and the input has to be pd.Series, which doesn't support a row.
>
> I'm doing tensorflow batch inference in spark, 
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>
> Which i have to do the groupBy in order to use the apply function, i'm 
> wondering why not just enable apply to df ?
>
> On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:
>>
>> Are you looking for SCALAR? that lets you map one row to one row, but
>> do it more efficiently in batch. What are you trying to do?
>>
>> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
>> >
>> > I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.
>> >
>> > ```
>> > @pandas_udf(df.schema, PandasUDFType.MAP)
>> > def do_nothing(pandas_df):
>> > return pandas_df
>> >
>> >
>> > new_df = df.mapPartition(do_nothing)
>> > ```
>> > pandas_udf only support scala or GROUPED_MAP.  Why not support just Map?
>> >
>> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
>> >>
>> >> Are you looking for @pandas_udf in Python? Or just mapPartition? Those 
>> >> exist already
>> >>
>> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
>> >>>
>> >>> There is a nice map_partition function in R `dapply`.  so that user can 
>> >>> pass a row to udf.
>> >>>
>> >>> I'm wondering why we don't have that in python?
>> >>>
>> >>> I'm trying to have a map_partition function with pandas_udf supported
>> >>>
>> >>> thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for SCALAR? that lets you map one row to one row, but
do it more efficiently in batch. What are you trying to do?

On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
>
> I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.
>
> ```
> @pandas_udf(df.schema, PandasUDFType.MAP)
> def do_nothing(pandas_df):
> return pandas_df
>
>
> new_df = df.mapPartition(do_nothing)
> ```
> pandas_udf only support scala or GROUPED_MAP.  Why not support just Map?
>
> On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:
>>
>> Are you looking for @pandas_udf in Python? Or just mapPartition? Those exist 
>> already
>>
>> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
>>>
>>> There is a nice map_partition function in R `dapply`.  so that user can 
>>> pass a row to udf.
>>>
>>> I'm wondering why we don't have that in python?
>>>
>>> I'm trying to have a map_partition function with pandas_udf supported
>>>
>>> thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
I'm looking for a mapPartition(pandas_udf) for  a pyspark.Dataframe.

```
@pandas_udf(df.schema, PandasUDFType.MAP)
def do_nothing(pandas_df):
    return pandas_df


new_df = df.mapPartition(do_nothing)
```
pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP?

On Thu, Mar 7, 2019 at 2:57 PM Sean Owen  wrote:

> Are you looking for @pandas_udf in Python? Or just mapPartition? Those
> exist already
>
> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
>
>> There is a nice map_partition function in R `dapply`.  so that user can
>> pass a row to udf.
>>
>> I'm wondering why we don't have that in python?
>>
>> I'm trying to have a map_partition function with pandas_udf supported
>>
>> thanks!
>>
>


Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for @pandas_udf in Python? Or just mapPartition? Those
exist already

On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:

> There is a nice map_partition function in R `dapply`.  so that user can
> pass a row to udf.
>
> I'm wondering why we don't have that in python?
>
> I'm trying to have a map_partition function with pandas_udf supported
>
> thanks!
>


[pyspark] dataframe map_partition

2019-03-07 Thread peng yu
There is a nice map_partition function in R `dapply`.  so that user can
pass a row to udf.

I'm wondering why we don't have that in python?

I'm trying to have a map_partition function with pandas_udf supported

thanks!


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread shane knapp
https://issues.apache.org/jira/browse/SPARK-26742



On Thu, Mar 7, 2019 at 10:52 AM shane knapp  wrote:

> i'm ready to update the ubuntu workers/minikube/k8s to support the 4.1.2
> client:
> https://issues.apache.org/jira/browse/SPARK-2674
>
> i am more than comfortable with this build system update, both on the ops
> and spark project side.  we were incredibly far behind the release cycle
> for k8s and minikube, which was beginning to impact the dep graph.
> updating to at least k8s v1.13 and the 4.1.2 client lib gives us a lot of
> breathing room w/little worry about backwards compatibility.
>
> if this is something we're comfortable with doing for the 2.4.1 release
> (+master), then i'll need to take down the pull request builder for ~30
> mins (which will be it's own email to dev@).
>
> shane
>
> On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Yes, it's a tough decision, and as we discussed today (
>> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
>> )
>> "Kubernetes support window is 9 months, Spark is two years". So we may
>> end up with old client versions on branches still supported like 2.4.x in
>> the future.
>> That gives us no choice but to upgrade, if we want to be on the safe
>> side. We have tested 3.0.0 with 1.11 internally and it works but I dont
>> know what it means to run with old
>> clients.
>>
>>
>> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>>
>>> If the old client is basically unusable with the versions of K8S
>>> people mostly use now, and the new client still works with older
>>> versions, I could see including this in 2.4.1.
>>>
>>> Looking at
>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
>>> it seems like the 4.1.1 client is needed for 1.10 and above. However
>>> it no longer supports 1.7 and below.
>>> We have 3.0.x, and versions through 4.0.x of the client support the
>>> same K8S versions, so no real middle ground here.
>>>
>>> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
>>> branches are maintained for 9 months per
>>> https://kubernetes.io/docs/setup/version-skew-policy/
>>>
>>> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
>>> used the newer client from the start as at that point (?) 1.7 and
>>> earlier were already at least 7 months past EOL.
>>> If we update the client in 2.4.1, versions of K8S as recently
>>> 'supported' as a year ago won't work anymore. I'm guessing there are
>>> still 1.7 users out there? That wasn't that long ago but if the
>>> project and users generally move fast, maybe not.
>>>
>>> Normally I'd say, that's what the next minor release of Spark is for;
>>> update if you want later infra. But there is no Spark 2.5.
>>> I presume downstream distros could modify the dependency easily (?) if
>>> needed and maybe already do. It wouldn't necessarily help end users.
>>>
>>> Does the 3.0.x client not work at all with 1.10+ or just unsupported.
>>> If it 'basically works but no guarantees' I'd favor not updating. If
>>> it doesn't work at all, hm. That's tough. I think I'd favor updating
>>> the client but think it's a tough call both ways.
>>>
>>>
>>>
>>> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>>>  wrote:
>>> >
>>> > Yes Shane Knapp has done the work for that already,  and also tests
>>> pass, I am working on a PR now, I could submit it for the 2.4 branch .
>>> > I understand that this is a major dependency update, but the problem I
>>> see is that the client version is so old that I dont think it makes
>>> > much sense for current users who are on k8s 1.10, 1.11 etc(
>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
>>> 3.0.0 does not even exist in there).
>>> > I dont know what it means to use that old version with current k8s
>>> clusters in terms of bugs etc.
>>>
>>
>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread shane knapp
i'm ready to update the ubuntu workers/minikube/k8s to support the 4.1.2
client:
https://issues.apache.org/jira/browse/SPARK-2674

i am more than comfortable with this build system update, both on the ops
and spark project side.  we were incredibly far behind the release cycle
for k8s and minikube, which was beginning to impact the dep graph.
updating to at least k8s v1.13 and the 4.1.2 client lib gives us a lot of
breathing room w/little worry about backwards compatibility.

if this is something we're comfortable with doing for the 2.4.1 release
(+master), then i'll need to take down the pull request builder for ~30
mins (which will be it's own email to dev@).

shane

On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Yes, it's a tough decision, and as we discussed today (
> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
> )
> "Kubernetes support window is 9 months, Spark is two years". So we may
> end up with old client versions on branches still supported like 2.4.x in
> the future.
> That gives us no choice but to upgrade, if we want to be on the safe side.
> We have tested 3.0.0 with 1.11 internally and it works but I dont know what
> it means to run with old
> clients.
>
>
> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>
>> If the old client is basically unusable with the versions of K8S
>> people mostly use now, and the new client still works with older
>> versions, I could see including this in 2.4.1.
>>
>> Looking at
>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
>> it seems like the 4.1.1 client is needed for 1.10 and above. However
>> it no longer supports 1.7 and below.
>> We have 3.0.x, and versions through 4.0.x of the client support the
>> same K8S versions, so no real middle ground here.
>>
>> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
>> branches are maintained for 9 months per
>> https://kubernetes.io/docs/setup/version-skew-policy/
>>
>> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
>> used the newer client from the start as at that point (?) 1.7 and
>> earlier were already at least 7 months past EOL.
>> If we update the client in 2.4.1, versions of K8S as recently
>> 'supported' as a year ago won't work anymore. I'm guessing there are
>> still 1.7 users out there? That wasn't that long ago but if the
>> project and users generally move fast, maybe not.
>>
>> Normally I'd say, that's what the next minor release of Spark is for;
>> update if you want later infra. But there is no Spark 2.5.
>> I presume downstream distros could modify the dependency easily (?) if
>> needed and maybe already do. It wouldn't necessarily help end users.
>>
>> Does the 3.0.x client not work at all with 1.10+ or just unsupported.
>> If it 'basically works but no guarantees' I'd favor not updating. If
>> it doesn't work at all, hm. That's tough. I think I'd favor updating
>> the client but think it's a tough call both ways.
>>
>>
>>
>> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>>  wrote:
>> >
>> > Yes Shane Knapp has done the work for that already,  and also tests
>> pass, I am working on a PR now, I could submit it for the 2.4 branch .
>> > I understand that this is a major dependency update, but the problem I
>> see is that the client version is so old that I dont think it makes
>> > much sense for current users who are on k8s 1.10, 1.11 etc(
>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
>> 3.0.0 does not even exist in there).
>> > I dont know what it means to use that old version with current k8s
>> clusters in terms of bugs etc.
>>
>
>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread DB Tsai
Saisai,

There is no blocker now. I ran into some difficulties in publishing the
jars into Nexus. The publish task was finished, but Nexus gave me the
following error.


*failureMessage Failed to validate the pgp signature of
'/org/apache/spark/spark-streaming-flume-assembly_2.11/2.4.1/spark-streaming-flume-assembly_2.11-2.4.1-tests.jar',
check the logs.*

I am sure my key is in the key server, and the weird thing is that it fails
on different jars each time I ran the publish script.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Wed, Mar 6, 2019 at 6:04 PM Saisai Shao  wrote:

> Do we have other block/critical issues for Spark 2.4.1 or waiting
> something to be fixed? I roughly searched the JIRA, seems there's no
> block/critical issues marked for 2.4.1.
>
> Thanks
> Saisai
>
> shane knapp  wrote on Thu, Mar 7, 2019 at 4:57 AM:
>
>> i'll be popping in to the sig-big-data meeting on the 20th to talk about
>> stuff like this.
>>
>> On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Yes, it's a tough decision, and as we discussed today (
>>> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
>>> )
>>> "Kubernetes support window is 9 months, Spark is two years". So we may
>>> end up with old client versions on branches still supported like 2.4.x in
>>> the future.
>>> That gives us no choice but to upgrade, if we want to be on the safe
>>> side. We have tested 3.0.0 with 1.11 internally and it works but I dont
>>> know what it means to run with old
>>> clients.
>>>
>>>
>>> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>>>
 If the old client is basically unusable with the versions of K8S
 people mostly use now, and the new client still works with older
 versions, I could see including this in 2.4.1.

 Looking at
 https://github.com/fabric8io/kubernetes-client#compatibility-matrix
 it seems like the 4.1.1 client is needed for 1.10 and above. However
 it no longer supports 1.7 and below.
 We have 3.0.x, and versions through 4.0.x of the client support the
 same K8S versions, so no real middle ground here.

 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
 branches are maintained for 9 months per
 https://kubernetes.io/docs/setup/version-skew-policy/

 Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
 used the newer client from the start as at that point (?) 1.7 and
 earlier were already at least 7 months past EOL.
 If we update the client in 2.4.1, versions of K8S as recently
 'supported' as a year ago won't work anymore. I'm guessing there are
 still 1.7 users out there? That wasn't that long ago but if the
 project and users generally move fast, maybe not.

 Normally I'd say, that's what the next minor release of Spark is for;
 update if you want later infra. But there is no Spark 2.5.
 I presume downstream distros could modify the dependency easily (?) if
 needed and maybe already do. It wouldn't necessarily help end users.

 Does the 3.0.x client not work at all with 1.10+ or just unsupported.
 If it 'basically works but no guarantees' I'd favor not updating. If
 it doesn't work at all, hm. That's tough. I think I'd favor updating
 the client but think it's a tough call both ways.



 On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
  wrote:
 >
 > Yes Shane Knapp has done the work for that already,  and also tests
 pass, I am working on a PR now, I could submit it for the 2.4 branch .
 > I understand that this is a major dependency update, but the problem
 I see is that the client version is so old that I dont think it makes
 > much sense for current users who are on k8s 1.10, 1.11 etc(
 https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
 3.0.0 does not even exist in there).
 > I dont know what it means to use that old version with current k8s
 clusters in terms of bugs etc.

>>>
>>>
>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Felix Cheung
There is SPARK-26604 we are looking into


From: Saisai Shao 
Sent: Wednesday, March 6, 2019 6:05 PM
To: shane knapp
Cc: Stavros Kontopoulos; Sean Owen; DB Tsai; Spark dev list; d_t...@apple.com
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Do we have other block/critical issues for Spark 2.4.1 or waiting something to 
be fixed? I roughly searched the JIRA, seems there's no block/critical issues 
marked for 2.4.1.

Thanks
Saisai

shane knapp  wrote on Thu, Mar 7, 2019 at 4:57 AM:
i'll be popping in to the sig-big-data meeting on the 20th to talk about stuff 
like this.

On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos  wrote:
Yes, it's a tough decision, and as we discussed today 
(https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA)
"Kubernetes support window is 9 months, Spark is two years". So we may end up 
with old client versions on branches still supported like 2.4.x in the future.
That gives us no choice but to upgrade, if we want to be on the safe side. We 
have tested 3.0.0 with 1.11 internally and it works but I dont know what it 
means to run with old
clients.


On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
If the old client is basically unusable with the versions of K8S
people mostly use now, and the new client still works with older
versions, I could see including this in 2.4.1.

Looking at https://github.com/fabric8io/kubernetes-client#compatibility-matrix
it seems like the 4.1.1 client is needed for 1.10 and above. However
it no longer supports 1.7 and below.
We have 3.0.x, and versions through 4.0.x of the client support the
same K8S versions, so no real middle ground here.

1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
branches are maintained for 9 months per
https://kubernetes.io/docs/setup/version-skew-policy/

Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
used the newer client from the start as at that point (?) 1.7 and
earlier were already at least 7 months past EOL.
If we update the client in 2.4.1, versions of K8S as recently
'supported' as a year ago won't work anymore. I'm guessing there are
still 1.7 users out there? That wasn't that long ago but if the
project and users generally move fast, maybe not.

Normally I'd say, that's what the next minor release of Spark is for;
update if you want later infra. But there is no Spark 2.5.
I presume downstream distros could modify the dependency easily (?) if
needed and maybe already do. It wouldn't necessarily help end users.

Does the 3.0.x client not work at all with 1.10+ or just unsupported.
If it 'basically works but no guarantees' I'd favor not updating. If
it doesn't work at all, hm. That's tough. I think I'd favor updating
the client but think it's a tough call both ways.



On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos  wrote:
>
> Yes Shane Knapp has done the work for that already,  and also tests pass, I 
> am working on a PR now, I could submit it for the 2.4 branch .
> I understand that this is a major dependency update, but the problem I see is 
> that the client version is so old that I dont think it makes
> much sense for current users who are on k8s 1.10, 1.11 
> etc(https://github.com/fabric8io/kubernetes-client#compatibility-matrix, 
> 3.0.0 does not even exist in there).
> I dont know what it means to use that old version with current k8s clusters 
> in terms of bugs etc.




--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Sean Owen
Do you know what change fixed it?
If it's not a regression from 2.4.0 it wouldn't necessarily go into a
maintenance release. If there were no downside, maybe; does it cause
any incompatibility with older HBase versions?
It may be that this support is targeted for Spark 3 on purpose, which
is probably due in the middle of the year.

On Thu, Mar 7, 2019 at 8:57 AM Jakub Wozniak  wrote:
>
> Hello,
>
> I have a question regarding the 2.4.1 release.
>
> It looks like Spark 2.4 (and 2.4.1-rc) is not exactly compatible with Hbase 
> 2.x+ for the Yarn mode.
> The problem is in the 
> org.apache.spark.deploy.security.HbaseDelegationTokenProvider class that 
> expects a specific version of TokenUtil class from Hbase that was changed 
> between Hbase 1.x & 2.x.
> On top the HadoopDelegationTokenManager does not use the ServiceLoader class 
> so I cannot attach my own provider (providers are hardcoded).
>
> It seems that both problems are resolved on the Spark master branch.
>
> Is there any reason not to include this fix in the 2.4.1 release?
> If so when do you plan to release it (the fix for Hbase)?
>
> Or maybe there is something I’ve overlooked, please correct me if I’m wrong.
>
> Best regards,
> Jakub
>
>
> On 7 Mar 2019, at 03:04, Saisai Shao  wrote:
>
> Do we have other block/critical issues for Spark 2.4.1 or waiting something 
> to be fixed? I roughly searched the JIRA, seems there's no block/critical 
> issues marked for 2.4.1.
>
> Thanks
> Saisai
>
> shane knapp  wrote on Thu, Mar 7, 2019 at 4:57 AM:
>>
>> i'll be popping in to the sig-big-data meeting on the 20th to talk about 
>> stuff like this.
>>
>> On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos 
>>  wrote:
>>>
>>> Yes, it's a tough decision, and as we discussed today 
>>> (https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA)
>>> "Kubernetes support window is 9 months, Spark is two years". So we may end 
>>> up with old client versions on branches still supported like 2.4.x in the 
>>> future.
>>> That gives us no choice but to upgrade, if we want to be on the safe side. 
>>> We have tested 3.0.0 with 1.11 internally and it works but I dont know what 
>>> it means to run with old
>>> clients.
>>>
>>>
>>> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:

 If the old client is basically unusable with the versions of K8S
 people mostly use now, and the new client still works with older
 versions, I could see including this in 2.4.1.

 Looking at 
 https://github.com/fabric8io/kubernetes-client#compatibility-matrix
 it seems like the 4.1.1 client is needed for 1.10 and above. However
 it no longer supports 1.7 and below.
 We have 3.0.x, and versions through 4.0.x of the client support the
 same K8S versions, so no real middle ground here.

 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
 branches are maintained for 9 months per
 https://kubernetes.io/docs/setup/version-skew-policy/

 Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
 used the newer client from the start as at that point (?) 1.7 and
 earlier were already at least 7 months past EOL.
 If we update the client in 2.4.1, versions of K8S as recently
 'supported' as a year ago won't work anymore. I'm guessing there are
 still 1.7 users out there? That wasn't that long ago but if the
 project and users generally move fast, maybe not.

 Normally I'd say, that's what the next minor release of Spark is for;
 update if you want later infra. But there is no Spark 2.5.
 I presume downstream distros could modify the dependency easily (?) if
 needed and maybe already do. It wouldn't necessarily help end users.

 Does the 3.0.x client not work at all with 1.10+ or just unsupported.
 If it 'basically works but no guarantees' I'd favor not updating. If
 it doesn't work at all, hm. That's tough. I think I'd favor updating
 the client but think it's a tough call both ways.



 On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
  wrote:
 >
 > Yes Shane Knapp has done the work for that already,  and also tests 
 > pass, I am working on a PR now, I could submit it for the 2.4 branch .
 > I understand that this is a major dependency update, but the problem I 
 > see is that the client version is so old that I dont think it makes
 > much sense for current users who are on k8s 1.10, 1.11 
 > etc(https://github.com/fabric8io/kubernetes-client#compatibility-matrix, 
 > 3.0.0 does not even exist in there).
 > I dont know what it means to use that old version with current k8s 
 > clusters in terms of bugs etc.
>>>
>>>
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Jakub Wozniak
Hello,

I have a question regarding the 2.4.1 release.

It looks like Spark 2.4 (and 2.4.1-rc) is not exactly compatible with Hbase 
2.x+ for the Yarn mode.
The problem is in the 
org.apache.spark.deploy.security.HbaseDelegationTokenProvider class that 
expects a specific version of TokenUtil class from Hbase that was changed 
between Hbase 1.x & 2.x.
On top of that, the HadoopDelegationTokenManager does not use the ServiceLoader 
class, so I cannot attach my own provider (providers are hardcoded).

It seems that both problems are resolved on the Spark master branch.

Is there any reason not to include this fix in the 2.4.1 release?
If so when do you plan to release it (the fix for Hbase)?

Or maybe there is something I’ve overlooked, please correct me if I’m wrong.

Best regards,
Jakub


On 7 Mar 2019, at 03:04, Saisai Shao  wrote:

Do we have other block/critical issues for Spark 2.4.1 or waiting something to 
be fixed? I roughly searched the JIRA, seems there's no block/critical issues 
marked for 2.4.1.

Thanks
Saisai

shane knapp  wrote on Thu, Mar 7, 2019 at 4:57 AM:
i'll be popping in to the sig-big-data meeting on the 20th to talk about stuff 
like this.

On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos  wrote:
Yes, it's a tough decision, and as we discussed today 
(https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA)
"Kubernetes support window is 9 months, Spark is two years". So we may end up 
with old client versions on branches still supported like 2.4.x in the future.
That gives us no choice but to upgrade, if we want to be on the safe side. We 
have tested 3.0.0 with 1.11 internally and it works but I dont know what it 
means to run with old
clients.


On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
If the old client is basically unusable with the versions of K8S
people mostly use now, and the new client still works with older
versions, I could see including this in 2.4.1.

Looking at https://github.com/fabric8io/kubernetes-client#compatibility-matrix
it seems like the 4.1.1 client is needed for 1.10 and above. However
it no longer supports 1.7 and below.
We have 3.0.x, and versions through 4.0.x of the client support the
same K8S versions, so no real middle ground here.

1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
branches are maintained for 9 months per
https://kubernetes.io/docs/setup/version-skew-policy/

Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
used the newer client from the start as at that point (?) 1.7 and
earlier were already at least 7 months past EOL.
If we update the client in 2.4.1, versions of K8S as recently
'supported' as a year ago won't work anymore. I'm guessing there are
still 1.7 users out there? That wasn't that long ago but if the
project and users generally move fast, maybe not.

Normally I'd say, that's what the next minor release of Spark is for;
update if you want later infra. But there is no Spark 2.5.
I presume downstream distros could modify the dependency easily (?) if
needed and maybe already do. It wouldn't necessarily help end users.

Does the 3.0.x client not work at all with 1.10+ or just unsupported.
If it 'basically works but no guarantees' I'd favor not updating. If
it doesn't work at all, hm. That's tough. I think I'd favor updating
the client but think it's a tough call both ways.



On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos  wrote:
>
> Yes Shane Knapp has done the work for that already,  and also tests pass, I 
> am working on a PR now, I could submit it for the 2.4 branch .
> I understand that this is a major dependency update, but the problem I see is 
> that the client version is so old that I dont think it makes
> much sense for current users who are on k8s 1.10, 1.11 
> etc(https://github.com/fabric8io/kubernetes-client#compatibility-matrix, 
> 3.0.0 does not even exist in there).
> I dont know what it means to use that old version with current k8s clusters 
> in terms of bugs etc.




--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu



Re: Two spark applications listen on same port on same machine

2019-03-07 Thread Moein Hosseini
I'm sure only the first one listens on the port, but in the master UI both
applications redirect to the same machine and port. When I checked the URLs,
they both redirect to the application UI of the first submitted job. So I
think it could be a problem only in the UI.
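
(For what it's worth, the UI port can also be pinned per application to rule
out a plain port clash; a minimal sketch, with an arbitrary port number:)

```
# Minimal sketch: pin the application UI port explicitly (port number is arbitrary).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("job-a")
         .config("spark.ui.port", "4050")
         .getOrCreate())
```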

On Wed, Mar 6, 2019, 10:29 PM Sean Owen  wrote:

> Two drivers can't be listening on port 4040 at the same time -- on the
> same machine. The OS wouldn't allow it. Are they actually on different
> machines or somehow different interfaces? or are you saying the reported
> port is wrong?
>
> On Wed, Mar 6, 2019 at 12:23 PM Moein Hosseini  wrote:
>
>> I've submitted two spark applications in a cluster of 3 standalone nodes at
>> nearly the same time (I have a bash script to submit them one after another
>> without delay). But something goes wrong. In the master UI, the Running
>> Applications section shows both of my jobs with the correct configuration
>> (cores, memory and different application IDs), but both redirect to port
>> number 4040, which is listened on by the second submitted job.
>> I think it could be a race condition in the UI but found nothing in the logs.
>> Could you help me investigate where I should look for the cause?
>>
>> Best Regards
>> Moein
>>
>> --
>>
>> Moein Hosseini
>> Data Engineer
>> mobile: +98 912 468 1859
>> site: www.moein.xyz
>> email: moein...@gmail.com
>>
>>