Re: Time for 2.3.2?

2018-07-03 Thread Saisai Shao
FYI, currently we have one blocker issue (
https://issues.apache.org/jira/browse/SPARK-24535); I will start the release
after this is fixed.

Also, please let me know if there are other blockers or fixes you want to land in
the 2.3.2 release.

Thanks
Saisai

Saisai Shao wrote on Mon, Jul 2, 2018 at 1:16 PM:

> I will start preparing the release.
>
> Thanks
>
> John Zhuge wrote on Sat, Jun 30, 2018 at 10:31 AM:
>
>> +1  Looking forward to the critical fixes in 2.3.2.
>>
>> On Thu, Jun 28, 2018 at 9:37 AM Ryan Blue 
>> wrote:
>>
>>> +1
>>>
>>> On Thu, Jun 28, 2018 at 9:34 AM Xiao Li  wrote:
>>>
 +1. Thanks, Saisai!

 The impact of SPARK-24495 is large. We should release Spark 2.3.2 ASAP.

 Thanks,

 Xiao

 2018-06-27 23:28 GMT-07:00 Takeshi Yamamuro :

> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>
> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
> wrote:
>
>> +1
>>
>> Wenchen Fan wrote on Thu, Jun 28, 2018 at 2:06 PM:
>>
>>> Hi Saisai, that's great! please go ahead!
>>>
>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>> wrote:
>>>
 +1. As mentioned by Marcelo, these issues seem quite severe.

 I can work on the release if short of hands :).

 Thanks
 Jerry


 Marcelo Vanzin wrote on Thu, Jun 28, 2018 at 11:40 AM:

> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get
> fixes
> for those out.
>
> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>
> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
> wrote:
> > Hi all,
> >
> > Spark 2.3.1 was released just a while ago, but unfortunately we
> discovered
> > and fixed some critical issues afterward.
> >
> > SPARK-24495: SortMergeJoin may produce wrong result.
> > This is a serious correctness bug, and it is easy to hit: the join key from the
> > left table is duplicated in the join condition, e.g. `WHERE t1.a = t2.b AND
> > t1.a = t2.c`, and the join is a sort merge join. This bug is only present in
> > Spark 2.3.
> >
> > SPARK-24588: stream-stream join may produce wrong result
> > This is a correctness bug in a new feature of Spark 2.3: the stream-stream
> > join. Users can hit this bug if one of the join sides is partitioned by a
> > subset of the join keys.
> >
> > SPARK-24552: Task attempt numbers are reused when stages are retried
> > This is a long-standing bug in the output committer that may introduce data
> > corruption.
> >
> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
> > access arbitrary files
> > This is a potential security issue if users build an access control module on
> > top of Spark.
> >
> > I think we need a Spark 2.3.2 to address these issues (especially the
> > correctness bugs) ASAP. Any thoughts?
> >
> > Thanks,
> > Wenchen
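To make the first issue concrete, below is a minimal Scala sketch of the query
shape described above. The tables, columns, and data are hypothetical, and it only
illustrates the shape (the left side's join key appears in both equi-conditions and
broadcast is disabled so a sort merge join is chosen); it is not a verified
reproduction of SPARK-24495.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("smj-shape-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // Left table with a duplicated join key value in column `a`.
    Seq((1, "x"), (1, "y")).toDF("a", "v").createOrReplaceTempView("t1")
    // Right table joined on two columns, both compared against t1.a.
    Seq((1, 1, "z")).toDF("b", "c", "w").createOrReplaceTempView("t2")

    // Disable broadcast joins so the planner picks a sort merge join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a = t2.b AND t1.a = t2.c").show()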
>
>
>
> --
> Marcelo
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> --
> ---
> Takeshi Yamamuro
>


>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: Revisiting Online serving of Spark models?

2018-07-03 Thread Matei Zaharia
Just wondering, is there an update on this? I haven’t seen a summary of the 
offline discussion but maybe I’ve missed it.

Matei 

> On Jun 11, 2018, at 8:51 PM, Holden Karau  wrote:
> 
> So I kicked off a thread on user@ to collect people's feedback there, but I'll
> summarize the offline results later this week too.
> 
> On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh  wrote:
> 
> Hi,
> 
> It'd be great if there can be any sharing of the offline discussion. Thanks!
> 
> 
> 
> Holden Karau wrote:
> > We’re by the registration sign, going to start walking over at 4:05
> >
> >> On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice wrote:
> > 
> >> Hi!
> >>
> >> Do we meet at the entrance?
> >>
> >> See you
> >>
> >>
> >> On Tue, Jun 5, 2018 at 3:07 PM, Nick Pentreath wrote:
> >>
> >>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
> >>>
> >>> On Sun, 3 Jun 2018 at 00:24 Holden Karau wrote:
> >>>
>  On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice wrote:
> 
> > Hi!
> >
> > We're already in San Francisco waiting for the summit. We even think
> > that we spotted @holdenk this afternoon.
> >
>  Unless you happened to be walking by my garage, probably not super
>  likely; I spent the day working on scooters/motorcycles (my style is a little
>  less unique in SF :)). Also, if you see me, feel free to say hi unless I look
>  like I haven't had my first coffee of the day; I love chatting with folks
>  IRL :)
> 
> >
> > @chris, we're really interested in the Meetup you're hosting. My team
> > will probably join it from the beginning if you have room for us, and I'll
> > join it later after discussing the topics on this thread. I'll send you an
> > email regarding this request.
> >
> > Thanks
> >
> > On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal wrote:
> >
> >> @Chris This sounds fantastic, please send summary notes for Seattle
> >> folks
> >>
> >> @Felix I work in downtown Seattle, and am wondering if we should host a tech
> >> meetup around model serving in Spark at my work or somewhere close by,
> >> thoughts?  I’m actually in the midst of building microservices to manage
> >> models, and when I say models I mean much more than machine learning models
> >> (think OR and process models as well)
> >>
> >> Regards
> >>
> >> Sent from my iPhone
> >>
> >> On May 31, 2018, at 10:32 PM, Chris Fregly wrote:
> >>
> >> Hey everyone!
> >>
> >> @Felix:  thanks for putting this together.  i sent some of you a
> >> quick
> >> calendar event - mostly for me, so i don’t forget!  :)
> >>
> >> Coincidentally, this is the focus of June 6th's *Advanced Spark and
> >> TensorFlow Meetup*
> >> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/
> >> @5:30pm on June 6th (same night) here in SF!
> >>
> >> Everybody is welcome to come.  Here’s the link to the meetup that
> >> includes the signup link:
> >> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/
> >>
> >> We have an awesome lineup of speakers covering a lot of deep, technical
> >> ground.
> >>
> >> For those who can’t attend in person, we’ll be broadcasting live -
> >> and
> >> posting the recording afterward.
> >>
> >> All details are in the meetup link above…
> >>
> >> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
> >> welcome to give a talk. I can move things around to make room.
> >>
> >> @joseph:  I’d personally like an update on the direction of the
> >> Databricks proprietary ML Serving export format, which is similar to PMML
> >> but not a standard in any way.
> >>
> >> Also, the Databricks ML Serving Runtime is only available to
> >> Databricks customers.  This seems in conflict with the community
> >> efforts
> >> described here.  Can you comment on behalf of Databricks?
> >>
> >> Look forward to your response, joseph.
> >>
> >> See you all soon!
> >>
> >> —
> >>
> >>
> >> *Chris Fregly* Founder @ *PipelineAI* https://pipeline.ai/ (100,000 Users)
> >> Organizer @ *Advanced Spark and TensorFlow Meetup*
> >> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/ (85,000
> >> Global Members)
> >>
> >>
> >>
> >> *San Francisco - Chicago - Austin -
> >> Washington DC - London - Dusseldorf *
> >> *Try our PipelineAI Community Edition 

Re: [SPARK][SQL][CORE] Running sql-tests

2018-07-03 Thread Marco Gaido
Hi Daniel,
please check
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala.
You should find all your answers in the comments there.
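For example, if I remember correctly the header comment of that suite shows how to
run the whole thing with something like build/sbt "sql/test-only *SQLQueryTestSuite",
and also how to regenerate the golden files under results/; please double-check that
comment for the exact commands.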

Thanks,
Marco

2018-07-03 19:08 GMT+02:00 dmateusp :

> Hey everyone!
>
> Newbie question,
>
> I'm trying to run the tests under
> spark/sql/core/src/test/resources/sql-tests/inputs/ but I've had no luck so
> far.
>
> How are they invoked? What is the format of those files? I've never seen
> tests organized as inputs/ and results/ directories; what is this style
> called, and where can I read about it?
>
> Thanks!
>
> Best regards,
> Daniel
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Use Gensim with PySpark

2018-07-03 Thread philipghu
I'm new to PySpark and Python in general. I'm trying to use PySpark to
process a large text file using Gensim's preprocess_string function. What I
did was simply put preprocess_string in PySpark's map function:

rdd1 = text.map(lambda x: (x.preprocess_string(x, CUSTOM_FILTERS)))

It gave me this error when I ran it:

Traceback (most recent call last):

RDD1 = text.map(lambda x: (x.preprocess_string(x, CUSTOM_FILTERS))).cache()
  File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 226, in
cache
  File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 242, in
persist
  File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 2380, in
_jrdd
  File "$SPARK_HOME/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
813, in __call__
  File "$SPARK_HOME/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312,
in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o24.rdd. Trace:
py4j.Py4JException: Method rdd([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)


Gensim is installed on all the nodes in my cluster. My understanding of how PySpark
works is that it ships the Python code from the driver program to the worker nodes
and invokes Python processes to execute it. My question is: what kind of
user-defined map function can PySpark handle? Thanks!







--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark model serving

2018-07-03 Thread Saikat Kanjilal
Ping, would love to hear back on this.



From: Saikat Kanjilal 
Sent: Tuesday, June 26, 2018 7:27 AM
To: dev@spark.apache.org
Subject: Spark model serving

HoldenK and interested folks,
Am just following up on the Spark model serving discussions as this is highly
relevant to what I’m embarking on at work.  Is there a concrete list of next
steps, or can someone summarize what was discussed at the summit? I would love to
have a Seattle version of this discussion with some folks.

Look forward to hearing back and driving this.

Regards

Sent from my iPhone


[SPARK][SQL][CORE] Running sql-tests

2018-07-03 Thread dmateusp
Hey everyone!

Newbie question,

I'm trying to run the tests under
spark/sql/core/src/test/resources/sql-tests/inputs/ but I've had no luck so far.

How are they invoked? What is the format of those files? I've never seen tests
organized as inputs/ and results/ directories; what is this style called, and
where can I read about it?

Thanks!

Best regards,
Daniel



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[Spark Streaming] mapStateByKey Python SDK

2018-07-03 Thread tpawlowski
I am a user of the Spark Streaming Python SDK. In many cases stateful operations
are the only way to approach the problems I am solving. The updateStateByKey
method, which I believe is the only way to introduce a stateful operator, requires
some extra code to extract results from the states and to remove values for keys
not updated for a long time. Yet I can see there is a method in the Scala SDK,
mapStateByKey, which solves these problems. Is there a plan to introduce it into
the Python SDK?
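
For reference, a minimal Scala sketch of this kind of stateful operator as it
exists in the Scala API (mapWithState driven by a StateSpec); the socket source,
port, and checkpoint path below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setAppName("state-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/state-sketch-checkpoint")  // stateful operators require checkpointing

    // Placeholder input: words arriving on a local socket, turned into (word, 1) pairs.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Running count per key. mapWithState emits one record per input, and the timeout
    // drops state for keys that stay idle, which updateStateByKey needs extra code for.
    val spec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
      if (state.isTimingOut()) {
        (word, state.getOption.getOrElse(0))  // key idle past the timeout; its state is dropped
      } else {
        val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)
        (word, sum)
      }
    }).timeout(Seconds(3600))

    pairs.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()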

Thanks!
Tomasz



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev,

I would like to pass a Python user-defined function to a Spark job developed
in Scala, and have the return value of that function surfaced through the DF /
Dataset API.

Can someone please guide me on which would be the best approach to do this?
The Python function would mostly be a transformation function. I would also like
to pass a Java function as a String to the Spark / Scala job, have it applied to
an RDD / Data Frame, and get back an RDD / Data Frame.
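
For illustration only, one possible (and not necessarily the best) direction is to
stream records through an external Python process with RDD.pipe; a minimal Scala
sketch, where the script path is a hypothetical placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pipe-python-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("value")

    // Each input row becomes one line on the Python script's stdin; each line the
    // script writes to stdout becomes one output record.
    val piped = df.map(_.getString(0))                       // Dataset[String] of the column to transform
      .rdd
      .pipe("python3 /path/to/transform.py")                 // placeholder script path

    spark.createDataset(piped).toDF("transformed").show()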

Thank you.


Re: Spark data source resiliency

2018-07-03 Thread assaf.mendelson
You are correct, this solved it.
Thanks



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark data source resiliency

2018-07-03 Thread Wenchen Fan
I believe you are using something like `local[8]` as your Spark master,
which can't retry tasks. Please try `local[8, 3]`, which can retry failed
tasks 3 times.
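
A minimal sketch of what that looks like when building the session (the app name
is hypothetical):

    import org.apache.spark.sql.SparkSession

    // local[N, maxFailures] enables task retries in local mode: 8 worker threads,
    // and a task may be attempted up to 3 times before the job fails.
    val spark = SparkSession.builder()
      .appName("datasource-retry-test")
      .master("local[8,3]")
      .getOrCreate()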

On Tue, Jul 3, 2018 at 2:42 PM assaf.mendelson 
wrote:

> That is what I expected; however, I did a very simple test (using println
> just to see when the exception is triggered in the iterator) using a local
> master, and I saw it fail once and cause the entire operation to fail.
>
> Is this something which may be unique to the local master (or some default
> configuration which should be tested)?  I can't see a specific configuration
> to handle this in the documentation.
>
> Thanks,
> Assaf.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark data source resiliency

2018-07-03 Thread assaf.mendelson
That is what I expected; however, I did a very simple test (using println
just to see when the exception is triggered in the iterator) using a local
master, and I saw it fail once and cause the entire operation to fail.

Is this something which may be unique to the local master (or some default
configuration which should be tested)?  I can't see a specific configuration
to handle this in the documentation.

Thanks,
Assaf.




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark data source resiliency

2018-07-03 Thread Wenchen Fan
A failure in the data reader results in a task failure, and Spark will
retry the task for you (IIRC it retries 3 times before failing the job).

Can you check your Spark log and see if the task fails consistently?
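
For completeness: on a real cluster the retry behaviour comes from
spark.task.maxFailures (the default is 4 attempts, if I remember correctly, and it
does not apply to a plain local[N] master). A minimal sketch of raising it, with a
hypothetical app name:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("retry-config-sketch")
      .config("spark.task.maxFailures", "8")  // number of task attempts before the job fails
      .getOrCreate()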

On Tue, Jul 3, 2018 at 2:17 PM assaf.mendelson 
wrote:

> Hi All,
>
> I have implemented a data source V2 which integrates with an internal system
> and I need to make it resilient to errors in the internal data source.
>
> The issue is that currently, if there is an exception in the data reader,
> the exception seems to fail the entire task. I would prefer instead to just
> restart the relevant partition.
>
> Is there a way to do it or would I need to solve it inside the iterator
> itself?
>
> Thanks,
> Assaf.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Spark data source resiliency

2018-07-03 Thread assaf.mendelson
Hi All,

I have implemented a data source V2 which integrates with an internal system
and I need to make it resilient to errors in the internal data source.

The issue is that currently, if there is an exception in the data reader,
the exception seems to fail the entire task. I would prefer instead to just
restart the relevant partition.

Is there a way to do it or would I need to solve it inside the iterator
itself?

Thanks,
Assaf.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org