[ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-14 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.2.3!

Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend that all 2.2.x users
upgrade to this stable release.

To download Spark 2.2.3, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-2-3.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Bests,
Dongjoon.



Re: Is it possible to rate limit a UDF?

2019-01-14 Thread Ramandeep Singh Nanda
Basically, it is zipping two Flowables using the defined function [which takes
two parameters and returns one, hence the name BiFunction].
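To make the zip concrete in isolation, here is a minimal, self-contained sketch
(assuming RxJava 2's io.reactivex.Flowable; the string items simply stand in for
your rows):

import java.util.concurrent.TimeUnit
import scala.collection.JavaConverters._
import io.reactivex.Flowable
import io.reactivex.functions.BiFunction

// One tick per second; zip can never emit faster than its slower source.
val ticks = Flowable.interval(1, TimeUnit.SECONDS)
val items = Flowable.fromIterable(java.util.Arrays.asList("a", "b", "c"))

// The BiFunction takes one element from each Flowable and returns one result;
// here we drop the tick and keep the item, so items are released at the tick rate.
val throttled = Flowable.zip[java.lang.Long, String, String](ticks, items,
  new BiFunction[java.lang.Long, String, String] {
    override def apply(tick: java.lang.Long, item: String): String = item
  })

// Blocks for about three seconds, printing one item per second.
throttled.toList.blockingGet().asScala.foreach(println)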

Obviously, you could avoid using RxJava by using a TimerTask instead:

import java.util.{Timer, TimerTask}

val a = Seq(1, 2, 3)
val timer = new Timer()
// Schedule each print 200 ms apart so the work is spread over time.
a.zipWithIndex.foreach { case (value, idx) =>
  timer.schedule(new TimerTask {
    override def run(): Unit = println(value)
  }, idx * 200L)
}
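To the question about avoiding third-party libraries: a rough equivalent with
plain Spark is to cap concurrency via the partition count and cap the
per-partition rate with a sleep. This is only a sketch, reusing the person.json
input from the example quoted below; callRestService is a hypothetical
placeholder for the actual web-service call.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.read.json("src/main/resources/person.json")
implicit val encoder = RowEncoder(df.schema)

// 2 partitions => at most 2 concurrent calls; sleeping 200 ms per row => roughly
// 5 calls/s per partition, ~10 calls/s overall. Tune both to the service's limit.
val limited = df.repartition(2).mapPartitions { rows =>
  rows.map { row =>
    Thread.sleep(200)
    // callRestService(row)  // hypothetical: invoke the rate-limited service here
    row
  }
}
limited.show()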


On Sat, Jan 12, 2019 at 9:25 PM  wrote:

> Thank you for your suggestion Ramandeep, but the code is not clear to me.
> Could you please explain it? Particularly this part:
>
>
>
> Flowable.zip[java.lang.Long, Row, Row](delSt, itF, new BiFunction[java.lang.Long, Row, Row]() {
>
>
>
> Also, is it possible to achieve this without third-party libraries?
>
>
>
> Thank you
>
>
>
> From: Ramandeep Singh 
> Sent: Thursday, January 10, 2019 1:48 AM
> To: Sonal Goyal 
> Cc: em...@yeikel.com; user 
> Subject: Re: Is it possible to rate limit a UDF?
>
>
>
> Backpressure is the suggested way out here and is the correct approach: it
> rate limits at the source itself, for safety. Imagine a service with
> throttling enabled; it can outright reject your calls.
>
>
>
> Even if you split your df, that alone won't achieve your purpose; you can
> combine that with a backpressure-enabled API or with restricting by time.
>
>
>
> Here's an example using RxJava, if you don't want to use any streaming
> API:
>
> import java.util.concurrent.TimeUnit
> import scala.collection.JavaConverters._
> import io.reactivex.Flowable
> import io.reactivex.functions.BiFunction
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
>
> def main(args: Array[String]): Unit = {
>   val ss = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
>
>   import ss.sqlContext.implicits._
>
>   val df = ss.read.json("src/main/resources/person.json")
>   implicit val encoder = RowEncoder(df.schema)
>   val throttled = df.repartition(2).mapPartitions(it => {
>     val itF = Flowable.fromIterable[Row](it.toIterable.asJava)
>     // one tick per second; zip releases at most one row per tick
>     val delSt = Flowable.interval(1, TimeUnit.SECONDS)
>     Flowable.zip[java.lang.Long, Row, Row](delSt, itF, new BiFunction[java.lang.Long, Row, Row]() {
>       override def apply(t1: java.lang.Long, t2: Row): Row = {
>         // call api here
>         t2
>       }
>     }).toList.blockingGet().iterator().asScala
>   })
>   throttled.show()
> }
>
>
>
> On Wed, Jan 9, 2019 at 6:12 AM Sonal Goyal  wrote:
>
> Have you tried controlling the number of partitions of the dataframe? Say
> you have 5 partitions; that means you are making 5 concurrent calls to the
> web service. The throughput of the web service would be your bottleneck and
> Spark workers would be waiting for tasks, but if you can't control the REST
> service, maybe it's worth a shot.
>
>
> Thanks,
> Sonal
> Nube Technologies
> 
>
> On Wed, Jan 9, 2019 at 4:51 AM  wrote:
>
> I have a data frame for which I apply a UDF that calls a REST web
> service. This web service is distributed across only a few nodes and it won't
> be able to handle a massive load from Spark.
>
>
>
> Is it possible to rate limit this UDF? For example, something like 100
> ops/s.
>
>
>
> If not, what are the options? Is splitting the df an option?
>
>
>
> I’ve read a similar question in Stack overflow [1] and the solution
> suggests Spark Streaming , but my application does not involve streaming.
> Do I need to turn the operations into a streaming workflow to achieve
> something like that?
>
>
>
> Current Workflow : Hive -> Spark ->  Service
>
>
>
> Thank you
>
>
>
> [1]
> https://stackoverflow.com/questions/43953882/how-to-rate-limit-a-spark-map-operation
> 
>
>
>
>
> --
>
> Regards,
>
> Ramandeep Singh
>
> Blog:http://ramannanda.blogspot.com
>


-- 
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com


Re: State of datasource api v2

2019-01-14 Thread Arnaud LARROQUE
Hi Vladimir,

I've tried to do the same here when I attempted to write a Spark connector
for remote files. From my point of view, there have been a lot of changes in
the V2 API => better semantics, at least!

I understood that only continuous streaming uses DataSource V2 (not sure if
I'm correct), but file streaming falls back to the V1 datasource. The same is
true for plain file reading.

I would also be glad to have a status update on this.

Regards
Arnaud

On Mon, Jan 14, 2019 at 9:48 AM Vladimir Prus 
wrote:

> Hi,
>
> I am trying to understand the state of datasource v2, and I'm a bit lost.
> On one hand, it is supposed to be a more flexible approach, as described for
> example here:
>
> https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang
>
> On the other hand, it appears both the Parquet and ORC file readers are still
> not using the v2 interface. There's an umbrella issue to address that:
>
> https://issues.apache.org/jira/browse/SPARK-23507
>
> but it does not have any sub-issues to address Parquet, and the issue about
> ORC:
>
> https://issues.apache.org/jira/browse/SPARK-23817
>
> includes this text: "Not supported( due to limitation of data source V2):
> (1) Read multiple file path (2) Read bucketed file.".
>
> Is there any up-to-date information on whether datasource v2 will indeed
> become the primary datasource, whether the Parquet reader will be converted
> to V2, and whether the limitations above will be fixed?
>
> Thanks in advance,
>
> --
> Vladimir Prus
> http://vladimirprus.com
>


State of datasource api v2

2019-01-14 Thread Vladimir Prus
Hi,

I am trying to understand the state of datasource v2, and I'm a bit lost.
On one hand, it is supposed to be a more flexible approach, as described for
example here:

https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang

On the other hand, it appears both the Parquet and ORC file readers are still
not using the v2 interface. There's an umbrella issue to address that:

https://issues.apache.org/jira/browse/SPARK-23507

but it does not have any sub-issues to address Parquet, and the issue about
ORC:

https://issues.apache.org/jira/browse/SPARK-23817

includes this text: "Not supported( due to limitation of data source V2):
(1) Read multiple file path (2) Read bucketed file.".

Is there any up-to-date information on whether datasource v2 will indeed
become the primary datasource, whether the Parquet reader will be converted
to V2, and whether the limitations above will be fixed?

Thanks in advance,

-- 
Vladimir Prus
http://vladimirprus.com