Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-16 Thread Gabor Somogyi
+1 (non-binding)

I think it's good from a directional perspective.
Apache Flink has already been using this approach for quite some time in
production.
The overall conclusion is that it's a big gain :)

G


On Tue, Nov 14, 2023 at 6:42 PM L. C. Hsieh  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: An Official Kubernetes Operator for
> Apache Spark.
>
> The proposal is to develop an official Java-based Kubernetes operator
> for Apache Spark to automate the deployment and simplify the lifecycle
> management and orchestration of Spark applications and Spark clusters
> on k8s at prod scale.
>
> This aims to reduce the learning curve and operation overhead for
> Spark users so they can concentrate on core Spark logic.
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>- SPIP doc:
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
>
> Liang-Chi Hsieh
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-13 Thread Gabor Somogyi
Hi Jungtaek,

Good to hear that the new approach is working fine. +1 from my side.
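
For anyone who wants to switch explicitly before the default flips, it is a
one-line conf change. A minimal sketch (the SparkSession setup, broker address
and topic are placeholders; the config name is taken from the thread subject):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offset-fetching").getOrCreate()
// Opt in to the AdminClient-based offset fetching.
spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false")

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
  .option("subscribe", "topic")                     // placeholder topic
  .load()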

BR,
G


On Thu, Oct 13, 2022 at 4:12 AM Jungtaek Lim 
wrote:

> Hi all,
>
> I would like to propose flipping the default value of Kafka offset
> fetching config. The context is following:
>
> Before Spark 3.1, there was only one approach to fetching offsets: using
> consumer.poll(0). This has been pointed out as a root cause of hangs since
> there is no timeout for the metadata fetch.
>
> In Spark 3.1, we addressed this by introducing a new approach to fetching
> offsets, via SPARK-32032
> . Since the new
> approach leverages AdminClient and a consumer group is no longer needed for
> fetching offsets, the required security ACLs are loosened.
>
> Reference:
> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching
>
> There was some concern about the behavioral change to the security model, hence
> we couldn't make the new approach the default.
>
> In the meantime, we have observed various Kafka connector related issues
> which came from the old offset fetching (e.g. hangs, issues on rebalance of the
> consumer group, etc.) and we fixed many of these issues by simply flipping
> the config.
>
> Based on this, I would consider the default value "incorrect". The
> security-related behavioral change would inevitably be introduced (users can
> set a topic-based ACL rule), but most people will benefit. IMHO this is
> something we can deal with in the release/migration notes.
>
> Would like to hear the voices on this.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-05-03 Thread Gabor Somogyi
29.doCall(DistributedFileSystem.java:1576)
>> at 
>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
>> at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:65)
>> at org.apache.hadoop.fs.Globber.doGlob(Globber.java:270)
>> at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2067)
>> at 
>> org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:318)
>> at 
>> org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
>> at 
>> org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
>> at 
>> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
>> at 
>> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>> at 
>> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
>> at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
>> at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
>> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>> at 
>> org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
>> at 
>> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:364)
>> at scala.Option.map(Option.scala:230)
>> at 
>> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
>> at 
>> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
>> at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:165)
>> at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163)
>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>> at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163)
>> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>> at 
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>>
>>
>>
>> Again Thx for the input and advice regarding documentation and
>> apologies for putting the wrong error stack earlier.
>>
>> Regards
>> Pralabh Kumar
>> On Tue, May 3, 2022 at 7:39 PM Steve Loughran 
>> wrote:
>>
>>>
>>> Pralabh, did you follow the URL provided in the exception message? I put
>>> a lot of effort into improving the diagnostics, where the wiki articles
>>> are part of the troubleshooting process
>>> https://issues.apache.org/jira/browse/HADOOP-7469
>>>
>>> It's really disappointing when people escalate the problem to open
>>> source developers before trying to fix the problem themselves; in this
>>> case, read the error message.
>>>
>>> Now, if there is some k8s related issue which makes this more common,
>>> you are encouraged to update the wiki entry with a new cause. Documentation
>>> is an important contribution to open source projects, and if you have
>>> discovered a new way to recreate the failure, it would be welcome. Which
>>> reminds me, I have to add something on connection reset and Docker which
>>> comes down to "turn off http keepalive in maven builds"
>>>
>>> -Steve
>>>
>>>
>>>
>>>
>>>
>>> On Sat, 30 Apr 2022 at 10:45, Gabor Somogyi 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Please be aware that the ConnectionRefused exception has nothing to do
>>>> w/ authentication. See the description from the Hadoop wiki:
>>>> "You get a ConnectionRefused
>>>> <https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused> 
>>>> Exception
>>>> when there is a machine at the address specified, but there is no program
>>>> listening on the specific TCP port the client is using

Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-04-30 Thread Gabor Somogyi
Hi,

Please be aware that the ConnectionRefused exception has nothing to do w/
authentication. See the description from the Hadoop wiki:
"You get a ConnectionRefused

Exception
when there is a machine at the address specified, but there is no program
listening on the specific TCP port the client is using -and there is no
firewall in the way silently dropping TCP connection requests. If you do
not know what a TCP connection request is, please consult the specification
."

This means the namenode on host:port is not reachable at the TCP layer.
Maybe there are multiple issues but I'm pretty sure that something is wrong
in the K8s net config.
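
A quick way to confirm this from inside the driver pod is a plain TCP probe.
A minimal sketch (the host and port are placeholders; use the values from the
error message):

import java.net.{InetSocketAddress, Socket}

val host = "namenode.example.com" // placeholder namenode host
val port = 8020                   // placeholder namenode RPC port
val socket = new Socket()
try {
  // Fails with ConnectException if nothing is listening or a firewall drops the request.
  socket.connect(new InetSocketAddress(host, port), 5000)
  println(s"TCP connection to $host:$port succeeded")
} finally {
  socket.close()
}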

BR,
G


On Fri, Apr 29, 2022 at 6:23 PM Pralabh Kumar 
wrote:

> Hi dev Team
>
> SPARK-25355 added the functionality of the proxy user on K8s. However,
> the proxy user on K8s with Kerberized HDFS is not working. It is throwing
> the following exception:
>
> 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to
> the server : org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]
>
>
> Exception in thread "main" java.net.ConnectException: Call From
>  to  failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:  http:
> //wiki.apache.org/hadoop/ConnectionRefused
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
>
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
>
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1443)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1353)
>
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>
> at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>
> at
>
>
>
> On debugging deeper, we found the proxy user doesn't have access to
> delegation tokens in the case of K8s. SparkSubmit.submit explicitly creates
> the proxy user and this user doesn't have the delegation tokens.
>
>
> Please help me with the same.
>
>
> Regards
>
> Pralabh Kumar
>
>
>
>
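
Regarding the delegation token observation quoted above: a freshly created
proxy UGI carries no credentials unless they are explicitly propagated from
the real user. A minimal sketch of that idea against the Hadoop UGI API (the
proxied user name is a placeholder; this is an illustration, not the actual
SparkSubmit code):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

val realUser = UserGroupInformation.getCurrentUser()
val proxyUgi = UserGroupInformation.createProxyUser("proxied-user", realUser) // placeholder name
// Without this, calls inside doAs have no tokens and fail with "cannot authenticate via:[TOKEN, KERBEROS]".
proxyUgi.addCredentials(realUser.getCredentials())
proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // HDFS access on behalf of the proxied user would go here.
  }
})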


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-24 Thread Gabor Somogyi
I've had a quick talk with the Kafka guys to find out a little bit more and
the oversimplified conclusion is that if
the producer version is >= 3.0 and the broker version is < 0.11.0 with message format
version V1, then
either `enable.idempotence = false` is needed or a broker upgrade to 0.11.0+ is
required to make it work.
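
For the first workaround, the producer property can be passed through the
Spark Kafka sink with the usual "kafka." prefix. A minimal sketch (broker
address, topic and checkpoint path are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("idempotence-workaround").getOrCreate()

// Placeholder source; any streaming DataFrame with a string `value` column works.
val df = spark.readStream.format("rate").load()
  .select(col("value").cast("string").as("value"))

df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
  .option("kafka.enable.idempotence", "false")      // workaround for pre-0.11.0 brokers
  .option("topic", "output-topic")                  // placeholder topic
  .option("checkpointLocation", "/tmp/checkpoint")  // placeholder path
  .start()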

There are also ACL related changes where the error message makes it quite obvious
what needs to be done.

I'm not sure how useful the error message is in the mentioned broker issue
case. If the message is obvious then we may not need to add any special
explanation.

AFAIK users are not really using 0.11- brokers because that's an ancient
version so the problem surface may not be that significant.

G


On Wed, Mar 23, 2022 at 11:32 PM Mich Talebzadeh 
wrote:

> Maybe I misunderstood this explanation.
>
> Agreed. Spark relies on Kafka, Google Pub/Sub or any other messaging
> systems to process the related streaming data via topic or topics and
> present them to Spark. At this stage, Spark does not care to know how this
> streaming data is produced. Spark relies on the appropriate API to read
> data from Kafka or from Google Pub/Sub. For example if you are processing
> temperature, you construct a streaming dataframe that subscribes to a
> topic say temperature. As long as you have the correct jar files to
> interface with Kafka and that takes care of compatibility, this should
> work. In reality Kafka will be running on its own container(s) plus the
> zookeeper containers if any. So as far as I can ascertain, the
> discussion is about deploying the compatible APIs
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Mar 2022 at 20:12, Jungtaek Lim 
> wrote:
>
>> If it requires a Kafka broker update, we should not simply bump the
>> version of Kafka client. Probably we should at least provide separate
>> artifacts.
>>
>> We should not enforce the upgrade of other component just because we want
>> to upgrade the dependency. At least it should not happen in minor versions
>> of Spark. Hopefully that’s not a case.
>>
>> There’s a belief that Kafka client-broker compatibility is both backwards
>> and forwards. That is true in many cases (that’s what Kafka excels to), but
>> there seems to be the case it is not satisfied with specific configuration
>> and specific setup of Kafka broker. E.g KIP-679.
>>
>> The less compatible config is going to be turned on by default in 3.0 and will
>> not work correctly with a specific setup of Kafka broker. So it is us
>> who break things for that specific setup, and my point is how much
>> responsibility we should have to guide the end users to avoid the
>> frustration.
>>
>> 2022년 3월 23일 (수) 오후 9:41, Sean Owen 님이 작성:
>>
>>> Well, yes, but if it requires a Kafka server-side update, it does, and
>>> that is out of scope for us to document.
>>> It is important that we document if and how (if we know) the client
>>> update will impact existing Kafka installations (does it require a
>>> server-side update or not?), and document the change itself for sure along
>>> with any Spark-side migration notes.
>>>
>>> On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 The thing is, it is “us” who upgrades Kafka client and makes possible
 divergence between client and broker in end users’ production env.

 Someone can claim that end users can downgrade the kafka-client
 artifact when building their app so that the version can be matched, but we
 don’t test anything against downgrading kafka-client version for kafka
 connector. That sounds to me we defer our work to end users.

 It sounds to me “someone” should refer to us, and then it is no longer
 a matter of “help”. It is a matter of “responsibility”, as you said.

 2022년 3월 18일 (금) 오후 10:15, Sean Owen 님이 작성:

> I think we can assume that someone upgrading Kafka will be responsible
> for thinking through the breaking changes. We can help by listing anything
> we know could affect Spark-Kafka usage and calling those out in a release
> note, for sure. I don't think we need to get into items that would affect
> Kafka usage itself; focus on the connector-related issues.
>
> On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> CORRECTION: in option 2, we enumerate KIPs which may bring
>> incompatibility with older brokers (not all KIPs).
>>
>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>> 

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Gabor Somogyi
I've just read the related PR and it seems like the situation is not as black
and white as I've presumed purely from a tech point of view...

On Fri, 18 Mar 2022, 12:44 Gabor Somogyi,  wrote:

> Hi Jungtaek,
>
> I've taken a deeper look at the issue and here are my findings.
> As far as I'm concerned there are basically 2 ways with some minor
> decorations:
> * We care
> * We don't care
>
> I'm pretty sure users are clever enough, but setting the expectation that
> all users track Kafka KIPs one-by-one would be a hard requirement.
> This implies that I would vote for the "We care" option; the only question
> is how.
>
> Unless we have a specific reason for point 3 I wouldn't override default
> configs. The reason behind this is simple:
> Kafka has its strategic direction and going against it w/o a good reason is
> rarely a good idea (maybe we have such a reason, but then it should be spelled out).
>
> I think when a Kafka version upgrade happens we engineers take a look at
> whether all the changes in the new version
> are backward compatible or not, so point 2 makes sense to me. Honestly, I'm
> drinking coffee with some of the Kafka devs,
> so I've never ever read through all the KIPs between releases because
> they've told me what's important to check :)
>
> Seems like my Kafka Spark compatibility gist is out-of-date so maybe I
> need to invest some time to resurrect it:
> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>
> Hope my thoughts are helpful!
>
> BR,
> G
>
>
> On Fri, Mar 18, 2022 at 11:15 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> CORRECTION: in option 2, we enumerate KIPs which may bring
>> incompatibility with older brokers (not all KIPs).
>>
>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi dev,
>>>
>>> I would like to initiate the discussion about how to deal with the
>>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>>> 3.3.
>>>
>>> We didn't care much about the upgrade of Kafka dependency since our
>>> belief on Kafka client has been that the new Kafka client version should
>>> have no compatibility issues with older brokers. Based on semantic
>>> versioning, upgrading major versions rings an alarm for me.
>>>
>>> I haven't gone through changes that happened between versions, but found
>>> one KIP (KIP-679
>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>>> which may not work with older brokers with specific setup. (It's described
>>> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>>>
>>> This may not be problematic for the users who upgrade both client and
>>> broker altogether, but that is unlikely to be the case for end users of Spark.
>>> Computation engines are relatively easier to upgrade. Storage systems
>>> aren't. End users would think the components are independent.
>>>
>>> I looked through the notable changes in the Kafka doc, and it does
>>> mention this KIP, but it just says the default config has changed and
>>> doesn't mention the impacts. There is a link to the
>>> KIP; that said, everyone needs to read through the KIP wiki page for
>>> details.
>>>
>>> Based on the context, what would be the best way to notify end users about
>>> the major version upgrade of Kafka? I can imagine several options
>>> including...
>>>
>>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>>> the noticeable changes in the Kafka doc in the migration guide.
>>> 2. Do 1 & spend more effort to read through all KIPs and check
>>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>>> KIPs (or even summarize) in the migration guide.
>>> 3. Do 2 & actively override the default configs to be compatible with
>>> older versions if the change of the default configs in Kafka 3.0 is
>>> backward incompatible. End users should set these configs explicitly to
>>> override them back.
>>> 4. Do not care. End users can indicate the upgrade in the release note,
>>> and we expect end users to actively check the notable changes (& KIPs) from
>>> Kafka doc.
>>> 5. Options not described above...
>>>
>>> Please take a look and provide your voice on this.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> ps. Probably this would be applied to all non-bugfix versions of
>>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>>> for minor versions, though.
>>>
>>


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Gabor Somogyi
Hi Jungtaek,

I've taken a deeper look at the issue and here are my findings.
As far as I'm concerned there are basically 2 ways with some minor
decorations:
* We care
* We don't care

I'm pretty sure users are clever enough, but setting the expectation that
all users track Kafka KIPs one-by-one would be a hard requirement.
This implies that I would vote for the "We care" option; the only question is
how.

Unless we have a specific reason for point 3 I wouldn't override default
configs. The reason behind this is simple:
Kafka has its strategic direction and going against it w/o a good reason is
rarely a good idea (maybe we have such a reason, but then it should be spelled out).

I think when a Kafka version upgrade happens we engineers take a look at
whether all the changes in the new version
are backward compatible or not, so point 2 makes sense to me. Honestly, I'm
drinking coffee with some of the Kafka devs,
so I've never ever read through all the KIPs between releases because
they've told me what's important to check :)

Seems like my Kafka Spark compatibility gist is out-of-date so maybe I need
to invest some time to resurrect it:
https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9

Hope my thoughts are helpful!

BR,
G


On Fri, Mar 18, 2022 at 11:15 AM Jungtaek Lim 
wrote:

> CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility
> with older brokers (not all KIPs).
>
> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim 
> wrote:
>
>> Hi dev,
>>
>> I would like to initiate the discussion about how to deal with the
>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>> 3.3.
>>
>> We didn't care much about the upgrade of Kafka dependency since our
>> belief on Kafka client has been that the new Kafka client version should
>> have no compatibility issues with older brokers. Based on semantic
>> versioning, upgrading major versions rings an alarm for me.
>>
>> I haven't gone through changes that happened between versions, but found
>> one KIP (KIP-679
>> )
>> which may not work with older brokers with specific setup. (It's described
>> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>>
>> This may not be problematic for the users who upgrade both client and
>> broker altogether, but that is unlikely to be the case for end users of Spark.
>> Computation engines are relatively easier to upgrade. Storage systems
>> aren't. End users would think the components are independent.
>>
>> I looked through the notable changes in the Kafka doc, and it does
>> mention this KIP, but it just says the default config has changed and
>> doesn't mention the impacts. There is a link to the
>> KIP; that said, everyone needs to read through the KIP wiki page for
>> details.
>>
>> Based on the context, what would be the best way to notify end users about
>> the major version upgrade of Kafka? I can imagine several options
>> including...
>>
>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>> the noticeable changes in the Kafka doc in the migration guide.
>> 2. Do 1 & spend more effort to read through all KIPs and check
>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>> KIPs (or even summarize) in the migration guide.
>> 3. Do 2 & actively override the default configs to be compatible with
>> older versions if the change of the default configs in Kafka 3.0 is
>> backward incompatible. End users should set these configs explicitly to
>> override them back.
>> 4. Do not care. End users can indicate the upgrade in the release note,
>> and we expect end users to actively check the notable changes (& KIPs) from
>> Kafka doc.
>> 5. Options not described above...
>>
>> Please take a look and provide your voice on this.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. Probably this would be applied to all non-bugfix versions of
>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>> for minor versions, though.
>>
>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Gabor Somogyi
Good to hear and great work Hyukjin! 

On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
wrote:

> Thanks Hyukjin for driving the huge release, and thanks everyone for
> contributing the release!
>
> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>
>> Great work, Hyukjin !
>>
>> Bests,
>> Angers
>>
>> Wenchen Fan  于2021年3月3日周三 下午5:02写道:
>>
>>> Great work and congrats!
>>>
>>> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>>>
 Congrats, all!

 Bests,
 *Kent Yao *
 @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
 *a spark enthusiast*
 *kyuubi is a
 unified multi-tenant JDBC interface for large-scale data processing and
 analytics, built on top of Apache Spark .*
 *spark-authorizer A Spark
 SQL extension which provides SQL Standard Authorization for **Apache
 Spark .*
 *spark-postgres  A library
 for reading data from and transferring data to Postgres / Greenplum with
 Spark SQL and DataFrames, 10~100x faster.*
 *spark-func-extras A
 library that brings excellent and useful functions from various modern
 database management systems to Apache Spark .*



 On 03/3/2021 15:11,Takeshi Yamamuro
  wrote:

 Great work and Congrats, all!

 Bests,
 Takeshi

 On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
 wrote:

>
> Thanks Hyukjin and congratulations everyone on the release !
>
> Regards,
> Mridul
>
> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>
>> Great work, Hyukjin!
>>
>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
>> wrote:
>>
>>> We are excited to announce Spark 3.1.1 today.
>>>
>>> Apache Spark 3.1.1 is the second release of the 3.x line. This
>>> release adds
>>> Python type annotations and Python dependency management support as
>>> part of Project Zen.
>>> Other major updates include improved ANSI SQL compliance support,
>>> history server support
>>> in structured streaming, the general availability (GA) of Kubernetes
>>> and node decommissioning
>>> in Kubernetes and Standalone. In addition, this release continues to
>>> focus on usability, stability,
>>> and polish while resolving around 1500 tickets.
>>>
>>> We'd like to thank our contributors and users for their
>>> contributions and early feedback to
>>> this release. This release would not have been possible without you.
>>>
>>> To download Spark 3.1.1, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>
>>>

 --
 ---
 Takeshi Yamamuro




Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-24 Thread Gabor Somogyi
+1 (non-binding)

Tested my added security-related features, found an issue but not a blocker.

On Wed, 24 Feb 2021, 09:47 Hyukjin Kwon,  wrote:

> I remember HiveExternalCatalogVersionsSuite was flaky for a while which
> is fixed in
> https://github.com/apache/spark/commit/0d5d248bdc4cdc71627162a3d20c42ad19f24ef4
> and .. KafkaDelegationTokenSuite is flaky (
> https://issues.apache.org/jira/browse/SPARK-31250).
>
> 2021년 2월 24일 (수) 오후 5:19, Mridul Muralidharan 님이 작성:
>
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
>> -Phive-thriftserver -Pmesos -Pkubernetes
>>
>> I keep getting test failures with
>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
>> * org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.
>> (Note: I remove $HOME/.m2 and $HOME/.ivy2 paths before build)
>>
>> Removing these suites gets the build through though - does anyone have
>> suggestions on how to fix it ? I did not face this with RC1.
>>
>> Regards,
>> Mridul
>>
>>
>> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.1.1.
>>>
>>> The vote is open until February 24th 11PM PST and passes if a majority
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc3 (commit
>>> 1d550c4e90275ab418b9161925049239227f3dc9):
>>> https://github.com/apache/spark/tree/v3.1.1-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> 
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc3.
>>>
>>> FAQ
>>>
>>> ===
>>> What happened to 3.1.0?
>>> ===
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>> it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>> more details.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>>> "
>>> and see if anything important breaks.
>>> In the Java/Scala, you can add the staging repository to your projects
>>> resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.1.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.1.1
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>


Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Gabor Somogyi
+1 for adding it either way.
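
For context, the state store implementation is already pluggable through a SQL
conf, so an external module could presumably be wired in the same way. A
sketch (the conf key exists today; the provider class name is hypothetical for
the proposed module):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rocksdb-statestore-sketch").getOrCreate()
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider") // hypothetical class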

On Mon, 8 Feb 2021, 21:54 Holden Karau,  wrote:

> +1 for an external module.
>
> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su  wrote:
>
>> +1 for (2) adding to external module.
>>
>> I think this feature is useful and popular in practice, and option 2 does
>> not conflict with the previous concern about the dependency.
>>
>>
>>
>> Thanks,
>>
>> Cheng Su
>>
>>
>>
>> *From: *Dongjoon Hyun 
>> *Date: *Monday, February 8, 2021 at 10:39 AM
>> *To: *Jacek Laskowski 
>> *Cc: *Liang-Chi Hsieh , dev 
>> *Subject: *Re: [DISCUSS] Add RocksDB StateStore
>>
>>
>>
>> Thank you, Liang-chi and all.
>>
>>
>>
>> +1 for (2) external module design because it can deliver the new feature
>> in a safe way.
>>
>>
>>
>> Bests,
>>
>> Dongjoon
>>
>>
>>
>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski  wrote:
>>
>> Hi,
>>
>>
>>
>> I'm "okay to add RocksDB StateStore as external module". See no reason
>> not to.
>>
>>
>> Pozdrawiam,
>>
>> Jacek Laskowski
>>
>> 
>>
>> https://about.me/JacekLaskowski
>>
>> "The Internals Of" Online Books 
>>
>> Follow me on https://twitter.com/jaceklaskowski
>>
>>
>> 
>>
>>
>>
>>
>>
>> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh  wrote:
>>
>> Hi devs,
>>
>> In Spark Structured Streaming, we need a state store for state management
>> for
>> stateful operators such as streaming aggregates, joins, etc. We have one and
>> only one state store implementation now. It is an in-memory hashmap which is
>> backed up to an HDFS-compliant file system at the end of every micro-batch.
>>
>> As it basically uses an in-memory map to store states, memory consumption is
>> a
>> serious issue and the state store size is limited by the size of the executor
>> memory. Moreover, the state store using more memory means it may impact the
>> performance of task execution, which requires memory too.
>>
>> Internally we see more streaming applications that require large state in
>> stateful operations. For such requirements, we need a StateStore that does not
>> rely on
>> memory to store states.
>>
>> This seems to be also true externally as several other major streaming
>> frameworks already use RocksDB for state management. RocksDB is an
>> embedded
>> DB and streaming engines can use it to store state instead of memory
>> storage.
>>
>> So it seems to me it is proven to be a good choice for large state usage. But
>> Spark SS still lacks a built-in state store for this requirement.
>>
>> Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore
>> into
>> Spark SS. IIUC, it was pushed back due to two concerns: extra code
>> maintenance cost and the RocksDB dependency it introduces.
>>
>> For the first concern, as more users want to use the feature, it should
>> be highly used code in SS and more developers will look at it. For the second
>> one, we propose (SPARK-34198) to add it as an external module to relieve
>> the
>> dependency concern.
>>
>> Because it was pushed back previously, I'm going to raise this discussion
>> to
>> know what people think about it now, in advance of submitting any code.
>>
>> I think there might be some possible opinions:
>>
>> 1. okay to add RocksDB StateStore into sql core module
>> 2. not okay for 1, but okay to add RocksDB StateStore as external module
>> 3. either 1 or 2 is okay
>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>> external module
>>
>> Please let us know if you have some thoughts.
>>
>> Thank you.
>>
>> Liang-Chi Hsieh
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Gabor Somogyi
Hi Jungtaek,

All I see at the moment is that most of the users choose Flink over Spark
when continuous processing is needed.
Unless there is a revolution in this area there is no point in keeping up the
maintenance. 2.5 years is a lot in the big data industry.
If there will be efforts in this area then I'm happy to join and push this
forward...

BR,
G


On Tue, Sep 15, 2020 at 6:34 AM Jungtaek Lim 
wrote:

> Hi devs,
>
> It was Spark 2.3 in Feb 2018 which introduced continuous mode in
> Structured Streaming as "experimental".
>
> Now we are here at 2.5 years after its release - I feel it would be a good
> time to evaluate the mode, whether the mode has been widely used or not,
> and the mode has been making progress, as the mode is "experimental".
>
> At least from the surface I don't see any active effort for continuous
> mode around the community - the last major effort was stateful operation
> which was incomplete and I removed that. There were a couple of bug
> reports as well as fixes more than a year ago and almost nothing has been
> handled. (A trivial bugfix PR has been merged recently but that's all.) The
> new features introduced to the Structured Streaming (at least observable
> metrics, SS UI) don't apply to continuous mode, and no one made "support
> continuous mode" as a hard requirement on passing review in these PRs.
>
> I have no idea how many companies are using the mode in production (please
> add the voice if someone has statistics about this) but I don't see any bug
> reports recently, and see only a few questions in SO, which makes me think
> about cost on maintenance.
>
> I know there's a mood to avoid discontinue support as possible, but it
> sounds weird to keep something as "unmaintained", especially it's still
> "experimental" and main authors are no more active enough to promise
> maintenance/improvement on the module. Thoughts?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-07-01 Thread Gabor Somogyi
Hi Dongjoon,

I would add JDBC Kerberos support w/ keytab:
https://issues.apache.org/jira/browse/SPARK-12312
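
A sketch of the intended user-facing shape (the option names follow the
proposal in the JIRA and may change while the work is in progress; URL, table,
keytab path and principal are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-kerberos-sketch").getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/mydb") // placeholder URL
  .option("dbtable", "my_table")                               // placeholder table
  .option("keytab", "/etc/security/keytabs/user.keytab")       // placeholder keytab path
  .option("principal", "user@EXAMPLE.COM")                     // placeholder principal
  .load()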

BR,
G


On Mon, Jun 29, 2020 at 6:07 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> After a short celebration of Apache Spark 3.0, I'd like to ask you the
> community opinion on Apache Spark 3.1 feature expectations.
>
> First of all, Apache Spark 3.1 is scheduled for December 2020.
> - https://spark.apache.org/versioning-policy.html
>
> I'm expecting the following items:
>
> 1. Support Scala 2.13
> 2. Use Apache Hadoop 3.2 by default for better cloud support
> 3. Declaring Kubernetes Scheduler GA
> In my perspective, the last main missing piece was Dynamic allocation
> and
> - Dynamic allocation with shuffle tracking is already shipped at 3.0.
> - Dynamic allocation with worker decommission/data migration is
> targeting 3.1. (Thanks, Holden)
> 4. DSv2 Stabilization
>
> I'm aware of some more features which are on the way currently, but I love
> to hear the opinions from the main developers and more over the main users
> who need those features.
>
> Thank you in advance. Welcome for any comments.
>
> Bests,
> Dongjoon.
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-12 Thread Gabor Somogyi
+1 for having that consistent rule in test names.
+1 for making it a guideline.
+1 for defining exact guidelines in general.

Until now I've followed the alternative (only add the prefix when the
JIRA's type is bug) and that way I knew that such tests contain edge cases.
In the case of new features I'm pretty sure there is a reason to introduce it,
but at the moment I can't imagine a use-case where it can help us (still, I want
to make it a daily routine).

> This is helpful when the test cases are moved to a different file.
The test can be found by name even without the JIRA ID.
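
For illustration, the suggested format in a ScalaTest suite would look like
this (a minimal sketch; the suite and JIRA ID are hypothetical):

import org.scalatest.funsuite.AnyFunSuite

class ExampleSuite extends AnyFunSuite {
  // "SPARK-XXXXX: test name" format, prefix added because the PR adds a couple of tests.
  test("SPARK-12345: reproduce the edge case fixed by the JIRA") {
    assert(1 + 1 === 2)
  }
}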


On Tue, Nov 12, 2019 at 5:31 AM Hyukjin Kwon  wrote:

> In a few days, I will write this in our guidelines, probably after rewording
> it a bit better:
>
> 1. Add a prefix into a test name when a PR adds a couple of tests.
> 2. Uses "SPARK-: test name" format.
>
> Please let me know if you have any different opinion about what/when to
> write the JIRA ID as the prefix.
> I would like to make sure this simple rule is closer to the actual
> practice from you guys.
>
>
> 2019년 11월 12일 (화) 오전 8:41, Gengliang 님이 작성:
>
>> +1 for making it a guideline. This is helpful when the test cases are
>> moved to a different file.
>>
>> On Mon, Nov 11, 2019 at 3:23 PM Takeshi Yamamuro 
>> wrote:
>>
>>> +1 for having that consistent rule in test names.
>>> This is a trivial problem though, I think documenting this rule in the
>>> contribution guide
>>> might be able to make reviewer overhead a little smaller.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 Maybe it's not a big deal but it has brought some confusion from time to time
 into Spark dev and the community. I think it's time to discuss when and in which
 format to add a JIRA ID as a prefix for the test case name in Scala test
 cases.

 Currently we have many test case names with prefixes as below:

- test("SPARK-X blah blah")
- test("SPARK-X: blah blah")
- test("SPARK-X - blah blah")
- test("[SPARK-X] blah blah")
- …

 It is a good practice to have the JIRA ID in general because, for
 instance,
 it takes less effort to track commit histories (or even when
 the files
 are totally moved), or to track related information about failed tests.
 Considering Spark's getting big, I think it's good to document this.

 I would like to suggest this and document it in our guideline:

 1. Add a prefix into a test name when a PR adds a couple of tests.
 2. Uses "SPARK-: test name" format which is used in our code base
 most
   often[1].

 We should make it simple and clear but closer to the actual practice.
 So, I would like to listen to what other people think. I would appreciate
 if you guys give some feedback about when to add the JIRA prefix. One
 alternative is that, we only add the prefix when the JIRA's type is bug.

 [1]
 git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
  923
 git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
  477
 git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
   16
 git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
   13




>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: FYI - filed bunch of issues for flaky tests in recent CI builds

2019-09-18 Thread Gabor Somogyi
Had a look at the Kafka test (SPARK-29136) and commented.

BR,
G


On Wed, Sep 18, 2019 at 7:54 AM Jungtaek Lim  wrote:

> Hi devs,
>
> I've found bunch of test failures (intermittently) in both CI build for
> master branch as well as PR builder (only checked for mine) yesterday. I
> just filed issues which I observed, but I guess there's more as I only
> checked my PR.
>
> https://issues.apache.org/jira/browse/SPARK-29129
> https://issues.apache.org/jira/browse/SPARK-29130
> https://issues.apache.org/jira/browse/SPARK-29131
> https://issues.apache.org/jira/browse/SPARK-29132
> https://issues.apache.org/jira/browse/SPARK-29133
> https://issues.apache.org/jira/browse/SPARK-29134
> https://issues.apache.org/jira/browse/SPARK-29135
> https://issues.apache.org/jira/browse/SPARK-29136
> https://issues.apache.org/jira/browse/SPARK-29137
> https://issues.apache.org/jira/browse/SPARK-29138
> https://issues.apache.org/jira/browse/SPARK-29139
> https://issues.apache.org/jira/browse/SPARK-29140
>
> Other than that, there're another lots of failures with below message:
>
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException:
>> Cannot call methods on a stopped SparkContext.
>
>
> Even some of them above might be affected as well.
>
> I couldn't check whether these issues have been resolved (not for PR
> builder as these failures were yesterday, but for master build) so any
> helps are appreciated.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


Re: Welcoming some new committers and PMC members

2019-09-10 Thread Gabor Somogyi
Congrats Guys!

G


On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers and one PMC
> member. Join me in welcoming them to their new roles!
>
> New PMC member: Dongjoon Hyun
>
> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
> Weichen Xu, Ruifeng Zheng
>
> The new committers cover lots of important areas including ML, SQL, and
> data sources, so it’s great to have them here. All the best,
>
> Matei and the Spark PMC
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?

2019-08-26 Thread Gabor Somogyi
OK, starting with this tomorrow...

On Mon, 26 Aug 2019, 16:05 Jungtaek Lim,  wrote:

> Thanks! The patch is here: https://github.com/apache/spark/pull/25583
>
> On Mon, Aug 26, 2019 at 11:02 PM Gabor Somogyi 
> wrote:
>
>> Just checked this and it's a copy-paste :) It works properly when
>> KafkaSourceInitialOffsetWriter is used. Pull me in if a review is needed.
>>
>> BR,
>> G
>>
>>
>> On Mon, Aug 26, 2019 at 3:57 PM Jungtaek Lim  wrote:
>>
>>> Nice finding! I don't see any reason to not use
>>> KafkaSourceInitialOffsetWriter from KafkaSource, as they're identical. I
>>> guess it was copied and pasted sometime before and not addressed yet.
>>> As you haven't submitted a patch, I'll submit a patch shortly, with
>>> mentioning credit. I'd close mine and wait for your patch if you plan to do
>>> it. Please let me know.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Mon, Aug 26, 2019 at 8:03 PM Jacek Laskowski  wrote:
>>>
>>>> Hi,
>>>>
>>>> Just found out that KafkaSource [1] does not
>>>> use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for
>>>> initial offsets.
>>>>
>>>> Any reason for that? Should I report an issue? Just checking out as I'm
>>>> with 2.4.3 exclusively and have no idea what's coming for 3.0.
>>>>
>>>> [1]
>>>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102
>>>>
>>>> [2]
>>>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281
>>>>
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> 
>>>> https://about.me/JacekLaskowski
>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>> The Internals of Spark Structured Streaming
>>>> https://bit.ly/spark-structured-streaming
>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>> Follow me at https://twitter.com/jaceklaskowski
>>>>
>>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


Re: [SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?

2019-08-26 Thread Gabor Somogyi
Just checked this and it's a copy-paste :) It works properly when
KafkaSourceInitialOffsetWriter is used. Pull me in if a review is needed.

BR,
G


On Mon, Aug 26, 2019 at 3:57 PM Jungtaek Lim  wrote:

> Nice finding! I don't see any reason to not use
> KafkaSourceInitialOffsetWriter from KafkaSource, as they're identical. I
> guess it was copied and pasted sometime before and not addressed yet.
> As you haven't submitted a patch, I'll submit a patch shortly, with
> mentioning credit. I'd close mine and wait for your patch if you plan to do
> it. Please let me know.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Aug 26, 2019 at 8:03 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Just found out that KafkaSource [1] does not
>> use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for
>> initial offsets.
>>
>> Any reason for that? Should I report an issue? Just checking out as I'm
>> with 2.4.3 exclusively and have no idea what's coming for 3.0.
>>
>> [1]
>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102
>>
>> [2]
>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>> The Internals of Spark Structured Streaming
>> https://bit.ly/spark-structured-streaming
>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


Re: JDBC connector for DataSourceV2

2019-07-15 Thread Gabor Somogyi
I've had a look at the JIRAs and it seems like the intention is the same
(correct me if I'm wrong).
I think one is enough and the rest can be closed as duplicates.
We should keep multiple JIRAs only when the intention is different.

BR,
G


On Mon, Jul 15, 2019 at 6:01 AM Xianyin Xin 
wrote:

> There’s another PR, https://github.com/apache/spark/pull/21861, but it
> is based on the old V2 APIs.
>
>
>
> We’d better link the JIRAs, SPARK-24907
> <https://issues.apache.org/jira/browse/SPARK-24907>, SPARK-25547
> <https://issues.apache.org/jira/browse/SPARK-25547>, and SPARK-28380
> <https://issues.apache.org/jira/browse/SPARK-28380> and finalize a plan.
>
>
>
> Xianyin
>
>
>
> *From: *Shiv Prashant Sood 
> *Date: *Sunday, July 14, 2019 at 2:59 AM
> *To: *Gabor Somogyi 
> *Cc: *Xianyin Xin , Ryan Blue <
> rb...@netflix.com>, , Spark Dev List <
> dev@spark.apache.org>
> *Subject: *Re: JDBC connector for DataSourceV2
>
>
>
> To me this looks like refactoring of DS1 JDBC to enable user provided
> connection factories. In itself a good change, but IMO not DSV2 related.
>
>
>
> I created a JIRA and added some goals. Please comments/add as relevant.
>
>
>
> https://issues.apache.org/jira/browse/SPARK-28380
>
>
>
> JIRA for DataSourceV2 API based JDBC connector.
>
> Goals :
>
>- Generic connector based on JDBC that supports all databases (min bar
>is support for all V1 data bases).
>- Reference implementation and Interface for any specialized JDBC
>connectors.
>
>
>
> Regards,
>
> Shiv
>
>
>
> On Sat, Jul 13, 2019 at 2:17 AM Gabor Somogyi 
> wrote:
>
> Hi Guys,
>
>
>
> Don't know what the intention is exactly here, but there is such a PR:
> https://github.com/apache/spark/pull/22560
>
> If that's what we need maybe we can resurrect it. BTW, I'm also interested
> in...
>
>
>
> BR,
>
> G
>
>
>
>
>
> On Sat, Jul 13, 2019 at 4:09 AM Shiv Prashant Sood 
> wrote:
>
> Thanks all. I can also contribute toward this effort.
>
>
>
> Regards,
>
> Shiv
>
> Sent from my iPhone
>
>
> On Jul 12, 2019, at 6:51 PM, Xianyin Xin 
> wrote:
>
> If there’s nobody working on that, I’d like to contribute.
>
>
>
> Loop in @Gengliang Wang.
>
>
>
> Xianyin
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *
> *Date: *Saturday, July 13, 2019 at 6:54 AM
> *To: *Shiv Prashant Sood 
> *Cc: *Spark Dev List 
> *Subject: *Re: JDBC connector for DataSourceV2
>
>
>
> I'm not aware of a JDBC connector effort. It would be great to have
> someone build one!
>
>
>
> On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood 
> wrote:
>
> Can someone please help understand the current Status of DataSource V2
> based JDBC connector? I see connectors for various file formats in Master,
> but can't find a JDBC implementation or related JIRA.
>
>
>
> DatasourceV2 APIs to me look in good shape to attempt a JDBC connector for
> READ/WRITE path.
>
> Thanks & Regards,
>
> Shiv
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>


Re: JDBC connector for DataSourceV2

2019-07-13 Thread Gabor Somogyi
Hi Guys,

Don't know what the intention is exactly here, but there is such a PR:
https://github.com/apache/spark/pull/22560
If that's what we need maybe we can resurrect it. BTW, I'm also interested
in...

BR,
G


On Sat, Jul 13, 2019 at 4:09 AM Shiv Prashant Sood 
wrote:

> Thanks all. I can also contribute toward this effort.
>
> Regards,
> Shiv
>
> Sent from my iPhone
>
> On Jul 12, 2019, at 6:51 PM, Xianyin Xin 
> wrote:
>
> If there’s nobody working on that, I’d like to contribute.
>
>
>
> Loop in @Gengliang Wang.
>
>
>
> Xianyin
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *
> *Date: *Saturday, July 13, 2019 at 6:54 AM
> *To: *Shiv Prashant Sood 
> *Cc: *Spark Dev List 
> *Subject: *Re: JDBC connector for DataSourceV2
>
>
>
> I'm not aware of a JDBC connector effort. It would be great to have
> someone build one!
>
>
>
> On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood 
> wrote:
>
> Can someone please help understand the current Status of DataSource V2
> based JDBC connector? I see connectors for various file formats in Master,
> but can't find a JDBC implementation or related JIRA.
>
>
>
> DatasourceV2 APIs to me look in good shape to attempt a JDBC connector for
> READ/WRITE path.
>
> Thanks & Regards,
>
> Shiv
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>


Re: DSv1 removal

2019-06-21 Thread Gabor Somogyi
Hi Ryan,

Thanks for the explanation! This sheds light on some areas but also triggered
some questions.

The main conclusion to me on the Kafka connector side is to keep v1 as the
default, give users some time to migrate to v2, and later delete v1 when
v2 is stable (which makes sense from my perspective).

The interesting part is that the Kafka microbatch source already uses v2 as the
default, which I don't fully understand how to fit into this.
Please see this test:
https://github.com/apache/spark/blob/54da3bbfb2c936827897c52ed6e5f0f428b98e9f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L1084
Since https://issues.apache.org/jira/browse/SPARK-23362 was merged into 2.4 it
shouldn't be breaking (I assume the batch part should be similar).
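
For the record, which implementation a built-in source resolves to is governed
by an internal source list conf; a sketch (the conf name and the default list
below are taken from recent master and are assumptions here):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("v1-v2-switch-sketch").getOrCreate()
// Sources listed here keep their DSv1 implementation; removing "kafka" would
// make resolution pick the DSv2 one.
spark.conf.set("spark.sql.sources.useV1SourceList", "avro,csv,json,kafka,orc,parquet,text")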

We can continue the discussion about the Kafka batch v1/v2 default on
https://github.com/apache/spark/pull/24738 so we don't bombard everybody.

Please send me an invite to the sync meeting. Not sure when exactly that
happens but I presume it's at night from a CET timezone perspective.
I'll try to organize my time to participate...

BR,
G


On Thu, Jun 20, 2019 at 8:24 PM Ryan Blue  wrote:

> Hi Gabor,
>
> First, a little context... one of the goals of DSv2 is to standardize the
> behavior of SQL operations in Spark. For example, running CTAS when a table
> exists will fail, not take some action depending on what the source
> chooses, like drop & CTAS, inserting, or failing.
>
> Unfortunately, this means that DSv1 can't be easily replaced because it
> has behavior differences between sources. In addition, we're not really
> sure how DSv1 works in all cases -- it really depends on what seemed
> reasonable to authors at the time. For example, we don't have a good
> understanding of how file-based tables behave (those not backed by a
> Metastore). There are also changes that we know are breaking and are okay
> with, like only inserting safe casts when writing with v2.
>
> Because of this, we can't just replace v1 with v2 transparently, so the
> plan is to allow deployments to migrate to v2 in stages. Here's the plan:
> 1. Use v1 by default so all existing queries work as they do today for
> identifiers like `db.table`
> 2. Allow users to add additional v2 catalogs that will be used when
> identifiers specifically start with one, like `test_catalog.db.table`
> 3. Add a v2 catalog that delegates to the session catalog, so that v2
> read/write implementations can be used, but are stored just like v1 tables
> in the session catalog
> 4. Add a setting to use a v2 catalog as the default. Setting this would
> use a v2 catalog for all identifiers without a catalog, like `db.table`
> 5. Add a way for a v2 catalog to return a table that gets converted to v1.
> This is what `CatalogTableAsV2` does in #24768
> <https://github.com/apache/spark/pull/24768>.
>
> PR #24768 <https://github.com/apache/spark/pull/24768> implements the
> rest of these changes. Specifically, we initially used the default catalog
> for v2 sources, but that causes namespace problems, so we need the v2
> session catalog (point #3) as the default when there is no default v2
> catalog.
>
> I hope that answers your question. If not, I'm happy to answer follow-ups
> and we can add this as a topic in the next v2 sync on Wednesday. I'm also
> planning on talking about metadata columns or function push-down from the
> Kafka v2 PR at that sync, so you may want to attend.
>
> rb
>
>
> On Thu, Jun 20, 2019 at 4:45 AM Gabor Somogyi 
> wrote:
>
>> Hi All,
>>
>>   I've taken a look at the code and docs to find out when DSv1 sources
>> have to be removed (in case a DSv2 replacement is implemented). After some
>> digging I've found DSv1 sources which are already removed, but in some cases
>> v1 and v2 still exist in parallel.
>>
>> Can somebody please tell me what's the overall plan in this area?
>>
>> BR,
>> G
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


DSv1 removal

2019-06-20 Thread Gabor Somogyi
Hi All,

  I've taken a look at the code and docs to find out when DSv1 sources have
to be removed (in case a DSv2 replacement is implemented). After some
digging I've found DSv1 sources which are already removed, but in some cases
v1 and v2 still exist in parallel.

Can somebody please tell me what's the overall plan in this area?

BR,
G


Re: Exposing JIRA issue types at GitHub PRs

2019-06-19 Thread Gabor Somogyi
Does the label get merged into the commit? Somehow it would still be good to
see the component in the msg.

G

On Wed, Jun 19, 2019 at 11:09 AM Gengliang Wang  wrote:

> Hi Dongjoon,
>
> +1 with the nice work.
> Quick question: if the github_jira_sync script is fully automated, should
> contributors skip adding the duplicated labels in new PR titles?
>
>
> On Jun 17, 2019, at 4:21 PM, Gabor Somogyi 
> wrote:
>
> Dongjoon, I think it's useful. Thanks for adding it!
>
> On Mon, Jun 17, 2019 at 8:05 AM Dongjoon Hyun 
> wrote:
>
>> Thank you, Hyukjin !
>>
>> On Sun, Jun 16, 2019 at 4:12 PM Hyukjin Kwon  wrote:
>>
>>> Labels look good and useful.
>>>
>>> On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun, 
>>> wrote:
>>>
>>>> Now, you can see the exposed component labels (ordered by the number of
>>>> PRs) here and click the component to search.
>>>>
>>>> https://github.com/apache/spark/labels?sort=count-desc
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> JIRA and PR is ready for reviews.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA
>>>>> issue component types at GitHub PRs)
>>>>> https://github.com/apache/spark/pull/24871
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun <
>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>>>>>>
>>>>>> Sure, we can do whatever we want.
>>>>>>
>>>>>> I'll wait for more feedbacks and proceed to the next steps.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dongjoon,
>>>>>>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>>>>>>> component too and to some jira labels such as correctness which may be
>>>>>>> worth to highlight in PRs too. My only concern is that in many cases 
>>>>>>> JIRAs
>>>>>>> are created not very carefully so they may be incorrect at the moment of
>>>>>>> the pr creation and it may be updated later: so keeping them in sync 
>>>>>>> may be
>>>>>>> an extra effort..
>>>>>>>
>>>>>>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>>>>>>
>>>>>>>> Seems like a good idea. Can we test this with a component first?
>>>>>>>>
>>>>>>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun <
>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, All.
>>>>>>>>>
>>>>>>>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>>>>>>>> contributions, we have lots of JIRAs and PRs consequently. One 
>>>>>>>>> specific
>>>>>>>>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>>>>>>>>
>>>>>>>>> How about exposing JIRA issue types at GitHub PRs as GitHub
>>>>>>>>> `Labels`? There are two main benefits:
>>>>>>>>> 1. It helps the communication between the contributors and
>>>>>>>>> reviewers with more information.
>>>>>>>>> (In some cases, some people only visit GitHub to see the PR
>>>>>>>>> and commits)
>>>>>>>>> 2. `Labels` is searchable. We don't need to visit Apache Jira to
>>>>>>>>> search PRs to see a specific type.
>>>>>>>>> (For example, the reviewers can see and review 'BUG' PRs first
>>>>>>>>> by using `is:open is:pr label:BUG`.)
>>>>>>>>>
>>>>>>>>> Of course, this can be done automatically without human
>>>>>>>>> intervention. Since we already have GitHub Jenkins job to access
>>>>>>>>> JIRA/GitHub, that job can add the labels from the beginning. If 
>>>>>>>>> needed, I
>>>>>>>>> can volunteer to update the script.
>>>>>>>>>
>>>>>>>>> To show the demo, I labeled several PRs manually. You can see the
>>>>>>>>> result right now in Apache Spark PR page.
>>>>>>>>>
>>>>>>>>>   - https://github.com/apache/spark/pulls
>>>>>>>>>
>>>>>>>>> If you're surprised due to those manual activities, I want to
>>>>>>>>> apologize for that. I hope we can take advantage of the existing 
>>>>>>>>> GitHub
>>>>>>>>> features to serve Apache Spark community in a way better than 
>>>>>>>>> yesterday.
>>>>>>>>>
>>>>>>>>> How do you think about this specific suggestion?
>>>>>>>>>
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon
>>>>>>>>>
>>>>>>>>> PS. I saw that `Request Review` and `Assign` features are already
>>>>>>>>> used for some purposes, but these feature are out of the scope in this
>>>>>>>>> email.
>>>>>>>>>
>>>>>>>>
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-17 Thread Gabor Somogyi
Dongjoon, I think it's useful. Thanks for adding it!

On Mon, Jun 17, 2019 at 8:05 AM Dongjoon Hyun 
wrote:

> Thank you, Hyukjin !
>
> On Sun, Jun 16, 2019 at 4:12 PM Hyukjin Kwon  wrote:
>
>> Labels look good and useful.
>>
>> On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun, 
>> wrote:
>>
>>> Now, you can see the exposed component labels (ordered by the number of
>>> PRs) here and click the component to search.
>>>
>>> https://github.com/apache/spark/labels?sort=count-desc
>>>
>>> Dongjoon.
>>>
>>>
>>> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 JIRA and PR is ready for reviews.

 https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
 component types at GitHub PRs)
 https://github.com/apache/spark/pull/24871

 Bests,
 Dongjoon.


 On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
 wrote:

> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>
> Sure, we can do whatever we want.
>
> I'll wait for more feedbacks and proceed to the next steps.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
> wrote:
>
>> Hi Dongjoon,
>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>> component too and to some jira labels such as correctness which may be
>> worth to highlight in PRs too. My only concern is that in many cases 
>> JIRAs
>> are created not very carefully so they may be incorrect at the moment of
>> the pr creation and it may be updated later: so keeping them in sync may 
>> be
>> an extra effort..
>>
>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>
>>> Seems like a good idea. Can we test this with a component first?
>>>
>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, All.

 Since we use both Apache JIRA and GitHub actively for Apache Spark
 contributions, we have lots of JIRAs and PRs consequently. One specific
 thing I've been longing to see is `Jira Issue Type` in GitHub.

 How about exposing JIRA issue types at GitHub PRs as GitHub
 `Labels`? There are two main benefits:
 1. It helps the communication between the contributors and
 reviewers with more information.
 (In some cases, some people only visit GitHub to see the PR and
 commits)
 2. `Labels` is searchable. We don't need to visit Apache Jira to
 search PRs to see a specific type.
 (For example, the reviewers can see and review 'BUG' PRs first
 by using `is:open is:pr label:BUG`.)

 Of course, this can be done automatically without human
 intervention. Since we already have GitHub Jenkins job to access
 JIRA/GitHub, that job can add the labels from the beginning. If 
 needed, I
 can volunteer to update the script.

 To show the demo, I labeled several PRs manually. You can see the
 result right now in Apache Spark PR page.

   - https://github.com/apache/spark/pulls

 If you're surprised due to those manual activities, I want to
 apologize for that. I hope we can take advantage of the existing GitHub
 features to serve Apache Spark community in a way better than 
 yesterday.

 How do you think about this specific suggestion?

 Bests,
 Dongjoon

 PS. I saw that `Request Review` and `Assign` features are already
 used for some purposes, but these feature are out of the scope in this
 email.

>>>


Re: dynamic allocation manager in SS

2019-05-27 Thread Gabor Somogyi
K8s is a different story; please take a look at the "Future Work" part of the doc.

On Fri, May 24, 2019 at 9:40 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Btw the heuristics for batch mode (
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L289)
> vs
> streaming (
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L91-L92)
> are different. In batch mode you care about the numRunningOrPendingTasks while
> for streaming about the ratio: averageBatchProcTime.toDouble /
> batchDurationMs so there are some concerns beyond scaling down when idle.
> A scenario things might now work for batch dynamic allocation with SS is
> as follows. I start with a query that reads x kafka partitions and the data
> arriving is low and all tasks (1 per partition) are running since there are
> enough resources anyway.
> At some point the data increases per partition (maxOffsetsPerTrigger is
> high enough) and so processing takes more time. AFAIK SS will wait for a
> batch to finish before running the next (waits for the trigger to finish,
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L46
> ).
> In this case I suspect there is no scaling up with the batch dynamic
> allocation mode as there are no pending tasks, only processing time
> changed. In this case the streaming dynamic heuristics I think are better.
> Batch mode heuristics could work, if not mistaken, if you have multiple
> streaming queries and there are batches waiting (using fair-scheduling etc).
>
> PS. this has been discussed, not in depth, in the past on the list (
> https://mail-archives.apache.org/mod_mbox/spark-user/201708.mbox/%3c1503626484779-29104.p...@n3.nabble.com%3E
> )
>
>
>
>
> On Fri, May 24, 2019 at 9:22 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> I am on k8s where there is no support yet afaik, there is wip wrt the
>> shuffle service. So from your experience there are no issues with using the
>> batch dynamic allocation version like there was before with dstreams as
>> described in the related jira?
>>
>> Στις Παρ, 24 Μαΐ 2019, 8:28 μ.μ. ο χρήστης Gabor Somogyi <
>> gabor.g.somo...@gmail.com> έγραψε:
>>
>>> It scales down with yarn. Not sure how you've tested.
>>>
>>> On Fri, 24 May 2019, 19:10 Stavros Kontopoulos, <
>>> stavros.kontopou...@lightbend.com> wrote:
>>>
>>>> Yes nothing happens. In this case it could propagate info to the
>>>> resource manager to scale down the number of executors no? Just a thought.
>>>>
>>>> Στις Παρ, 24 Μαΐ 2019, 7:17 μ.μ. ο χρήστης Gabor Somogyi <
>>>> gabor.g.somo...@gmail.com> έγραψε:
>>>>
>>>>> Structured Streaming works differently. If no data arrives no tasks
>>>>> are executed (just had a case in this area).
>>>>>
>>>>> BR,
>>>>> G
>>>>>
>>>>>
>>>>> On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
>>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Some while ago the streaming dynamic allocation part was added in
>>>>>> DStreams(https://issues.apache.org/jira/browse/SPARK-12133)  to
>>>>>> improve the issues with the batch based one. Should this be ported
>>>>>> to structured streaming? Thoughts?
>>>>>> AFAIK there is no support in SS for it.
>>>>>>
>>>>>> Best,
>>>>>> Stavros
>>>>>>
>>>>>>
>
> --
> Stavros Kontopoulos
> *Principal Engineer*
> *Lightbend Platform <https://www.lightbend.com/lightbend-platform>*
> *mob: **+30 6977967274 <+30+6977967274>*
>
>


Re: dynamic allocation manager in SS

2019-05-24 Thread Gabor Somogyi
It scales down with YARN. Not sure how you've tested it.
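
For reference, a minimal sketch of the core dynamic allocation settings involved
when trying this on YARN (values are illustrative only; the external shuffle
service must be running on the YARN nodes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ss-dynamic-allocation-check")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")              // required for dynamic allocation on YARN
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s") // idle executors are released after this
  .getOrCreate()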

On Fri, 24 May 2019, 19:10 Stavros Kontopoulos, <
stavros.kontopou...@lightbend.com> wrote:

> Yes nothing happens. In this case it could propagate info to the resource
> manager to scale down the number of executors no? Just a thought.
>
> Στις Παρ, 24 Μαΐ 2019, 7:17 μ.μ. ο χρήστης Gabor Somogyi <
> gabor.g.somo...@gmail.com> έγραψε:
>
>> Structured Streaming works differently. If no data arrives no tasks are
>> executed (just had a case in this area).
>>
>> BR,
>> G
>>
>>
>> On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Hi,
>>>
>>> Some while ago the streaming dynamic allocation part was added in
>>> DStreams(https://issues.apache.org/jira/browse/SPARK-12133)  to improve
>>> the issues with the batch based one. Should this be ported to
>>> structured streaming? Thoughts?
>>> AFAIK there is no support in SS for it.
>>>
>>> Best,
>>> Stavros
>>>
>>>


Re: dynamic allocation manager in SS

2019-05-24 Thread Gabor Somogyi
Structured Streaming works differently: if no data arrives, no tasks are
executed (I just had a case in this area).

BR,
G


On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
stavros.kontopou...@lightbend.com> wrote:

> Hi,
>
> Some while ago the streaming dynamic allocation part was added in DStreams(
> https://issues.apache.org/jira/browse/SPARK-12133)  to improve the issues
> with the batch based one. Should this be ported to structured streaming?
> Thoughts?
> AFAIK there is no support in SS for it.
>
> Best,
> Stavros
>
>


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2019-05-20 Thread Gabor Somogyi
There is a PR for this, but it's not yet merged.

On Mon, May 20, 2019 at 10:13 AM 张万新  wrote:

> Hi there,
>
> I'd like to know what's the root reason why multiple aggregations on
> streaming dataframe is not allowed since it's a very useful feature, and
> flink has supported it for a long time.
>
> Thanks.
>


Jenkins issue

2019-03-22 Thread Gabor Somogyi
Hi All,

Seems like there is a Jenkins issue again. After a PR builder unit test
failure I'm not able to open the Jenkins page to take a look at the issue
(it gets stuck in an infinite wait).

BR,
G


Re: Contribution

2019-02-12 Thread Gabor Somogyi
Hi Valeria,

Welcome! Ping me if you need a review.

BR,
G


On Tue, Feb 12, 2019 at 2:51 PM Valeria Vasylieva <
valeria.vasyli...@gmail.com> wrote:

> Hi Gabor,
>
> Thank you for clarification! Will do it!
> I am happy to join the community!
>
> Best Regards,
> Valeria
>
> вт, 12 февр. 2019 г. в 16:32, Gabor Somogyi :
>
>> Hi Valeria,
>>
>> Glad to hear you would like to contribute! It will be assigned to you
>> when you create a PR.
>> Before you create it please read the following guide which describe the
>> details: https://spark.apache.org/contributing.html
>>
>> BR,
>> G
>>
>>
>> On Tue, Feb 12, 2019 at 2:28 PM Valeria Vasylieva <
>> valeria.vasyli...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> My name is Valeria Vasylieva and I would like to help with the task:
>>> https://issues.apache.org/jira/browse/SPARK-20597
>>>
>>> Please assign it to me, my JIRA account is:
>>> nimfadora (
>>> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)
>>>
>>> Thank you!
>>>
>>


Re: Contribution

2019-02-12 Thread Gabor Somogyi
Hi Valeria,

Glad to hear you would like to contribute! It will be assigned to you when
you create a PR.
Before you create it, please read the following guide, which describes the
details: https://spark.apache.org/contributing.html

BR,
G


On Tue, Feb 12, 2019 at 2:28 PM Valeria Vasylieva <
valeria.vasyli...@gmail.com> wrote:

> Hi!
>
> My name is Valeria Vasylieva and I would like to help with the task:
> https://issues.apache.org/jira/browse/SPARK-20597
>
> Please assign it to me, my JIRA account is:
> nimfadora (
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)
>
> Thank you!
>


Re: Welcome Jose Torres as a Spark committer

2019-01-30 Thread Gabor Somogyi
Congrats Jose!

BR,
G

On Wed, Jan 30, 2019 at 9:05 AM Nuthan Reddy 
wrote:

> Congrats Jose,
>
> Regards,
> Nuthan Reddy
>
>
>
> On Wed, Jan 30, 2019 at 1:22 PM Marco Gaido 
> wrote:
>
>> Congrats, Jose!
>>
>> Bests,
>> Marco
>>
>> Il giorno mer 30 gen 2019 alle ore 03:17 JackyLee  ha
>> scritto:
>>
>>> Congrats, Joe!
>>>
>>> Best,
>>> Jacky
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [build system] speeding up maven build building only changed modules compared to master branch

2019-01-28 Thread Gabor Somogyi
Do you have some numbers on how much faster this is? I'm asking because I've
previously evaluated another plugin and found the following:
- Incremental builds didn't bring much benefit, even in projects bigger than Spark
- Incremental testing was buggy, and sometimes the required tests were not
executed, which caused several issues
All in all, a single tiny bug in the incremental test selection could cause a lot
of pain for developers, so it must be rock solid.
Is this project used in production somewhere?

On Sat, Jan 26, 2019 at 4:03 PM Sean Owen  wrote:

> Sounds interesting; would it be able to handle R and Python modules built
> by this project ? The home grown solution here does I think and that is
> helpful.
>
> On Sat, Jan 26, 2019, 6:57 AM vaclavkosar 
>> I think it would be good idea to use gitflow-incremental-builder maven
>> plugin for Spark builds. It saves resources by building only modules
>> that are impacted by changes compared to git master branch via
>> gitflow-incremental-builder maven plugin. For example if there is only a
>> change introduced into on of files of spark-avro_2.11 then only that maven
>> module and its maven dependencies and dependents would be build or tested.
>> If there are no disagreements, I can submit a pull request for that.
>>
>>
>> Project URL: https://github.com/vackosar/gitflow-incremental-builder
>>
>> Disclaimer: I originally created the project. But most of recent
>> improvements and maintenance were deved by Falko.
>>
>


DSv2 question

2019-01-24 Thread Gabor Somogyi
Hi All,

Given org.apache.spark.sql.sources.v2.DataSourceOptions, which states the
following:

* An immutable string-to-string map in which keys are
case-insensitive. This is used to represent
* data source options.

Case-insensitivity can be achieved in many ways. The implementation provides a
lowercase-based solution.
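
For illustration, a minimal sketch of the behaviour in question (the option name
and value are made up):

import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.DataSourceOptions

val options = new DataSourceOptions(Map("checkpointLocation" -> "/tmp/cp").asJava)

// Lookups succeed regardless of casing because keys are lowercased internally.
options.get("checkpointlocation")  // Optional[/tmp/cp]
options.get("CheckpointLocation")  // Optional[/tmp/cp] -- same entry, different casing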

I've seen code that takes advantage of this implementation detail. My
questions are:

1. As the class only states case-insensitive, is the lowercasing subject to
change?
2. If it's not subject to change, wouldn't it be better to change
case-insensitive to lowercase or something similar?

I've seen a similar pattern on interfaces...

Thanks in advance!

BR,
G


Re: Reading compacted Kafka topic is slow

2019-01-24 Thread Gabor Somogyi
Hi Tomas,

I presume the 60 sec window means the trigger interval. A quick win could be
to try Structured Streaming, because there the trigger interval is optional.
If it is not specified, the system will check for availability of new data
as soon as the previous processing has completed.
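
A minimal sketch of what that looks like (broker, topic, paths and the rate limit
are made up; replace them with your own values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compacted-topic-reader").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // your broker list
  .option("subscribe", "compacted-topic")             // your compacted topic
  .option("maxOffsetsPerTrigger", "100000")           // cap the offset range per micro-batch
  .load()

val query = df.writeStream
  .format("console")                                  // replace with your real sink
  .option("checkpointLocation", "/tmp/compacted-topic-cp")
  .start()                                            // no .trigger(...): micro-batches run back-to-back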

BR,
G


On Thu, Jan 24, 2019 at 12:55 PM Tomas Bartalos 
wrote:

> Hello Spark folks,
>
> I'm reading compacted Kafka topic with spark 2.4, using direct stream -
> KafkaUtils.createDirectStream(...). I have configured necessary options for
> compacted stream, so its processed with CompactedKafkaRDDIterator.
> It works well, however in case of many gaps in the topic, the processing
> is very slow and 90% of time the executors are idle.
>
> I had a look to the source are here are my findings:
> Spark first computes number of records to stream from Kafka (processing
> rate * batch window size). # of records are translated to Kafka's
> (offset_from, offset_to) and eventually the Iterator reads records within
> the offset boundaries.
> This works fine until there are many gaps in the topic, which reduces the
> real number of processed records.
> Let's say we wanted to read 100k records in 60 sec window. With gaps it
> gets to 10k (because 90k are just compacted gaps) in 60 sec.
> As a result executor is working only 6 sec and 54 sec doing nothing.
> I'd like to utilize the executor as much as possible.
>
> A great feature would be to read 100k real records (skip the gaps) no
> matter what are the offsets.
>
> I've tried to make some improvement with backpressure and my custom
> RateEstimator (decorating PidRateEstimator and boosting the rate per
> second). And was even able to fully utilize the executors, but my approach
> have a big problem when compacted part of the topic meets non compacted
> part. The executor just tries to read a too big chunk of Kafka and the
> whole processing dies.
>
> BR,
> Tomas
>


Re: Behavior of checkpointLocation from options vs setting conf spark.sql.streaming.checkpointLocation

2018-12-12 Thread Gabor Somogyi
Hi Shubham,

I've just checked the latest master branch and I can confirm it works as
you've described.
As a workaround, one can read the query name from the checkpoint directory
structure and set it with .queryName(...) before restart.
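
A minimal sketch for scenario 2.1 below, reusing the spark session, rateDF and
paths from your mail (the rest is illustrative):

spark.conf.set("spark.sql.streaming.checkpointLocation",
  "/Users/shubham/checkpoint_from_conf")

val query = rateDF.writeStream
  .format("console")
  .outputMode("append")
  .queryName("q2")   // metadata ends up under .../checkpoint_from_conf/q2
  .start()           // the same queryName must be set again before restart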

BR,
G


On Tue, Dec 11, 2018 at 6:45 AM Shubham Chaurasia 
wrote:

> Hi,
>
> I would like to confirm checkpointing behavior, I have observed following
> scenarios:
>
> *1)* When I set checkpointLocation from streaming query like:
>
> val query =
> rateDF.writeStream.format("console").outputMode("append").trigger(Trigger.ProcessingTime("1
> seconds")).*option("checkpointLocation",
> "/Users/shubham/checkpoint_from_query1")*.queryName("q2").start
>
> It generates all the metadata in */Users/shubham/checkpoint_from_query1 
> *regardless
> of whether queryName is set or not.
>
> *2)* When I set it from conf like:  
> *spark.conf.set("spark.sql.streaming.checkpointLocation",
> "/Users/shubham/checkpoint_from_conf")*
>
> I observed two cases here:
> *2.1)* When I set the queryName like .queryName("q2"), it generates all
> metadata under */Users/shubham/checkpoint_from_conf/q2*
>
> *2.2)* When queryName is not set, it generates all metadata under
> */Users/shubham/checkpoint_from_conf/*
>
> I have seen query successfully recovers in scenario *1)* and *2.1) *which
> is fine.
> It does not recover from  *2.2) *which is also fine as it is unable to
> somehow get the query handle.
>
> Can there be any other possibility? Would like to confirm.
>
> Thanks,
> Shubham
>
>


Re: Implementation for exactly-once streaming sink

2018-12-06 Thread Gabor Somogyi
Hi Eric,

In order to have exactly-once semantics, one needs a replayable source and an
idempotent sink.
The cases you've mentioned cover the two main groups of issues.
Practically any kind of programming problem can end up producing duplicated
data (even in the code which feeds Kafka).
I'm not sure why you've asked this, because if the sink sees an already
processed key it should simply be skipped; it doesn't matter why it is
duplicated.
Cody has a really good write-up about the semantics:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#delivery-semantics
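
To make the "skip already processed" part concrete, here is a rough sketch of the
idea, independent of the exact DSv2 writer APIs. ExternalStore and its methods are
hypothetical, and it is shown per epoch for brevity; task-level retries need the
same check per (epochId, partitionId), or staged task output committed by the
driver:

import org.apache.spark.sql.Row

trait ExternalStore {
  def alreadyCommitted(queryId: String, epochId: Long): Boolean
  def writeAtomically(queryId: String, epochId: Long, rows: Iterator[Row]): Unit
}

def commitEpoch(store: ExternalStore, queryId: String, epochId: Long,
                rows: Iterator[Row]): Unit = {
  if (!store.alreadyCommitted(queryId, epochId)) {
    // Data and the commit marker must land in one transaction, otherwise a crash
    // in between re-introduces the duplication problem.
    store.writeAtomically(queryId, epochId, rows)
  }
  // else: replayed micro-batch after driver recovery, or a retried write; the
  // output already exists, so it is simply skipped.
}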

If you get to Continuous Processing, I think this is worth considering:
"There are currently no automatic retries of failed tasks. Any failure will
lead to the query being stopped and it needs to be manually restarted from
the checkpoint."

BR,
G


On Wed, Dec 5, 2018 at 8:36 PM Eric Wohlstadter  wrote:

> Hi all,
>  We are working on implementing a streaming sink on 2.3.1 with the
> DataSourceV2 APIs.
>
> Can anyone help check if my understanding is correct, with respect to the
> failure modes which need to be covered?
>
> We are assuming that a Reliable Receiver (such as Kafka) is used as the
> stream source. And we only want to support micro-batch execution at this
> time (not yet Continuous Processing).
>
> I believe the possible failures that need to be covered are:
>
> 1. Task failure: If a task fails, it may have written data to the sink
> output before failure. Subsequent attempts for a failed task must be
> idempotent, so that no data is duplicated in the output.
> 2. Driver failure: If the driver fails, upon recovery, it might replay a
> micro-batch that was already seen by the sink (if a failure occurs after
> the sink has committed output but before the driver has updated the
> checkpoint). In this case, the sink must be idempotent when a micro-batch
> is replayed so that no data is duplicated in the output.
>
> Are there any other cases where data might be duplicated in the stream?
> i.e. if neither of these 2 failures occur, is there still a case where
> data can be duplicated?
>
> Thanks for any help to check if my understanding is correct.
>
>
>
>
>


Re: [SS] No reponse on a PR: Report numOutputRows in SinkProgress

2018-11-30 Thread Gabor Somogyi
Hi Vaclav,

As far as I can see, the PR is conflicting at the moment.

BR,
G


On Fri, Nov 30, 2018 at 11:49 AM Vaclav Kosar  wrote:

> Fellow Structured Streamers,
>
> I am having trouble getting any feedback on my PR for reporting number
> of written rows in Structured Streaming:
> https://github.com/apache/spark/pull/21919
>
> Please help me either close or suggest a change to the PR.
>
> Thanks,
>
> Vaclav
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: SPIP: Support Kafka delegation token in Structured Streaming

2018-10-01 Thread Gabor Somogyi
Hi Saisai,

The reasons why I originally set the goal to Structured Streaming only are
the following:
 * I haven't seen big interest in new features in the DStream area
 * It keeps the concerns separated, even if there is a need on the DStream side

All in all, I'm happy to port the feature to DStreams if you think it's worth
it and you can help me, because there are no major differences there (at least
in this area).

BR,
G


On Sun, Sep 30, 2018 at 4:06 AM Saisai Shao  wrote:

> I like this proposal. Since Kafka already provides delegation token
> mechanism, we can also leverage Spark's delegation token framework to add
> Kafka as a built-in support.
>
> BTW I think there's no much difference in support structured streaming and
> DStream, maybe we can set both as goal.
>
> Thanks
> Saisai
>
> Gabor Somogyi  于2018年9月27日周四 下午7:58写道:
>
>> Hi all,
>>
>> I am writing this e-mail in order to discuss the delegation token support
>> for kafka feature which is reported in SPARK-25501
>> <https://issues.apache.org/jira/browse/SPARK-25501>. I've prepared a SPIP
>> <https://docs.google.com/document/d/1ouRayzaJf_N5VQtGhVq9FURXVmRpXzEEWYHob0ne3NY/edit?usp=sharing>
>>  for
>> it. PR is on the way...
>>
>> Looking forward to hear your feedback.
>>
>> BR,
>> G
>>
>>


Re: SPIP: Support Kafka delegation token in Structured Streaming

2018-10-01 Thread Gabor Somogyi
Hi Jungtaek,

Thanks for your comments, I've just responded to them.

BR,
G


On Sat, Sep 29, 2018 at 2:50 PM Jungtaek Lim  wrote:

> Hi Gabor,
>
> Thanks for proposing the feature. I'm definitely interested to see this
> feature, but honestly I'm not familiar with how Spark deals with delegation
> token for HDFS and HBase. I'll try to review the doc in general, and try to
> learn it, and review again based on understanding.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 2018년 9월 27일 (목) 오후 8:58, Gabor Somogyi 님이 작성:
>
>> Hi all,
>>
>> I am writing this e-mail in order to discuss the delegation token support
>> for kafka feature which is reported in SPARK-25501
>> <https://issues.apache.org/jira/browse/SPARK-25501>. I've prepared a SPIP
>> <https://docs.google.com/document/d/1ouRayzaJf_N5VQtGhVq9FURXVmRpXzEEWYHob0ne3NY/edit?usp=sharing>
>>  for
>> it. PR is on the way...
>>
>> Looking forward to hear your feedback.
>>
>> BR,
>> G
>>
>>


SPIP: Support Kafka delegation token in Structured Streaming

2018-09-27 Thread Gabor Somogyi
Hi all,

I am writing this e-mail in order to discuss the Kafka delegation token
support feature, which is reported in SPARK-25501
<https://issues.apache.org/jira/browse/SPARK-25501>. I've prepared a SPIP
<https://docs.google.com/document/d/1ouRayzaJf_N5VQtGhVq9FURXVmRpXzEEWYHob0ne3NY/edit?usp=sharing>
for it. A PR is on the way...

Looking forward to hearing your feedback.

BR,
G