Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-31 Thread Chitral Verma
+1

On Wed, 31 Oct 2018 at 11:56, Reynold Xin  wrote:

> +1
>
> Look forward to the release!
>
>
>
> On Mon, Oct 29, 2018 at 3:22 AM Wenchen Fan  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until November 1 PST and passes if a majority of +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc5 (commit
>> 0a4c03f7d084f1d2aa48673b99f3b9496893ce8d):
>> https://github.com/apache/spark/tree/v2.4.0-rc5
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1291
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>
>> FAQ
>>
>> =========================
>> How can I help test this release?
>> =========================
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
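>>
>> As a minimal sketch for the Java/Scala case (the resolver URL is the
>> staging repository above; the spark-sql dependency line is just
>> illustrative -- use whichever Spark modules your project depends on),
>> a build.sbt for testing against the RC could contain:
>>
>> resolvers += "Spark 2.4.0 RC5 staging" at
>>   "https://repository.apache.org/content/repositories/orgapachespark-1291/"
>>
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"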
>>
>> ===========================================
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===========================================
>>
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==================
>> But my bug isn't fixed?
>> ==================
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>


[Discussion] Clarification regarding Stateful Aggregations over Structured Streaming

2018-12-16 Thread Chitral Verma
Hi Devs,

For quite some time I've been looking at the Structured Streaming API to
solve lots of use cases at my workplace. I have some doubts I wanted to
clarify regarding stateful aggregations over Structured Streaming.

Currently, Spark provides the flatMapGroupsWithState (FMGWS) /
mapGroupsWithState (MGWS) APIs to allow custom streaming aggregations by
setting/updating an intermediate `GroupState`, which may or may not expire.
This GroupState is stored in the form of snapshots, and the latest snapshot
is kept entirely in memory, which can be a memory-consuming approach and may
result in OOMs.

Other than this, in my opinion, FMGWS is not very flexible in terms of
usage (aggregation logic needs to be hand-written over the grouped rows, and
Spark SQL's built-in functions cannot be used), and the timeouts require the
query to progress in order to expire keys.
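
As a minimal sketch of the timeout point (not from the original mail; the
Event/UserState names and the rate source are illustrative), note that the
timed-out branch below only runs when another micro-batch is processed, so
an idle query never actually expires its keys:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(user: String, value: Long)
case class UserState(total: Long)

val spark = SparkSession.builder().appName("fmgws-sketch").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").load()
  .selectExpr("CAST(value % 10 AS STRING) AS user", "value")
  .as[Event]

val totals = events
  .groupByKey(_.user)
  .mapGroupsWithState[UserState, (String, Long)](
    GroupStateTimeout.ProcessingTimeTimeout) { (user, rows, state) =>
    if (state.hasTimedOut) {
      val last = state.get.total
      state.remove()  // key expiry happens here, i.e. only while the query progresses
      (user, last)
    } else {
      val total = state.getOption.map(_.total).getOrElse(0L) + rows.map(_.value).sum
      state.update(UserState(total))
      state.setTimeoutDuration("10 minutes")
      (user, total)
    }
  }
// run via totals.writeStream.outputMode("update")... (sink omitted)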

To remedy this, I have contributed to this project
<https://github.com/chermenin/spark-states>, which basically moves the
expiration logic to the state store (RocksDB); the state store is no longer
managed by the executor JVM, allowing true expiration of state with
nanosecond precision.

My question is: is there a specific reason the FMGWS API is designed the way
it is, and are there any downsides to the approach I have mentioned above?

Do let me know your thoughts.

Thanks


Re: [Discussion] Clarification regarding Stateful Aggregations over Structured Streaming

2018-12-16 Thread Chitral Verma
Thanks, Stavros, for the clarification. I'll create some documentation for
this and raise it as an enhancement issue with a pull request.

Meanwhile, if users want to use this functionality, they can always add
spark-states <https://github.com/chermenin/spark-states> as a dependency
and use it.
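
A hedged sketch of that setup (not from the original mail): Spark's
spark.sql.streaming.stateStore.providerClass setting swaps in a custom state
store provider, and spark-states ships a RocksDB-backed one. The artifact
coordinates and provider class name below are recollections of the project's
README, so treat them as assumptions and verify against the repository:

// build.sbt (assumed coordinates):
// libraryDependencies += "ru.chermenin" %% "spark-states" % "0.1"

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-state-store")
  // Assumed FQCN from the spark-states README -- verify before use.
  .config("spark.sql.streaming.stateStore.providerClass",
    "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
  .getOrCreate()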

On Mon, 17 Dec 2018 at 03:10, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Hi,
>
> The Databricks runtime, as you already know, has this enhancement, so it is
> considered a good option if you want to decouple state from the JVM.
> Some arguments for why to do so are given by the Flink paper, along with
> incremental snapshotting:
> http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf. Also, timers
> implemented in RocksDB can give you higher scalability with very large
> states (and many timers). I am not aware of the history behind the FMGWS
> API (others could provide more info), but I was also looking at the API
> recently, thinking of an API for this:
> https://issues.apache.org/jira/browse/SPARK-16738
>
> Best,
> Stavros
>
> On Sun, Dec 16, 2018 at 7:58 PM Chitral Verma 
> wrote:
>
>> Hi Devs,
>>
>> For quite some time I've been looking at the Structured Streaming API to
>> solve lots of use cases at my workplace. I have some doubts I wanted to
>> clarify regarding stateful aggregations over Structured Streaming.
>>
>> Currently, Spark provides the flatMapGroupsWithState (FMGWS) /
>> mapGroupsWithState (MGWS) APIs to allow custom streaming aggregations by
>> setting/updating an intermediate `GroupState`, which may or may not expire.
>> This GroupState is stored in the form of snapshots, and the latest snapshot
>> is kept entirely in memory, which can be a memory-consuming approach and
>> may result in OOMs.
>>
>> Other than this, in my opinion, FMGWS is not very flexible in terms of
>> usage (aggregation logic needs to be hand-written over the grouped rows,
>> and Spark SQL's built-in functions cannot be used), and the timeouts
>> require the query to progress in order to expire keys.
>>
>> To remedy this, I have contributed to this project
>> <https://github.com/chermenin/spark-states>, which basically moves the
>> expiration logic to the state store (RocksDB); the state store is no longer
>> managed by the executor JVM, allowing true expiration of state with
>> nanosecond precision.
>>
>> My question is: is there a specific reason the FMGWS API is designed the
>> way it is, and are there any downsides to the approach I have mentioned
>> above?
>>
>> Do let me know your thoughts.
>>
>> Thanks
>>
>
>
>
>


Query regarding stateless aggregations

2019-11-28 Thread Chitral Verma
Hi Devs,
I have a query regarding stateless aggregations.

I understand that it's possible to do stateless aggregation using the
mapGroups and flatMapGroups APIs in Spark 2.x+. I want to use aggregate
queries on a registered streaming temporary view. Is there any way to do the
same using spark.sql(" ... ")?

Also posted here,
https://stackoverflow.com/questions/59050663/is-it-possible-to-do-stateless-aggregations-using-spark-sql

Any help will be appreciated.

Regards,
Chitral Verma


[New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Chitral Verma
Hi All,
I worked on this idea a few years back as a pet project to bridge *SparkSQL*
and *SparkML*, empowering anyone with SQL skills to implement
production-grade, distributed machine learning over Apache Spark.

In principle, the idea works exactly like Google's BigQuery ML, but with a
much wider scope and no vendor lock-in, on basically every source that's
supported by Spark, in the cloud or on-prem.

*Training* an ML model can look like:

FIT 'LogisticRegression' ESTIMATOR WITH PARAMS(maxIter = 3) TO (
SELECT * FROM mlDataset) AND OVERWRITE AT LOCATION '/path/to/lr-model';

*Prediction* with an ML model can look like:

PREDICT FOR (SELECT * FROM mlTestDataset) USING MODEL STORED AT
LOCATION '/path/to/lr-model'

*Feature Preprocessing* can look like:

TRANSFORM (SELECT * FROM dataset) USING 'StopWordsRemover' TRANSFORMER WITH
PARAMS (inputCol='raw', outputCol='filtered') AND WRITE AT LOCATION
'/path/to/test-transformer'


But a lot more can be done with this library.

I was wondering if any of you find this interesting and would like to
contribute to the project here,

https://github.com/chitralverma/sparksql-ml


Regards,
Chitral Verma


Re: Slack for Spark Community: Merging various threads

2023-04-10 Thread Chitral Verma
Hi all,
Thanks for starting a discussion on this super-important topic.

I'm not sure if this has already been considered, but Discord is also a viable
option, and many open-source projects and communities are using it.

   - It's *mostly* free, with no online-user limitations like Slack's.
   - Has a big feature overlap with Slack.
   - The sign-up process is simple and straightforward. Clients for web,
   desktop, and mobile are available.
   - Policies and moderation can possibly be handled via bots plus designated
   channel managers.
   - Allows channels, like Slack, for organising messages.


On Mon, 10 Apr 2023 at 08:09, Dongjoon Hyun  wrote:

> Thank you, Holden, Bjorn, Maciej.
>
> Yes, those are also valid.
>
> Dongjoon.
>
> On Sat, Apr 8, 2023 at 4:20 AM Maciej  wrote:
>
>> @Bjørn Matrix (represented by Element in the linked summary; also, since
>> last year, Rocket.Chat uses Matrix under the hood) is already used for
>> user support and related discussions by a number of large projects, since
>> Gitter migrated there. And it is not like we need Slack or its replacement
>> in the first place. Some of Slack's features are useful for us, but it's
>> not exactly the best tool for user support IMHO.
>>
>> @Dongjoon There are probably two more things we should discuss:
>>
>>- What are the data privacy obligations of keeping a communication
>>channel, advertised as official, outside the ASF? Does it put it out of
>>scope for the ASF legal and data privacy teams?
>>
>>If I recall correctly, Slack requires at least some of the formalities
>>to be handled by the primary owner, and as far as I am aware the
>>project is not a legal person. I'm not sure how linen.dev or another
>>retention tool fits into all of that, but it's unrealistic to expect that
>>it makes things easier.
>>
>>This might sound hypothetical, but we've already seen users leaking
>>sensitive information on the mailing list and requesting erasure (which,
>>luckily for us, is not technically possible).
>>
>>- How are we going to handle moderation, if we assume a number of users
>>comparable to the Delta Lake Slack and open registration? At a minimum,
>>we have to ensure that the ASF Code of Conduct is respected. Official
>>channel or not, failure to do that reflects badly on the project, the ASF,
>>and all of us.
>>
>> --
>> Maciej
>>
>>
>>
>> On 4/7/23 21:02, Bjørn Jørgensen wrote:
>>
>> Yes, I have done some searching for Slack alternatives.
>> I feel that we should do some searching, to find out if there can be a
>> better solution than Slack.
>> From what I have found, there are two that can be an alternative to
>> Slack.
>>
>> Rocket.Chat  
>>
>> and
>>
>> Zulip Chat 
>> Zulip Cloud Standard is free for open-source projects
>> 
>> Which means we get:
>>
>>- Unlimited search history
>>- File storage up to 10 GB per user
>>- Message retention policies
>>
>>- Brand Zulip with your logo
>>- Priority commercial support
>>- Funds the Zulip open source project
>>
>>
>> Rust is using Zulip.
>>
>> We can import chats from Slack.
>> We can use Zulip for events. With multi-use
>> invite links, there's no need
>> to create individual Zulip invitations. This means that the PMC doesn't
>> have to send a link to every user.
>> CODE BLOCKS
>>
>> Discuss code with ease using Markdown code blocks, syntax highlighting,
>> and code playgrounds.
>>
>>
>>
>>
>>
>>
>> On Fri, 7 Apr 2023 at 18:54, Holden Karau  wrote:
>>
>>> I think there was some concern around how to make any sync channel show
>>> up in logs / index / search results?
>>>
>>> On Fri, Apr 7, 2023 at 9:41 AM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you, All.

 I'm very satisfied that we are now focused on the right questions about the
 real issues, with the irrelevant claims removed. :)

 Let me collect your relevant comments simply.


 # Category 1: Invitation Hurdle

 > The key question here is whether PMC members have the bandwidth to
 invite everyone in user@ and dev@?

 > Extending this to inviting everyone on user@ (over 4k subscribers
 according to the previous thread) might be a stretch,

 > we should have an official project Slack with an easy invitation
 process.


 # Category 2: Controllability

 > Additionally, there is no indication that the-asf.slack.com is
 intended for general support.

 > I would also lean towards a standalone workspace, where we have more
 control over organizing the channels,
>

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
Try EXPLAIN CODEGEN on your DF and then parse the resulting plan string.
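
A hedged sketch of that approach (the table and query here are illustrative,
not from the original thread): in Spark's EXPLAIN output, operators compiled
by whole-stage codegen are printed as "*(N) Operator", where N is the
WholeStageCodegen id, so a regex over the plan text recovers the ids.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("codegen-ids").getOrCreate()
import spark.implicits._

Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "name").createOrReplaceTempView("t")

// EXPLAIN returns the physical plan as a single string column.
val plan = spark.sql("EXPLAIN SELECT name, count(*) FROM t GROUP BY name")
  .collect().head.getString(0)

// Whole-stage-codegen'd operators appear as "*(N) ..." in the plan text.
val codegenIds = """\*\((\d+)\)""".r
  .findAllMatchIn(plan).map(_.group(1).toInt).toSeq.distinct

println(s"WholeStageCodegen ids: ${codegenIds.mkString(", ")}")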

On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu,  wrote:

> Hi,
>
> The detailed stage page shows the involved WholeStageCodegen ids in its
> DAG visualization in the Spark UI when running Spark SQL (e.g., under
> the link
> node:18088/history/application_1663600377480_62091/stages/stage/?id=1&attempt=0).
>
> However, I have trouble extracting the WholeStageCodegen ids from the DAG
> visualization via the REST APIs. Is there any other way to get the
> WholeStageCodegen id information for each stage automatically?
>
> Cheers,
> Chenghao
>