Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-09 Thread Chetan Khatri
Hello Jayant,

Thank you so much for the suggestion. My view was to use a Python function as
a transformation which can take a couple of column names and return an object,
which you explained. Would it be possible to point me to a similar codebase
example?

Thanks.
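
For readers following along: below is a minimal sketch of the .pipe(.py)
approach Jayant describes in the quoted reply, assuming a hypothetical
transform.py script that is present on every worker, reads CSV lines from
stdin, and writes CSV lines to stdout. The column names and schema are
illustrative only, not from the thread.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pipe-example").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("name", "count")

    // 1. Serialize each row to a CSV string for the script's stdin.
    val csvRdd = df.rdd.map(row => row.mkString(","))

    // 2. Pipe the lines through the Python script; each line it prints
    //    to stdout becomes one element of the resulting RDD[String].
    val pipedRdd = csvRdd.pipe("./transform.py")

    // 3. Parse the returned strings and convert back to a DataFrame.
    val resultDf = pipedRdd
      .map(_.split(","))
      .map(arr => (arr(0), arr(1).toInt))
      .toDF("name", "count")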

On Fri, Jul 6, 2018 at 2:56 AM, Jayant Shekhar 
wrote:

> Hello Chetan,
>
> We have currently done it with .pipe(.py) as Prem suggested.
>
> That passes the RDD as CSV strings to the Python script. The Python script
> can either process it line by line, create the result, and return it back,
> or build something like a Pandas DataFrame for processing and finally write
> the results back.
>
> In the Spark/Scala/Java code, you get back an RDD of strings, which we
> convert to a DataFrame.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri wrote:
>
>> Sure Prem, thanks for the suggestion.
>>
>> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure  wrote:
>>
>>> try .pipe(.py) on RDD
>>>
>>> Thanks,
>>> Prem
>>>
>>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
 Can someone please advise? Thanks.

 On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
 wrote:

> Hello Dear Spark User / Dev,
>
> I would like to pass a Python user-defined function to a Spark job
> developed using Scala, and have the return value of that function available
> to the DF / Dataset API.
>
> Can someone please guide me on the best approach to do this? The Python
> function would mostly be a transformation function. I would also like to
> pass a Java function as a string to the Spark / Scala job, have it applied
> to an RDD / DataFrame, and have it return an RDD / DataFrame.
>
> Thank you.
>
>
>
>
>>>
>>
>


Re: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Shivaram Venkataraman
I don't think we need to respin 2.2.2 -- given that 2.3.2 is on the way,
we can just submit that.

Shivaram
On Mon, Jul 9, 2018 at 6:19 PM Tom Graves  wrote:
>
> Is there any way to push it to CRAN without this fix? I don't really want
> to respin 2.2.2 just for the test fix.
>
> Tom
>
> On Monday, July 9, 2018, 4:50:18 PM CDT, Shivaram Venkataraman 
>  wrote:
>
>
> Yes. I think Felix checked in a fix to ignore tests run on Java
> versions that are not Java 8 (I think the fix was in
> https://github.com/apache/spark/pull/21666 which is in 2.3.2)
>
> Shivaram
> On Mon, Jul 9, 2018 at 5:39 PM Sean Owen  wrote:
> >
> > Yes, this flavor of error should only come up in Java 9. Spark doesn't 
> > support that. Is there any way to tell CRAN this should not be tested?
> >
> > On Mon, Jul 9, 2018, 4:17 PM Shivaram Venkataraman 
> >  wrote:
> >>
> >> The upcoming 2.2.2 release was submitted to CRAN. I think there are
> >> some known issues on Windows, but does anybody know what the following
> >> error with Netty is?
> >>
> >> >WARNING: Illegal reflective access by 
> >> > io.netty.util.internal.PlatformDependent0$1 
> >> > (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
> >> >  to field java.nio.Buffer.address
> >>
> >> Thanks
> >> Shivaram
> >>
> >>
> >> -- Forwarded message -
> >> From: 
> >> Date: Mon, Jul 9, 2018 at 12:12 PM
> >> Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
> >> To: 
> >> Cc: 
> >>
> >>
> >> Dear maintainer,
> >>
> >> package SparkR_2.2.2.tar.gz does not pass the incoming checks
> >> automatically, please see the following pre-tests:
> >> Windows: 
> >> 
> >> Status: 1 ERROR, 1 WARNING
> >> Debian: 
> >> 
> >> Status: 1 ERROR, 2 WARNINGs
> >>
> >> Last released version's CRAN status: ERROR: 1, OK: 1
> >> See: 
> >>
> >> CRAN Web: 
> >>
> >> Please fix all problems and resubmit a fixed version via the webform.
> >> If you are not sure how to fix the problems shown, please ask for help
> >> on the R-package-devel mailing list:
> >> 
> >> If you are fairly certain the rejection is a false positive, please
> >> reply-all to this message and explain.
> >>
> >> More details are given in the directory:
> >> 
> >> The files will be removed after roughly 7 days.
> >>
> >> No strong reverse dependencies to be checked.
> >>
> >> Best regards,
> >> CRAN teams' auto-check service
> >> Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
> >> Check: CRAN incoming feasibility, Result: WARNING
> >>  Maintainer: 'Shivaram Venkataraman '
> >>
> >>  New submission
> >>
> >>  Package was archived on CRAN
> >>
> >>  Insufficient package version (submitted: 2.2.2, existing: 2.3.0)
> >>
> >>  Possibly mis-spelled words in DESCRIPTION:
> >>Frontend (4:10, 5:28)
> >>
> >>  CRAN repository db overrides:
> >>X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
> >>  corrected despite reminders.
> >>
> >>  Found the following (possibly) invalid URLs:
> >>URL: http://spark.apache.org/docs/latest/api/R/mean.html
> >>  From: inst/doc/sparkr-vignettes.html
> >>  Status: 404
> >>  Message: Not Found
> >>
> >> Flavor: r-devel-windows-ix86+x86_64
> >> Check: running tests for arch 'x64', Result: ERROR
> >>  Running 'run-all.R' [175s]
> >>  Running the tests in 'tests/run-all.R' failed.
> >>  Complete output:
> >>> #
> >>> # Licensed to the Apache Software Foundation (ASF) under one or more
> >>> # contributor license agreements.  See the NOTICE file distributed with
> >>> # this work for additional information regarding copyright ownership.
> >>> # The ASF licenses this file to You under the Apache License, Version 2.0
> >>> # (the "License"); you may not use this file except in compliance with
> >>> # the License.  You may obtain a copy of the License at
> >>> #
> >>> #http://www.apache.org/licenses/LICENSE-2.0
> >>> #
> >>> # Unless required by applicable law or agreed to in writing, software
> >>> # distributed under the License is distributed on an "AS IS" BASIS,
> >>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> >>> # See the License for the specific language governing permissions and
> >>> # limitations under the License.
> >>> #
> >>>
> >>> library(testthat)
> >>> library(SparkR)
> >>
> >>Attaching package: 'SparkR'
> >>
> >>The following object is masked from 'package:testthat':
> >>
> >>describe
> >>
> 

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Shivaram Venkataraman
Yes. I think Felix checked in a fix to ignore tests run on Java
versions that are not Java 8 (I think the fix was in
https://github.com/apache/spark/pull/21666 which is in 2.3.2)

Shivaram
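
For context, a minimal sketch of the general idea behind that fix. The actual
change in PR 21666 lives in SparkR's R test utilities; this is only an
illustration, in Scala, of gating work on the JVM version:

    // Detect the JVM spec version ("1.8" for Java 8, "9" and up afterwards)
    // and skip version-sensitive tests on anything that is not Java 8.
    val specVersion = sys.props("java.specification.version")
    val isJava8 = specVersion == "1.8"
    if (!isJava8) {
      println(s"Skipping tests: unsupported Java version $specVersion")
    }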
On Mon, Jul 9, 2018 at 5:39 PM Sean Owen  wrote:
>
> Yes, this flavor of error should only come up in Java 9. Spark doesn't 
> support that. Is there any way to tell CRAN this should not be tested?
>
> On Mon, Jul 9, 2018, 4:17 PM Shivaram Venkataraman 
>  wrote:
>>
>> The upcoming 2.2.2 release was submitted to CRAN. I think there are
>> some known issues on Windows, but does anybody know what the following
>> error with Netty is?
>>
>> > WARNING: Illegal reflective access by 
>> > io.netty.util.internal.PlatformDependent0$1 
>> > (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
>> >  to field java.nio.Buffer.address
>>
>> Thanks
>> Shivaram
>>
>>
>> -- Forwarded message -
>> From: 
>> Date: Mon, Jul 9, 2018 at 12:12 PM
>> Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
>> To: 
>> Cc: 
>>
>>
>> Dear maintainer,
>>
>> package SparkR_2.2.2.tar.gz does not pass the incoming checks
>> automatically, please see the following pre-tests:
>> Windows: 
>> 
>> Status: 1 ERROR, 1 WARNING
>> Debian: 
>> 
>> Status: 1 ERROR, 2 WARNINGs
>>
>> Last released version's CRAN status: ERROR: 1, OK: 1
>> See: 
>>
>> CRAN Web: 
>>
>> Please fix all problems and resubmit a fixed version via the webform.
>> If you are not sure how to fix the problems shown, please ask for help
>> on the R-package-devel mailing list:
>> 
>> If you are fairly certain the rejection is a false positive, please
>> reply-all to this message and explain.
>>
>> More details are given in the directory:
>> 
>> The files will be removed after roughly 7 days.
>>
>> No strong reverse dependencies to be checked.
>>
>> Best regards,
>> CRAN teams' auto-check service
>> Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
>> Check: CRAN incoming feasibility, Result: WARNING
>>   Maintainer: 'Shivaram Venkataraman '
>>
>>   New submission
>>
>>   Package was archived on CRAN
>>
>>   Insufficient package version (submitted: 2.2.2, existing: 2.3.0)
>>
>>   Possibly mis-spelled words in DESCRIPTION:
>> Frontend (4:10, 5:28)
>>
>>   CRAN repository db overrides:
>> X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
>>   corrected despite reminders.
>>
>>   Found the following (possibly) invalid URLs:
>> URL: http://spark.apache.org/docs/latest/api/R/mean.html
>>   From: inst/doc/sparkr-vignettes.html
>>   Status: 404
>>   Message: Not Found
>>
>> Flavor: r-devel-windows-ix86+x86_64
>> Check: running tests for arch 'x64', Result: ERROR
>> Running 'run-all.R' [175s]
>>   Running the tests in 'tests/run-all.R' failed.
>>   Complete output:
>> > #
>> > # Licensed to the Apache Software Foundation (ASF) under one or more
>> > # contributor license agreements.  See the NOTICE file distributed with
>> > # this work for additional information regarding copyright ownership.
>> > # The ASF licenses this file to You under the Apache License, Version 2.0
>> > # (the "License"); you may not use this file except in compliance with
>> > # the License.  You may obtain a copy of the License at
>> > #
>> > #http://www.apache.org/licenses/LICENSE-2.0
>> > #
>> > # Unless required by applicable law or agreed to in writing, software
>> > # distributed under the License is distributed on an "AS IS" BASIS,
>> > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> > # See the License for the specific language governing permissions and
>> > # limitations under the License.
>> > #
>> >
>> > library(testthat)
>> > library(SparkR)
>>
>> Attaching package: 'SparkR'
>>
>> The following object is masked from 'package:testthat':
>>
>> describe
>>
>> The following objects are masked from 'package:stats':
>>
>> cov, filter, lag, na.omit, predict, sd, var, window
>>
>> The following objects are masked from 'package:base':
>>
>> as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
>> rank, rbind, sample, startsWith, subset, summary, transform, union
>>
>> >
>> > # Turn all warnings into errors
>> > options("warn" = 2)
>> >
>> > if (.Platform$OS.type == "windows") {
>> +   Sys.setenv(TZ = "GMT")
>> + }
>> >
>> > # Setup global test 

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Felix Cheung
I recall this might be a problem running Spark on Java 9.



From: Shivaram Venkataraman 
Sent: Monday, July 9, 2018 2:17 PM
To: dev; Felix Cheung; Tom Graves
Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

The upcoming 2.2.2 release was submitted to CRAN. I think there are
some known issues on Windows, but does anybody know what the following
error with Netty is?

> WARNING: Illegal reflective access by 
> io.netty.util.internal.PlatformDependent0$1 
> (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
>  to field java.nio.Buffer.address

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jul 9, 2018 at 12:12 PM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
To: 
Cc: 


Dear maintainer,

package SparkR_2.2.2.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 

Status: 1 ERROR, 1 WARNING
Debian: 

Status: 1 ERROR, 2 WARNINGs

Last released version's CRAN status: ERROR: 1, OK: 1
See: 

CRAN Web: 

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:

If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:

The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: WARNING
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Insufficient package version (submitted: 2.2.2, existing: 2.3.0)

  Possibly mis-spelled words in DESCRIPTION:
    Frontend (4:10, 5:28)

  CRAN repository db overrides:
    X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
      corrected despite reminders.

  Found the following (possibly) invalid URLs:
    URL: http://spark.apache.org/docs/latest/api/R/mean.html
      From: inst/doc/sparkr-vignettes.html
      Status: 404
      Message: Not Found

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'x64', Result: ERROR
  Running 'run-all.R' [175s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements. See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License. You may obtain a copy of the License at
> #
> # http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following object is masked from 'package:testthat':

describe

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+ Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz files or
> # directories left behind on CRAN machines. For Jenkins we should
> # already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar file and Spark
package directory if they exist.
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://mirror.dkd.de/apache/spark
Downloading spark-2.2.2 for Hadoop 2.7 from:
- http://mirror.dkd.de/apache/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz
trying URL 
'http://mirror.dkd.de/apache/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz'
Content type 

Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2

2018-07-09 Thread Shivaram Venkataraman
The upcoming 2.2.2 release was submitted to CRAN. I think there are
some known issues on Windows, but does anybody know what the following
error with Netty is?

> WARNING: Illegal reflective access by 
> io.netty.util.internal.PlatformDependent0$1 
> (file:/home/hornik/.cache/spark/spark-2.2.2-bin-hadoop2.7/jars/netty-all-4.0.43.Final.jar)
>  to field java.nio.Buffer.address

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jul 9, 2018 at 12:12 PM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.2.2
To: 
Cc: 


Dear maintainer,

package SparkR_2.2.2.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 

Status: 1 ERROR, 1 WARNING
Debian: 

Status: 1 ERROR, 2 WARNINGs

Last released version's CRAN status: ERROR: 1, OK: 1
See: 

CRAN Web: 

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:

If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:

The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: WARNING
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Insufficient package version (submitted: 2.2.2, existing: 2.3.0)

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

  Found the following (possibly) invalid URLs:
URL: http://spark.apache.org/docs/latest/api/R/mean.html
  From: inst/doc/sparkr-vignettes.html
  Status: 404
  Message: Not Found

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'x64', Result: ERROR
Running 'run-all.R' [175s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following object is masked from 'package:testthat':

describe

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+   Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz files or
> # directories left behind on CRAN machines. For Jenkins we should
> # already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar file and Spark
package directory if they exist.
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://mirror.dkd.de/apache/spark
Downloading spark-2.2.2 for Hadoop 2.7 from:
- 
http://mirror.dkd.de/apache/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz
trying URL 

Re: [build system] taking ubuntu workers offline for docker update

2018-07-09 Thread shane knapp
this is done.

On Mon, Jul 9, 2018 at 6:48 PM, shane knapp  wrote:

> we need to update docker to something more modern (17.05.0-ce ->
> 18.03.1-ce), so i have taken the two ubuntu workers offline and once the
> current builds finish, i will perform the update.
>
> this shouldn't take more than an hour.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] taking ubuntu workers offline for docker update

2018-07-09 Thread shane knapp
we need to update docker to something more modern (17.05.0-ce ->
18.03.1-ce), so i have taken the two ubuntu workers offline and once the
current builds finish, i will perform the update.

this shouldn't take more than an hour.

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Register catalyst expression as SQL DSL

2018-07-09 Thread geoHeil
Hi,

I would like to register custom Catalyst expressions as SQL DSL functions:
https://stackoverflow.com/questions/51199761/spark-register-expression-for-sql-dsl
Can someone shed some light here? The documentation does not seem to contain
much information about Catalyst internals.

Thanks a lot.
Georg
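
One possible direction, sketched under the assumption of Spark 2.3's internal
(and unstable) API: register an Expression builder with the session's
FunctionRegistry. The built-in Upper expression stands in here for a custom
Catalyst expression, and the function name my_upper is made up. Note the
registerFunction signature differs across Spark versions (a plain String name
in 2.2, a FunctionIdentifier in 2.3+), so treat this as a sketch only.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.FunctionIdentifier
    import org.apache.spark.sql.catalyst.expressions.{Expression, Upper}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // A builder turns the parsed argument expressions into your expression;
    // Upper is used as a stand-in for a custom Catalyst expression.
    val builder: Seq[Expression] => Expression = args => Upper(args.head)

    spark.sessionState.functionRegistry
      .registerFunction(FunctionIdentifier("my_upper"), builder)

    // The expression is now usable from SQL:
    spark.sql("SELECT my_upper('hello')").show()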






Unsubscribe

2018-07-09 Thread Prateek Goel



Register now for ApacheCon and save $250

2018-07-09 Thread Rich Bowen

Greetings, Apache software enthusiasts!

(You’re getting this because you’re on one or more dev@ or users@ lists 
for some Apache Software Foundation project.)


ApacheCon North America, in Montreal, is now just 80 days away, and 
early bird prices end in just two weeks - on July 21. Prices will be 
going up from $550 to $800 so register NOW to save $250, at 
http://apachecon.com/acna18


And don’t forget to reserve your hotel room. We have negotiated a 
special rate and the room block closes August 24. 
http://www.apachecon.com/acna18/venue.html


Our schedule includes over 100 talks, and we’ll be featuring talks from 
dozens of ASF projects. We have inspiring keynotes from some of the 
brilliant members of our community and the wider tech space, including:


 * Myrle Krantz, PMC chair for Apache Fineract, and leader in the open 
source financing space
 * Cliff Schmidt, founder of Literacy Bridge (now Amplio) and creator 
of the Talking Book project

 * Bridget Kromhout, principal cloud developer advocate at Microsoft
 * Euan McLeod, Comcast engineer, and pioneer in streaming video

We’ll also be featuring tracks for Geospatial science, Tomcat, 
Cloudstack, and Big Data, as well as numerous other fields where Apache 
software is leading the way. See the full schedule at 
http://apachecon.com/acna18/schedule.html


As usual we’ll be running our Apache BarCamp, the traditional ApacheCon 
Hackathon, and the Wednesday evening Lightning Talks, too, so you’ll want 
to be there.


Register today at http://apachecon.com/acna18 and we’ll see you in Montreal!

--
Rich Bowen
VP, Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon




Re: Asking for reviewing PRs regarding structured streaming

2018-07-09 Thread Jungtaek Lim
Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
option to enable optimization of state size in streaming aggregation
without hurting performance.

The idea is to remove the key fields from the value, since they are duplicated
between key and value in the state row. This requires additional operations
like projection and join, but the smaller state row also gives a performance
benefit, so the two can offset each other.

Please refer to the comment in the JIRA issue [2] for the numbers from a
simple perf test.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24763


On Fri, Jul 6, 2018 at 1:54 PM, Jungtaek Lim wrote:

> Ted Yu suggested posting the improved numbers to this thread and I think
> it's a good idea, so I'm also posting them here. I also think explaining the
> rationale behind my issues would help show why I'm submitting a couple of
> patches, so I'll explain that first. (Sorry to post a wall of text.)
>
> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of the HDFS state
> store provider from 10x~80x of the size of state for a batch (depending on
> the stateful workload) to less than or around 2x. The new option is flexible,
> so it can even be around 1x, or can effectively disable the cache. Please
> refer to the comment in the PR [2] for more details. (It is hard to post
> detailed numbers in mail format, so I link a GitHub comment instead.)
>
> I have an interest in stateful stream processing in Structured Streaming,
> and have been learning from the codebase as well as analyzing memory usage
> and latency (while I admit it is hard to measure latency correctly...).
>
>
> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>
> While taking a look at HDFSBackedStateStoreProvider, I noticed a kind of
> excessive caching. As I described in the section "The impact of storing
> multiple versions from HDFSBackedStateStoreProvider" in the link above,
> multiple versions share the same UnsafeRow unless there's a change in the
> value, which lessens the impact of caching multiple versions (credit to
> Jose Torres, since I realized it from his comment). But in workloads where
> lots of writes to state occur in a batch, the overall memory usage of state
> can grow well beyond expectations.
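
A small sketch of the sharing behavior described above, under a simplified
model (the real HDFSBackedStateStoreProvider keeps per-version maps of
UnsafeRows; this only illustrates the structural sharing): each committed
version is a snapshot of an immutable map, so entries that did not change are
shared across versions, and the memory cost per batch is driven by the number
of updated rows rather than the full state size.

    // One immutable snapshot per committed batch version.
    var versions: Vector[Map[String, Array[Byte]]] = Vector(Map.empty)

    def commitBatch(updates: Map[String, Array[Byte]]): Unit = {
      // ++ builds a new map; entries that did not change are shared
      // with the previous snapshot instead of being copied.
      versions = versions :+ (versions.last ++ updates)
    }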
>
> A related patch [3] was also submitted by another contributor (so I'm not
> the only one to notice this behavior), though that patch may not be
> generalized enough to apply.
>
> First, I decided to track the overall memory size of the state provider
> cache and expose it in the UI as well as in the query status (SPARK-24441
> [4]). The metric looked critical and worth monitoring, so I thought it
> better to also expose it (and the watermark) to Dropwizard (SPARK-24637 [5]).
>
> Building on the adoption of SPARK-24441, I found a more flexible way to
> resolve the issue (SPARK-24717) that I mentioned in the tl;dr.
>
> So 3 of the 5 issues are coupled, tracking and resolving one underlying
> problem. I hope this helps explain why these patches are worth reviewing.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24717
> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
> 3. https://github.com/apache/spark/pull/21500
> 4. https://issues.apache.org/jira/browse/SPARK-24441
> 5. https://issues.apache.org/jira/browse/SPARK-24637
>
> P.S. Before all the issues mentioned above, I also submitted some other
> issues regarding feature additions/refactoring (the other 2 of the 5 issues).
>
>
> On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim wrote:
>
>> Bump. I have been having a hard time making additional PRs, since some of
>> them rely on not-yet-merged PRs, so I'm spending additional time decoupling
>> these things where possible.
>>
>> https://github.com/apache/spark/pulls/HeartSaVioR
>>
>> Pending 5 PRs so far, and may add more sooner or later.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim wrote:
>>
>>> A kind reminder, since around 2 weeks have passed. I've added more PRs
>>> during those 2 weeks and am planning to add even more.
>>>
>>> On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim wrote:
>>>
 Hi Spark devs,

 I have a couple of pull requests for structured streaming which are
 getting older and fading out of the earlier pages of the PR list.

 https://github.com/apache/spark/pull/21469
 https://github.com/apache/spark/pull/21357
 https://github.com/apache/spark/pull/21222

 Two of them have a kind of approval from a couple of folks, but no
 approval from committers yet.
 One of them needs a rebase, and I would be happy to do that once it is
 reviewed or while review is in progress.

 Getting reviewed in time is, to be honest, critical for contributors,
 so I'd like to ask the dev mailing list to review my PRs.

 Thanks in advance,
 Jungtaek Lim (HeartSaVioR)

>>>