[Streaming][Kinesis][SPARK-20168] Could I get some reviews of the patch that resolves kinesis timestamp resume

2018-07-05 Thread Yash Sharma
Hi Team, could I get some review of the patch here? I would love to hear suggestions on it. I had to reopen SPARK-20168 because of this bug. https://github.com/apache/spark/pull/21541 https://issues.apache.org/jira/browse/SPARK-20168 Cheers, Yash

Structured Streaming with S3 file source duplicates data because of eventual consistency

2018-01-11 Thread Yash Sharma
Hi Team, I have been using Structured Streaming with the S3 data source but I am seeing it duplicate the data intermittently. A new run seems to fix it, but the duplication happens ~10% of the time, and the ratio increases with the number of files in the source. Investigating more, I see this is clearly an
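
For context, a minimal sketch of the kind of pipeline described above (bucket names, paths, and schema are illustrative, not from the original report). The file source lists the input prefix on each trigger, which is where S3's eventually consistent listing could surface duplicates:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder.appName("s3-file-source").getOrCreate()

// Streaming file sources require an explicit schema.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val events = spark.readStream
  .schema(schema)
  .json("s3://my-bucket/incoming/")   // hypothetical input prefix

events.writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
  .start("s3://my-bucket/output/events/")
  .awaitTermination()
{code}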

[kinesis][streaming] Could I request a review on this PR

2017-12-11 Thread Yash Sharma
Hi All, could I request a review on this patch on Spark-Kinesis streaming? It has been sitting there for a few months looking for some love. Please help. The patch proposes resuming Kinesis data from a specified timestamp, similar to Kafka, and improves Kinesis crash recovery, avoiding scanning ok
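
A sketch of the API shape the patch adds on top of the builder-style DStream API in spark-streaming-kinesis-asl (stream name, region, and timestamp below are illustrative):

{code}
import java.util.Date

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val ssc = new StreamingContext(
  new SparkConf().setAppName("kinesis-timestamp-resume"), Seconds(10))

// Resume from a caller-supplied timestamp instead of LATEST/TRIM_HORIZON,
// mirroring Kafka's start-from-timestamp behaviour.
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("my-stream")                                   // hypothetical
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .initialPosition(new KinesisInitialPositions.AtTimestamp(new Date(1528848000000L)))
  .checkpointAppName("kinesis-timestamp-resume")
  .checkpointInterval(Seconds(10))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
{code}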

[spark-kinesis] [SPARK-20168] Requesting some attention for a review

2017-11-14 Thread Yash Sharma
Hi Team, could I please draw some attention to the pull request on Spark-Kinesis operability? We have iterated over the patch for the past few months, and it would be great to have a final review of it. I think it's very close now. I would love to work on improvements, if any. This patch

[Streaming] Requesting more Committers for Spark-Kinesis integration

2017-09-28 Thread Yash Sharma
Hi fellow Spark developers / PMC members, I am a new member of the community and have started my tiny contributions to the Spark-Kinesis integration. I am trying to fill in the gaps in making Spark operate with Kinesis as nicely as Kafka. I am writing this mail to highlight an issue with the kinesis

[Spark][Kinesis] Could I get some committer review on the pull request

2017-09-05 Thread Yash Sharma
Hi All, I've been working on a pull request [1] to allow Spark to read from a specific timestamp in Kinesis. I have iterated on the patch with the help of other contributors and we think that it's in a good state now. This patch would save hours of crash-recovery time for Spark while reading off

Spark reading parquet files behaved differently with number of paths

2017-04-27 Thread Yash Sharma
Hi Fellow Devs, I have noticed the Spark parquet reader behaves very differently over the same data set in two scenarios: 1. passing a single parent path to the data, vs. 2. passing all the files individually to parquet(paths: String*). The path has about ~50K files. The first option
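
The two invocation styles being compared, as a sketch (the bucket layout and the file-listing helper are hypothetical; with ~50K files the explicit-list form pushes every leaf file through the reader's path-resolution logic):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("parquet-paths").getOrCreate()

// 1. Single parent path: Spark lists and discovers the files itself.
val byParent = spark.read.parquet("s3://bucket/dataset/")

// 2. Every leaf file passed explicitly to the varargs overload.
//    listLeafFiles stands in for a real S3 listing of ~50K files.
def listLeafFiles(root: String): Seq[String] = Seq.empty  // placeholder
val allFiles: Seq[String] = listLeafFiles("s3://bucket/dataset/")
val byFiles = spark.read.parquet(allFiles: _*)
{code}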

[DStream][Kinesis] Requesting review for spark-kinesis retries

2017-04-18 Thread Yash Sharma
Hi Fellow Devs, please share your thoughts on the pull request that allows Spark to retry more gracefully with Kinesis streaming. The patch removes simple hard-coded values in the code and allows users to pass the values in config. This will help users cope with Kinesis throttling errors and
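
A sketch of how such settings might be supplied (the exact config key names below are my recollection of what the patch proposes and may differ; treat them as placeholders):

{code}
import org.apache.spark.SparkConf

// Illustrative keys for the retry knobs the patch makes configurable:
val conf = new SparkConf()
  .setAppName("kinesis-retries")
  .set("spark.streaming.kinesis.retry.waitTime", "500ms")   // assumed key name
  .set("spark.streaming.kinesis.retry.maxAttempts", "5")    // assumed key name
{code}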

Re: [Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-08 Thread Yash Sharma
> You're probably interested in the S3PartitionedOutputCommitter. rb. On Thu, Apr 6, 2017 at 10:08 PM, Yash Sharma <yash...@gmail.com> wrote: > Hi All, This is another issue that I was facing with the Spark - S3 operability and wanted to ask the bro

[Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-06 Thread Yash Sharma
Hi All, this is another issue that I was facing with the Spark - S3 operability and wanted to ask the broader community if it's faced by anyone else. I have a rather simple aggregation query with a basic transformation. The output, however, has a lot of output partitions (20K partitions). The spark
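
One commonly suggested mitigation for slow commits of many partitions through _temporary, as a sketch (this only reduces the final rename cost; it does not make S3 commits atomic, and the S3-specific committer mentioned in the reply above is the more complete answer):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("s3-commit-v2")
  // Algorithm v2 moves task output into place at task-commit time,
  // avoiding the single-threaded job-commit rename of everything
  // under _temporary (at the cost of weaker failure semantics).
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
{code}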

[Streaming][Kinesis] Please review the kinesis-spark hard codings pull request

2017-04-06 Thread Yash Sharma
Hi fellow Spark Devs, if anyone here has some experience in Spark Kinesis streaming, would it be possible to provide your thoughts on this pull request [1]? Some info: the patch removes two important hard-coded values for Kinesis retries and will make Kinesis recovery from crashes more reliable.

Spark - Kinesis integration needs improvements

2017-03-30 Thread Yash Sharma
Hello fellow Spark devs, hope you are doing fabulous. Dropping a brain dump here about the Spark Kinesis integration. I am able to get Spark Kinesis to work perfectly under ideal conditions, but see a lot of open ends when things are not so ideal. I feel there are a lot of open ends and are

Re: subscribe to spark dev list

2017-03-21 Thread Yash Sharma
Sorry for the spam, used the wrong email address. On Wed, 22 Mar 2017 at 12:01 Yash Sharma <yash...@gmail.com> wrote: > subscribe to spark dev list

subscribe to spark dev list

2017-03-21 Thread Yash Sharma
subscribe to spark dev list

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread Yash Sharma
...too many small files you are trying to read? The number of executors is very high. On 24 Sep 2016 10:28, "Yash Sharma" <yash...@gmail.com> wrote: >> Have been playing around with configs to crack this. Adding them here where it would be helpful to others :)

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
...n them reasonable memory. This can be around 48, assuming 12 nodes x 4 cores each. You could start with processing a subset of your data and see if you are able to get decent performance, then gradually increase the maximum # of execs for dynamic allocation and process the remaining

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
...:27 AM, Yash Sharma <yash...@gmail.com> wrote: > Have been playing around with configs to crack this. Adding them here where it would be helpful to others :) Number of executors and timeout seemed like the core issue. {code} --driver-memory 4G \ --conf spar

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
...with a fixed number of executors and try. Maybe 12 executors for testing, and let know the status. Get Outlook for Android <https://aka.ms/ghei36> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com> wrote: > Than

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
...number of executors you are allocating. The logs show it as 168510, which is on the very high side. Try reducing your executors. On Friday 23 September 2016 12:34 PM, Yash Sharma wrote: >> Hi All, I have a spark job which runs over a huge bulk of data with Dynamic a

Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Hi All, I have a spark job which runs over a huge bulk of data with dynamic allocation enabled. The job takes some 15 minutes to start up and fails as soon as it starts. Is there anything I can check to debug this problem? There is not a lot of information in the logs for the exact cause, but here is
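
The follow-ups in this thread converge on capping dynamic allocation rather than letting the driver request executors one-per-task. A sketch of the kind of settings discussed (values are illustrative, taken from the "12 nodes x 4 cores" suggestion quoted above):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Cap executor requests so the driver cannot ask for 168510 of them.
  .set("spark.dynamicAllocation.maxExecutors", "48")
  // A generous network timeout helps during slow, large start-ups.
  .set("spark.network.timeout", "600s")
{code}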

Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior ?

2016-07-06 Thread Yash Sharma
Hi All, while writing a partitioned data frame as partitioned text files I see that Spark deletes all available partitions while writing a few new partitions. dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/") Is this an expected behavior
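
For what it's worth, this is the documented SaveMode.Overwrite behaviour at the path level; later Spark releases (2.3+) added a setting to overwrite only the partitions present in the incoming DataFrame. A sketch, reusing the spark and dataDF names from the mail:

{code}
import org.apache.spark.sql.SaveMode

// Overwrite only the partitions that appear in dataDF, instead of
// truncating everything under the target path first (Spark 2.3+):
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Overwrite)
  .text("s3://data/test2/events/")
{code}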

Re: Quick question on spark performance

2016-05-20 Thread Yash Sharma
...of around 400 Megs gz files. The workload is a scan/filter/reduceBy which needs to scan the entire data. On Sat, May 21, 2016 at 11:07 AM, Yash Sharma <yash...@gmail.com> wrote: > The median GC time is 1.3 mins for a median duration of 41 mins. What parameters can I tune for co

Re: Quick question on spark performance

2016-05-20 Thread Yash Sharma
ynold Xin" <r...@databricks.com> wrote: > It's probably due to GC. > > On Fri, May 20, 2016 at 5:54 PM, Yash Sharma <yash...@gmail.com> wrote: > >> Hi All, >> I am here to get some expert advice on a use case I am working on. >> >> Cluster &am

Quick question on spark performance

2016-05-20 Thread Yash Sharma
Hi All, I am here to get some expert advice on a use case I am working on. Cluster & job details below - Data: 6 TB. Cluster: EMR, 15 nodes, c3.8xlarge (shared by other MR apps). Parameters: --executor-memory 10G \ --executor-cores 6 \ --conf spark.dynamicAllocation.enabled=true \ --conf
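
Given the pointer to GC in the reply above, a sketch of GC-oriented additions to the submit parameters (purely illustrative values; the right settings depend on the heap profile):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "10g")
  .set("spark.executor.cores", "6")
  // G1 often behaves better than the default collector on 10G+ heaps;
  // the GC logging flags make the 1.3-min median GC pauses inspectable.
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
{code}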

Re: Spark Sql on large number of files (~500Megs each) fails after couple of hours

2016-04-10 Thread Yash Sharma
...executor (YARN container) log? Most of the time it shows more details. We are using CDH; the log is at: [yucai@sr483 container_1457699919227_0094_01_14]$ pwd /mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_14

Spark Sql on large number of files (~500Megs each) fails after couple of hours

2016-04-10 Thread Yash Sharma
Hi All, I am trying Spark SQL on a dataset of ~16 TB with a large number of files (~50K). Each file is roughly 400-500 Megs. I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys and joins) and the job is very, very slow. It runs for 7-8 hrs and processes about 80-100

Re: Spark not able to fetch events from Amazon Kinesis

2016-02-22 Thread Yash Sharma
/*unionStreams.foreachRDD { // Doesn't Work !!
  rdd =>
    println(rdd.count)
    println("rdd isempty:" + rdd.isEmpty)
}*/
unionStreams.foreachRDD ((rdd: RDD[Array[Byte]], time: Time) => { // Works, Yeah !!
  println(rdd.count)
  println("rdd isempty:" + rdd.isEmpty)
})

Spark not able to fetch events from Amazon Kinesis

2016-01-30 Thread Yash Sharma
Hi All, I have a quick question if anyone has experienced this here. I have been trying to get Spark to read events from Kinesis recently but am having problems receiving the events. While Spark is able to connect to Kinesis and is able to get metadata from Kinesis, it's not able to get events from
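
The era-appropriate receiver setup, sketched for context (stream name, region, and shard count are illustrative; this builds the unionStreams value that the code in the reply above iterates over):

{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("kinesis-read"), Seconds(10))

// One receiver per shard, unioned into a single DStream[Array[Byte]].
val numShards = 2
val streams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "kinesis-read-app", "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.TRIM_HORIZON, Seconds(10),
    StorageLevel.MEMORY_AND_DISK_2)
}
val unionStreams = ssc.union(streams)
{code}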

Re: Spark not able to fetch events from Amazon Kinesis

2016-01-30 Thread Yash Sharma
...hih...@gmail.com> wrote: > w.r.t. the protobuf-java version mismatch, I wonder if you can rebuild Spark with the following change (using maven): http://pastebin.com/fVQAYWHM Cheers. On Sat, Jan 30, 2016 at 12:49 AM, Yash Sharma <yash...@gmail.com> wrote:

Re: Spark not able to fetch events from Amazon Kinesis

2016-01-30 Thread Yash Sharma
...due to version incompatibilities, either due to protobuf or jackson. That may be your culprit. The problem is that all failures by the Kinesis Client Lib are silent, therefore they don't show up in the logs. It's very hard to debug those buggers. Best, Burak. On Sat, Jan