Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Ruijing Li
…code or data, but hard to say without knowing more. The lineage is fine and deterministic, but your data or operations might not be. > On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li wrote: > Hi all, …
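
To illustrate the point about non-deterministic operations, here is a minimal sketch (names and numbers are illustrative, not from the thread) of how a transformation built on rand() can keep a different subset of rows when Spark recomputes a lost partition or retries a task — one common way retries produce missing or duplicate records:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object NonDeterministicRetrySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("retry-demo").getOrCreate()
    import spark.implicits._

    val base = spark.range(0, 1000000)

    // rand() is re-evaluated on every computation; if a task is retried or a
    // stage is recomputed, the filter can keep a different subset of rows, so
    // a downstream write may end up with missing or duplicated records.
    val sampled = base.filter(rand() < 0.5)

    // A deterministic alternative derives the decision from the data itself,
    // so recomputation always yields the same rows.
    val deterministic = base.filter($"id" % 2 === 0)

    println(s"sampled=${sampled.count()}, deterministic=${deterministic.count()}")
    spark.stop()
  }
}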

Missing / Duplicate Data when Spark retries

2020-09-09 Thread Ruijing Li
…happen; I don't have non-deterministic data, though. Has anyone encountered something similar, or have an inkling? Thanks! -- Cheers, Ruijing Li

Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting

2020-05-06 Thread Ruijing Li
…is some integer value representing the task ID that was launched on that executor. In case you're running this in local mode, that thread would be located in the same Java thread dump that you have already collected. > On Tue, Apr 21, 2020 at 9:51 PM Ruijing Li wrote: …

Re: Good idea to do multi-threading in spark job?

2020-05-06 Thread Ruijing Li
…our code in one JVM, and whatever synchronization that implies. > On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote: > Hi all, We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we use semaphores / parallel collections within our spark j…

Good idea to do multi-threading in spark job?

2020-05-03 Thread Ruijing Li
…about any deadlocks, and whether it could mess with the fixes for issues such as https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961 We do run with multiple cores. Thanks! -- Cheers, Ruijing Li
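
As background for readers skimming the archive, the pattern under discussion looks roughly like the sketch below (this is not the poster's code; the paths and pool size are assumptions): a Scala parallel collection on the driver submits several independent Spark actions concurrently, with a ForkJoinPool bounding how many run at once. The scala.concurrent.forkjoin import matches the Scala 2.11 line mentioned in the thread; on newer Scala versions the java.util.concurrent pool is used instead. Everything still shares one driver JVM and one scheduler, which is the synchronization caveat raised in the reply.

import org.apache.spark.sql.SparkSession
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// A minimal sketch of driver-side parallelism over independent outputs.
object ParallelWritesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-writes").getOrCreate()

    val inputPaths = Seq("/data/in/a", "/data/in/b", "/data/in/c").par
    // Bound how many Spark jobs are submitted at once from the driver.
    inputPaths.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))

    inputPaths.foreach { path =>
      // Each iteration submits its own Spark job; the scheduler interleaves tasks.
      spark.read.parquet(path)
        .write.mode("overwrite")
        .parquet(path.replace("/in/", "/out/"))
    }
    spark.stop()
  }
}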

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-22 Thread Ruijing Li
For some reason, after restarting the app and trying again, latest now works as expected. Not sure why it didn't work before. On Tue, Apr 21, 2020 at 1:46 PM Ruijing Li wrote: > Yes, we did. But for some reason latest does not show them. The count is always 0. > On Sun, Apr 19, …

Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting

2020-04-21 Thread Ruijing Li
…entries in the dump, then why not share the thread dump? (I mean, the output of jstack.) A stack trace would be more helpful to find which thing acquired the lock and which other things are waiting to acquire it, if we suspect a deadlock. > On Wed, Apr 22, 2020 at 2:38 AM Ruijing …

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-21 Thread Ruijing Li
…On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote: > Hi all, Apologies if this has been asked before, but I could not find the answer to this question. We have a structured streaming job, but for some reason, if we use startingOffsets = latest…

Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting

2020-04-21 Thread Ruijing Li
…waiting. Thanks. On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote: > Strangely enough, I found an old issue that is exactly the same as mine: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 However, I'm using spark 2.4.4, so the issue should have been s…

Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting

2020-04-21 Thread Ruijing Li
After refreshing a couple of times, I notice the lock is being swapped between these three threads. The other two will be blocked by whoever gets the lock, in a cycle: 160 has the lock -> 161 -> 159 -> 160. On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li wrote: > In the thread dump, I do see this …

Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting

2020-04-21 Thread Ruijing Li
…so maybe doing it manually would be the only option. Not sure the Spark UI will provide the same; I haven't used it at all.) It will tell which thread is being blocked (even if it's shown as running) and which point to look at. > On Thu, Apr…

Re: Understanding spark structured streaming checkpointing system

2020-04-19 Thread Ruijing Li
…Jungtaek Lim wrote: > That sounds odd. Is it intermittent, or always reproducible if you start with the same checkpoint? What's the version of Spark? > On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote: > Hi all, I have a question on how structured streaming do…

Using startingOffsets latest - no data from structured streaming kafka query

2020-04-16 Thread Ruijing Li
…“Fetcher [Consumer] Resetting offset for partition to offset” over and over again. However, with startingOffsets=earliest, we don't get this issue. I'm wondering, then, how we can use startingOffsets=latest, as I wish to start from the latest offset available. -- Cheers, Ruijing Li
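
For reference, a minimal sketch of the source definition in question is below (broker, topic, and paths are placeholders). One documented behaviour worth keeping in mind: startingOffsets is only honoured the first time a query starts against an empty checkpoint; on later restarts the query resumes from the checkpointed offsets instead.

import org.apache.spark.sql.SparkSession

object LatestOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-latest").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      // Only applies when the query starts with an empty checkpoint.
      .option("startingOffsets", "latest")
      .load()

    val query = stream
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "/data/out/events")
      .option("checkpointLocation", "/checkpoints/kafka-latest")
      .start()

    query.awaitTermination()
  }
}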

Understanding spark structured streaming checkpointing system

2020-04-16 Thread Ruijing Li
…and restarting it, I see it instead reads from offset file 9, which contains {1:1000}. Can someone explain why Spark doesn't take the max offset? Thanks. -- Cheers, Ruijing Li
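
For anyone debugging a similar restart, a hedged sketch for inspecting the checkpoint is below (the checkpoint path is a placeholder). The offsets directory is the write-ahead log of batches Spark planned, while commits records batches that finished successfully; comparing the two shows which batch a restart will re-run.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object InspectCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-checkpoint").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    val checkpoint = "/checkpoints/my-query"

    // offsets = planned micro-batches; commits = completed micro-batches.
    Seq("offsets", "commits").foreach { dir =>
      val files = fs.listStatus(new Path(s"$checkpoint/$dir"))
        .map(_.getPath.getName)
        .sorted
      println(s"$dir: ${files.mkString(", ")}")
    }
    spark.stop()
  }
}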

Re: Spark structured streaming - Fallback to earliest offset

2020-04-16 Thread Ruijing Li
…you'll need to do the former), but if you can't make sure, and if you understand the risk, then yes, you can turn off the option and take the risk. > On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li wrote: > I see, I wasn't sure if that would work as expected. The docs seem…

Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting

2020-04-16 Thread Ruijing Li
…S NULL --AND a.sid NOT IN (select sid from v$mystat where rownum = 1) AND a.sid = b.sid AND a.username is not null --AND (a.last_call_et < 3600 or a.status = 'ACTIVE') --AND CURRENT_DATE - logon_time > 0 --AND a.sid NOT IN (select sid fro…

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
…stream? > On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > Hi all, I have a spark structured streaming app that is consuming from a kafka topic with retention set up. Sometimes I face an issue where my query has not finished processing a mess…

Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
cannot set that. How do I do this for structured streaming? Thanks! -- Cheers, Ruijing Li
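
The option discussed in the replies is set on the Kafka source; a minimal sketch follows (broker, topic, and paths are placeholders). With failOnDataLoss=false the query keeps running when its checkpointed offsets have already aged out of the topic's retention, at the cost of silently skipping the expired records.

import org.apache.spark.sql.SparkSession

object FailOnDataLossSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("retention-fallback").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      // Do not kill the query if checkpointed offsets were deleted by retention;
      // the source logs a warning and continues past the missing range.
      .option("failOnDataLoss", "false")
      .load()

    stream.writeStream
      .format("parquet")
      .option("path", "/data/out/events")
      .option("checkpointLocation", "/checkpoints/retention-fallback")
      .start()
      .awaitTermination()
  }
}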

Spark hangs while reading from jdbc - does nothing

2020-04-10 Thread Ruijing Li
but sometimes it stops at 29 completed stages and doesn’t start the last stage. The spark job is idling and there is no pending or active task. What could be the problem? Thanks. -- Cheers, Ruijing Li
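
Since the thread is about a JDBC read whose last stage never finishes, a hedged sketch of the options commonly tuned for such reads is below; the URL, table, column names, and bounds are placeholders, not details from the thread. Splitting the read across partitions and bounding the query time are the usual first steps when a single long-running database query appears to stall the job.

import org.apache.spark.sql.SparkSession

object JdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
      .option("dbtable", "SCHEMA.SOME_TABLE")
      .option("user", "user")
      .option("password", "secret")
      // Split the read into parallel queries instead of one long-running one.
      .option("partitionColumn", "ID")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "20")
      // Rows fetched per round trip; driver defaults are often very small.
      .option("fetchsize", "10000")
      // Fail instead of waiting forever if the database stops responding (Spark 2.4+).
      .option("queryTimeout", "3600")
      .load()

    df.write.mode("overwrite").parquet("/data/out/some_table")
    spark.stop()
  }
}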

Re: Can you view thread dumps on spark UI if job finished

2020-04-08 Thread Ruijing Li
…more information on how to use this tool in the Spark documentation: https://spark.apache.org/docs/latest/monitoring.html > On Wed, 8 Apr 2020, 23:47 Ruijing Li wrote: > Hi all, As stated in the title, currently when I view the spark UI of a comp…

Can you view thread dumps on spark UI if job finished

2020-04-08 Thread Ruijing Li
Hi all, As stated in the title, when I view the Spark UI of a completed Spark job, I see there are thread dump links in the executor tab, but clicking on them does nothing. Is it possible to see the thread dumps somehow even after the job finishes? On Spark 2.4.5. Thanks. -- Cheers, Ruijing Li
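
As a point of reference for the monitoring-docs suggestion in the reply, a minimal sketch of enabling event logging for the history server is below (the log directory is a placeholder). Note that thread dumps are requested live from running executors, so they are not something the history server can reconstruct once the executors have exited; capturing jstack output while the job is still running is the usual workaround.

import org.apache.spark.sql.SparkSession

object EventLogConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("event-log-demo")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "hdfs:///spark-event-logs")
      // The history server reads the same directory via
      // spark.history.fs.logDirectory in its own configuration.
      .getOrCreate()

    spark.range(0, 100).count()
    spark.stop()
  }
}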

Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Ruijing Li
> mich.talebza...@gmail.com wrote: > Hi all, Do you think we can create a global solution in the cloud using volunteers like us and third-party employees? What I have in mind is to create a comprehensive real-time solution to get data from various countries and universities, pushed into a fast database through Kafka and Spark and used downstream for greater analytics. I am sure the likes of Google etc. will provide free storage, and likely many vendors will grab the opportunity. We can then donate this to WHO or others, and we can make it very modular through microservices etc. I hope this does not sound futuristic. Regards, Dr Mich Talebzadeh -- Cheers, Ruijing Li

ForEachBatch collecting batch to driver

2020-03-10 Thread Ruijing Li
into the driver? -- Cheers, Ruijing Li
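
On the question in the subject, a hedged sketch of the usual foreachBatch shape is below (broker, topic, and paths are placeholders): batchDF arrives as a distributed DataFrame, and writing it with DataFrame APIs keeps the work on the executors; only an explicit collect() would pull the batch into the driver.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object ForeachBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreachbatch-demo").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      .load()

    stream.writeStream
      .trigger(Trigger.Once())
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // batchDF is still a distributed DataFrame; this write runs on executors.
        // Calling batchDF.collect() here is what would pull data to the driver.
        batchDF.write.mode("append").parquet("/data/out/batches")
      }
      .option("checkpointLocation", "/checkpoints/foreachbatch-demo")
      .start()
      .awaitTermination()
  }
}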

Re: Schema store for Parquet

2020-03-09 Thread Ruijing Li
Thanks Magnus, I'll explore Atlas and see what I can find. On Wed, Mar 4, 2020 at 11:10 AM Magnus Nilsson wrote: > Apache Atlas is the Apache data catalog. You may want to look into that. It depends on what your use case is. > On Wed, Mar 4, 2020 at 8:01 PM Ruijing Li wrote: …

Re: Schema store for Parquet

2020-03-04 Thread Ruijing Li
…Magnus Nilsson wrote: > Google "hive metastore". > On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: > Hi all, Has anyone explored efforts to have a centralized storage of schemas of different parquet files? I k…
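
In case it helps later readers, a hedged sketch of the metastore suggestion is below (database, table, and path names are placeholders, and it assumes a build with Hive support available): registering parquet data as a table records its schema centrally, so other jobs can look it up without re-reading the files.

import org.apache.spark.sql.SparkSession

object MetastoreSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metastore-schema")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

    val df = spark.read.parquet("/data/landing/events")

    // Registering the data as a table stores its schema in the metastore.
    df.write.mode("overwrite").saveAsTable("analytics.events")

    // Any other job can now discover the schema centrally.
    spark.table("analytics.events").printSchema()
    spark.catalog.listColumns("analytics", "events").show(truncate = false)
    spark.stop()
  }
}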

Re: Integration testing Framework Spark SQL Scala

2020-02-25 Thread Ruijing Li
Just wanted to follow up on this. If anyone has any advice, I’d be interested in learning more! On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li wrote: > Hi all, > > I’m interested in hearing the community’s thoughts on best practices to do > integration testing for spark sql jobs.

Integration testing Framework Spark SQL Scala

2020-02-20 Thread Ruijing Li
…a SparkSession locally or testing with spark-shell. Ideally, we'd like some sort of docker container emulating HDFS and Spark cluster mode that you can run locally. Any test frameworks, tips, or examples people can share? Thanks! -- Cheers, Ruijing Li
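
For completeness, here is a minimal local-mode test sketch (assuming ScalaTest 3.1+ on the classpath; the class and logic are illustrative). It covers the "spin up a SparkSession locally" baseline the poster mentions, but it does not emulate HDFS or cluster mode, which is the gap the thread is asking about.

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class WordCountSpec extends AnyFunSuite {

  // A local SparkSession shared by the tests in this suite.
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("integration-test-sketch")
    .getOrCreate()

  test("job logic produces expected aggregates") {
    import spark.implicits._

    val input = Seq("a", "b", "a").toDF("word")
    val counts = input.groupBy("word").count()

    val result = counts.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
    assert(result == Map("a" -> 2L, "b" -> 1L))
  }
}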

Re: Better way to debug serializable issues

2020-02-20 Thread Ruijing Li
…-Dsun.io.serialization.extendedDebugInfo=true > Maxim Gekk, Software Engineer, Databricks, Inc. > On Tue, Feb 18, 2020 at 1:02 PM Ruijing Li wrote: > Hi all, When working with spark jobs, I sometimes have to tackle seri…
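
For readers who want to apply the suggestion above, a minimal sketch of where the flag goes is below. Note that driver JVM options only take effect when supplied at launch (spark-submit --conf or spark-defaults.conf), since the driver JVM is already running by the time a builder config is applied; the builder form is shown mainly to name the keys.

import org.apache.spark.sql.SparkSession

object SerializationDebugSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialization-debug")
      // For the driver, pass this at submit time instead of in code.
      .config("spark.driver.extraJavaOptions",
        "-Dsun.io.serialization.extendedDebugInfo=true")
      .config("spark.executor.extraJavaOptions",
        "-Dsun.io.serialization.extendedDebugInfo=true")
      .getOrCreate()

    // With the flag on, a java.io.NotSerializableException includes the full
    // object graph that dragged the offending class into the closure.
    spark.stop()
  }
}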

Better way to debug serializable issues

2020-02-18 Thread Ruijing Li
generic classes or the class Spark is running itself). Thanks! -- Cheers, Ruijing Li

Re: Best way to read batch from Kafka and Offsets

2020-02-15 Thread Ruijing Li
…for your help! On Wed, Feb 5, 2020 at 7:07 PM Ruijing Li wrote: > Looks like I'm wrong, since I tried that exact snippet and it worked. So to be clear, in the part where I do batchDF.write.parquet, that is not the exact code I'm using. I'm using a custom write function…

Spark 2.4.4 has bigger memory impact than 2.3?

2020-02-15 Thread Ruijing Li
memory than previous versions of spark? I’d be interested to know if anyone else has this issue. We are on scala 2.11.12 on java 8 -- Cheers, Ruijing Li

Re: Best way to read batch from Kafka and Offsets

2020-02-05 Thread Ruijing Li
…function isn't working correctly. Is batchDF a static dataframe though? Thanks. On Wed, Feb 5, 2020 at 6:13 PM Ruijing Li wrote: > Hi all, I tried with forEachBatch but got an error. Is this expected? The code is: df.writeStream.trigger(Trigger.Once).forEachBatc…

Re: Best way to read batch from Kafka and Offsets

2020-02-05 Thread Ruijing Li
…consistency. What if your job fails as you're committing the offsets at the end, but the data was already stored? Will your getOffsets method return the same offsets? I'd rather not solve problems that other people have solved for me, but ultimately the decision is your…

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Ruijing Li
…ing the data.) Currently, to make it work in batch mode, you need to maintain the state information of the offsets externally. > Thanks, Anil > On Mon, Feb 3, …

Re: Best way to read batch from Kafka and Offsets

2020-02-03 Thread Ruijing Li
…duplicate message with two offsets. The alternative is that you can reprocess the offsets back from where you thought the message was last seen. > Kind regards, Chris > On Mon, 3 Feb 2020, 7:39 pm Ruijing Li wrote: > Hi all, My use case is to read from…

Best way to read batch from Kafka and Offsets

2020-02-03 Thread Ruijing Li
without missing data? Any help would be appreciated. -- Cheers, Ruijing Li
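
A hedged sketch of the batch-mode pattern discussed in the replies is below: offsets are supplied explicitly per partition and must be tracked externally between runs. The broker, topic, partition numbers, and offset values are placeholders; when specific offsets are given as JSON, every partition of the subscribed topic has to be covered, and ending offsets are exclusive.

import org.apache.spark.sql.SparkSession

object KafkaBatchReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-batch-read").getOrCreate()

    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      // Per-partition start (inclusive) and end (exclusive) offsets as JSON.
      .option("startingOffsets", """{"my-topic":{"0":42,"1":100}}""")
      .option("endingOffsets", """{"my-topic":{"0":500,"1":600}}""")
      .load()

    df.selectExpr("CAST(value AS STRING)", "partition", "offset")
      .write.mode("append").parquet("/data/out/my-topic")

    // After a successful write, the job would persist the next starting offsets
    // somewhere durable (a small HDFS file, a table, etc.) - the external state
    // the thread refers to.
    spark.stop()
  }
}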

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Ruijing Li
…to reduce spark.executor.cores in such a job (note the approximate heap calculation noted in the ticket). Another solution is an increased executor heap. Or use the off-heap configuration with Spark 2.4, which will remove the pressure for reads but not for writes. > Regards, Sumedh

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Ruijing Li
…reduce the number of connections? You may have to look at what the executors do when they reach out to the remote cluster. > On Sun, 22 Dec 2019, 8:07 am Ruijing Li wrote: > I managed to make the failing stage work by increasing memoryOverhead to something ridiculous, > 50…

Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Ruijing Li
…ing stage of the multiple cluster write) to prevent Spark's small files problem. We reduce from 4000 partitions to 20. On Sat, Dec 21, 2019 at 11:28 AM Ruijing Li wrote: > Not for the stage that fails; all it does is read and write - the number of tasks is # of cores * # of executor instances…
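
Pulling the two knobs from this thread together, here is a hedged sketch; the overhead size, partition count, and cluster paths are placeholders rather than the poster's values, and memory settings like this are normally supplied at submit time rather than in code.

import org.apache.spark.sql.SparkSession

object CrossClusterWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cross-cluster-write")
      // Off-heap/native headroom per executor, on top of the JVM heap.
      .config("spark.executor.memoryOverhead", "4g")
      .getOrCreate()

    val df = spark.read.parquet("hdfs://cluster-a/data/source")

    // Coalesce before the final write to avoid producing thousands of small
    // files on the remote cluster (the 4000 -> 20 change described above).
    df.coalesce(20)
      .write.mode("overwrite")
      .parquet("hdfs://cluster-b/data/target")

    spark.stop()
  }
}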

Re: Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Ruijing Li
…creates a shuffle? I don't expect a shuffle if it is a straight write. What's the input partition size? > On Sat, 21 Dec 2019, 10:24 am Ruijing Li wrote: > Could you explain why shuffle partitions might be a good starting point? Some more details: wh…

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
…start. > Is there a difference between the number of partitions when the parquet is read and spark.sql.shuffle.partitions? Is it much higher than spark.sql.shuffle.partitions? > On Fri, 20 Dec 2019, 7:34 pm Ruijing Li wrote: > Hi all, I have encoun…

Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
…at. -- Cheers, Ruijing Li