Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Jungtaek Lim
I think Spark is trying to ensure that it reads the input "continuously"
without missing anything. Technically it may be valid to say the situation is a
kind of "data loss", as the query couldn't process the offsets which are
being thrown out, and the owner of the query needs to be careful, as it affects
the result.

If your streaming query keeps up with the input rate then it's pretty rare for
the query to fall behind the retention window. Even if it lags a bit, it'd be
safe if the retention period is set long enough. The ideal state would be to
ensure your query processes all offsets before they are thrown out by retention
(don't leave the query lagging behind - either increase processing power or
increase the retention duration, though most probably you'll need to do the
former), but if you can't make sure of that and you understand the risk, then
yes, you can turn off the option and take the risk.
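
If you do decide to turn it off, a minimal sketch (assuming the built-in Kafka
source of Structured Streaming; broker and topic names below are placeholders)
would be:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
  .option("subscribe", "your-topic")                  // placeholder topic
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load()

With failOnDataLoss=false the query logs a warning and continues from the
earliest offsets still available instead of failing when offsets have been
aged out by retention.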


On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li  wrote:

> I see, I wasn’t sure if that would work as expected. The docs seems to
> suggest to be careful before turning off that option, and I’m not sure why
> failOnDataLoss is true by default.
>
> On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz  wrote:
>
>> Just set `failOnDataLoss=false` as an option in readStream?
>>
>> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li  wrote:
>>
>>> Hi all,
>>>
>>> I have a spark structured streaming app that is consuming from a kafka
>>> topic with retention set up. Sometimes I face an issue where my query has
>>> not finished processing a message but the retention kicks in and deletes
>>> the offset, which since I use the default setting of “failOnDataLoss=true”
>>> causes my query to fail. The solution I currently have is manual, deleting
>>> the offsets directory and rerunning.
>>>
>>> I instead like to have spark automatically fall back to the earliest
>>> offset available. The solutions I saw recommend setting auto.offset =
>>> earliest, but for structured streaming, you cannot set that. How do I do
>>> this for structured streaming?
>>>
>>> Thanks!
>>> --
>>> Cheers,
>>> Ruijing Li
>>>
>> --
> Cheers,
> Ruijing Li
>


Re: Going it alone.

2020-04-14 Thread yeikel valdes
There are many use cases for Spark. A Google search for "use cases for
Apache Spark" will give you all the information that you need.

 On Tue, 14 Apr 2020 18:44:59 -0400 janethor...@aol.com.INVALID wrote 



I did write a long email in response to you.
But then I deleted it because I felt it would be too revealing.







On Tuesday, 14 April 2020 David Hesson  wrote:

I want to know  if Spark is headed in my direction.

You are implying  Spark could be. 


What direction are you headed in, exactly? I don't feel as if anything were 
implied when you were asked for use cases or what problem you are solving. You 
were asked to identify some use cases, of which you don't appear to have any.


On Tue, Apr 14, 2020 at 4:49 PM jane thorpe  wrote:


That's what  I want to know,  Use Cases.
I am looking for  direction as I described and I want to know  if Spark is 
headed in my direction.  

You are implying  Spark could be.

So tell me about the USE CASES and I'll do the rest.

On Tuesday, 14 April 2020 yeikel valdes  wrote:

It depends on your use case. What are you trying to solve? 



 On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote 



Hi,

I consider myself to be quite good in Software Development especially using 
frameworks.

I like to get my hands  dirty. I have spent the last few months understanding 
modern frameworks and architectures.

I am looking to invest my energy in a product where I don't have to relying on 
the monkeys which occupy this space  we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good Tool for me or  do I need to be a member of a team 
to develop  products  using Apache Spark  ?


Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
I see, I wasn’t sure if that would work as expected. The docs seem to
suggest being careful before turning off that option, and I’m not sure why
failOnDataLoss is true by default.

On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz  wrote:

> Just set `failOnDataLoss=false` as an option in readStream?
>
> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li  wrote:
>
>> Hi all,
>>
>> I have a spark structured streaming app that is consuming from a kafka
>> topic with retention set up. Sometimes I face an issue where my query has
>> not finished processing a message but the retention kicks in and deletes
>> the offset, which since I use the default setting of “failOnDataLoss=true”
>> causes my query to fail. The solution I currently have is manual, deleting
>> the offsets directory and rerunning.
>>
>> I instead like to have spark automatically fall back to the earliest
>> offset available. The solutions I saw recommend setting auto.offset =
>> earliest, but for structured streaming, you cannot set that. How do I do
>> this for structured streaming?
>>
>> Thanks!
>> --
>> Cheers,
>> Ruijing Li
>>
> --
Cheers,
Ruijing Li


Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Burak Yavuz
Just set `failOnDataLoss=false` as an option in readStream?

On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li  wrote:

> Hi all,
>
> I have a spark structured streaming app that is consuming from a kafka
> topic with retention set up. Sometimes I face an issue where my query has
> not finished processing a message but the retention kicks in and deletes
> the offset, which since I use the default setting of “failOnDataLoss=true”
> causes my query to fail. The solution I currently have is manual, deleting
> the offsets directory and rerunning.
>
> I instead like to have spark automatically fall back to the earliest
> offset available. The solutions I saw recommend setting auto.offset =
> earliest, but for structured streaming, you cannot set that. How do I do
> this for structured streaming?
>
> Thanks!
> --
> Cheers,
> Ruijing Li
>


Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
Hi all,

I have a Spark Structured Streaming app that is consuming from a Kafka
topic with retention set up. Sometimes I face an issue where my query has
not finished processing a message but the retention kicks in and deletes
the offset, which, since I use the default setting of “failOnDataLoss=true”,
causes my query to fail. The solution I currently have is manual: deleting
the offsets directory and rerunning.

I’d instead like to have Spark automatically fall back to the earliest offset
available. The solutions I saw recommend setting auto.offset.reset = earliest,
but for Structured Streaming, you cannot set that. How do I do this for
Structured Streaming?

Thanks!
-- 
Cheers,
Ruijing Li


Re: Going it alone.

2020-04-14 Thread jane thorpe

I did write a long email in response to you.
But then I deleted it because I felt it would be too revealing. 






On Tuesday, 14 April 2020 David Hesson  wrote:

I want to know  if Spark is headed in my direction.


You are implying  Spark could be. 

What direction are you headed in, exactly? I don't feel as if anything were 
implied when you were asked for use cases or what problem you are solving. You 
were asked to identify some use cases, of which you don't appear to have any.
On Tue, Apr 14, 2020 at 4:49 PM jane thorpe  wrote:


That's what  I want to know,  Use Cases. 
 I am looking for  direction as I described and I want to know  if Spark is 
headed in my direction.   

You are implying  Spark could be.

So tell me about the USE CASES and I'll do the rest.
On Tuesday, 14 April 2020 yeikel valdes  wrote:
It depends on your use case. What are you trying to solve? 

  On Tue, 14 Apr 2020 15:36:50 -0400  janethor...@aol.com.INVALID  wrote 




Hi, 

I consider myself to be quite good in Software Development especially using 
frameworks. 

I like to get my hands  dirty. I have spent the last few months understanding 
modern frameworks and architectures. 

I am looking to invest my energy in a product where I don't have to relying on 
the monkeys which occupy this space  we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good Tool for me or  do I need to be a member of a team 
to develop  products  using Apache Spark  ?


Re: Going it alone.

2020-04-14 Thread David Hesson
>
> I want to know  if Spark is headed in my direction.
>
You are implying  Spark could be.


What direction are you headed in, exactly? I don't feel as if anything were
implied when you were asked for use cases or what problem you are solving.
You were asked to identify some use cases, of which you don't appear to
have any.

On Tue, Apr 14, 2020 at 4:49 PM jane thorpe 
wrote:

> That's what  I want to know,  Use Cases.
> I am looking for  direction as I described and I want to know  if Spark is
> headed in my direction.
>
> You are implying  Spark could be.
>
> So tell me about the USE CASES and I'll do the rest.
> --
> On Tuesday, 14 April 2020 yeikel valdes  wrote:
> It depends on your use case. What are you trying to solve?
>
>
>  On Tue, 14 Apr 2020 15:36:50 -0400 * janethor...@aol.com.INVALID *
> wrote 
>
> Hi,
>
> I consider myself to be quite good in Software Development especially
> using frameworks.
>
> I like to get my hands  dirty. I have spent the last few months
> understanding modern frameworks and architectures.
>
> I am looking to invest my energy in a product where I don't have to
> relying on the monkeys which occupy this space  we call software
> development.
>
> I have found one that meets my requirements.
>
> Would Apache Spark be a good Tool for me or  do I need to be a member of a
> team to develop  products  using Apache Spark  ?
>
>
>
>
>
>


Re: Going it alone.

2020-04-14 Thread jane thorpe

That's what  I want to know,  Use Cases. 
 I am looking for  direction as I described and I want to know  if Spark is 
headed in my direction.   

You are implying  Spark could be.

So tell me about the USE CASES and I'll do the rest.
On Tuesday, 14 April 2020 yeikel valdes  wrote:
It depends on your use case. What are you trying to solve? 

  On Tue, 14 Apr 2020 15:36:50 -0400  janethor...@aol.com.INVALID  wrote 




Hi, 

I consider myself to be quite good in Software Development especially using 
frameworks. 

I like to get my hands  dirty. I have spent the last few months understanding 
modern frameworks and architectures. 

I am looking to invest my energy in a product where I don't have to relying on 
the monkeys which occupy this space  we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good Tool for me or  do I need to be a member of a team 
to develop  products  using Apache Spark  ?


Cross Region Apache Spark Setup

2020-04-14 Thread Stone Zhong
Hi,

I am trying to setup a cross region Apache Spark cluster. All my data are
stored in Amazon S3 and well partitioned by region.

For example, I have parquet file at
S3://mybucket/sales_fact.parquet/us-west
S3://mybucket/sales_fact.parquet/us-east
S3://mybucket/sales_fact.parquet/uk

And my cluster has nodes in the us-west, us-east and uk regions -- basically I
have nodes in every region that I support.

When I have code like:

df = spark.read.parquet("S3://mybucket/sales_fact.parquet/*")
print(df.count()) #1
print(df.select("product_id").distinct().count()) #2

For #1, I expect only the us-west nodes to read the data partition in us-west
(and so on for the other regions), and Spark to add the 3 regional counts and
return me a total count. *I do not expect large cross-region data transfer in
this case.*
For #2, I expect only the us-west nodes to read the data partition in us-west
(and so on for the other regions). Each region would do the distinct() locally
first, then the 3 "product_id" lists would be merged and a distinct() applied
again. I am OK with the necessary cross-region data transfer for merging the
distinct product_ids.
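
To make the intent concrete, here is a rough sketch of the aggregation pattern
I have in mind (it only expresses the per-region aggregation; it does not by
itself force tasks onto nodes in the matching region):

val regions = Seq("us-west", "us-east", "uk")

// #1: count each regional partition separately, then add the three counts.
val totalCount = regions
  .map(r => spark.read.parquet(s"s3://mybucket/sales_fact.parquet/$r").count())
  .sum

// #2: distinct product_ids per region first, then merge the (much smaller)
// regional results and apply distinct again.
val distinctProducts = regions
  .map(r => spark.read.parquet(s"s3://mybucket/sales_fact.parquet/$r")
              .select("product_id").distinct())
  .reduce(_ union _)
  .distinct()
  .count()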

Can anyone please share the best practice? Is it possible to configure
Apache Spark to work in such a way?

Any idea and help is appreciated!

Thanks,
Stone


Re: Going it alone.

2020-04-14 Thread yeikel valdes
It depends on your use case. What are you trying to solve? 



 On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote 



Hi,

I consider myself to be quite good in Software Development especially using 
frameworks.

I like to get my hands  dirty. I have spent the last few months understanding 
modern frameworks and architectures.

I am looking to invest my energy in a product where I don't have to relying on 
the monkeys which occupy this space  we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good Tool for me or  do I need to be a member of a team 
to develop  products  using Apache Spark  ?


Going it alone.

2020-04-14 Thread jane thorpe

Hi, 

I consider myself to be quite good at Software Development, especially using
frameworks.

I like to get my hands dirty. I have spent the last few months understanding
modern frameworks and architectures.

I am looking to invest my energy in a product where I don't have to rely on
the monkeys which occupy this space we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good tool for me, or do I need to be a member of a team
to develop products using Apache Spark?


Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread Yeikel
Looking at the results of explain, I can see a CollectLimit step. Does that
work the same way as a regular .collect() ? (where all records are sent to
the driver?)


spark.sql("select * from db.table limit 100").explain(false)
== Physical Plan ==
CollectLimit 100
+- FileScan parquet ... 806 more fields] Batched: false, Format: Parquet,
Location: CatalogFileIndex[...], PartitionCount: 3, PartitionFilters: [],
PushedFilters: [], ReadSchema:.
db: Unit = ()

The number of partitions is 1 so that makes sense. 

spark.sql("select * from db.table limit 100").rdd.partitions.size = 1

As a follow-up, I tried to repartition the resulting dataframe, and while I
can't see the CollectLimit step anymore, it did not make any difference in
the job. I still saw a big task at the end that ends up failing.

spark.sql("select * from db.table limit
100").repartition(1000).explain(false)

Exchange RoundRobinPartitioning(1000)
+- GlobalLimit 100
   +- Exchange SinglePartition
  +- LocalLimit 100  -> Is this a collect?
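
For comparison, an ordered top-N (using a hypothetical ordering column ts) is
planned as TakeOrderedAndProject, which takes the top N within each partition
and then merges those small results, rather than funnelling every row through
a single partition:

spark.sql("select * from db.table order by ts desc limit 100").explain(false)
// the plan should show a TakeOrderedAndProject node instead of CollectLimit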





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How does spark sql evaluate case statements?

2020-04-14 Thread Yeikel
I do not know the answer to this question, so I am also looking for it, but
@kant, maybe the generated code can help with this.
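
For reference, one way to dump the generated code for a DataFrame df that
contains the case statement (df being a placeholder here) would be:

import org.apache.spark.sql.execution.debug._
// Prints the whole-stage-codegen Java source produced for df's plan,
// including how the CASE WHEN branches are compiled.
df.debugCodegen()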



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is there any way to set the location of the history for the spark-shell per session?

2020-04-14 Thread Yeikel
In my team, we get elevated access to our Spark cluster using a common
username, which means that we all share the same history. I am not sure if
this is common, but unfortunately there is nothing I can do about it.

Is there any option to set the location of the history? I am looking for
something like spark-shell --history=...path or something similar that can
be reused for other sessions.

By default, history seems to be stored in $HOME/.scala_history, and I did
not see anything in the documentation about it, but maybe there is an
undocumented way to do it.

Thanks for your help!



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Question on writing batch synchronized incremental graph algorithms

2020-04-14 Thread Kaan Sancak
Hi all,
I have been trying to write batch-synchronized  incremental graph algorithms. 
More specifically, I want to run an increment algorithm on a given data-set and 
when a new batch arrives, I want to start the algorithm from last snapshot, and 
run the algorithm on the vertices that are effected by the new batch. (Assuming 
that each batch contains a set of insertions/deletions to the graph, effected 
vertices are the vertices whose neighbor set is effected by the 
insertions/deletions).

I found out a paper[1] which mentions about a framework called GraphTau built 
on top of GraphX, and I also found out couple of Spark Summits[2] mention 
future code release. I have looked but couldn’t find any code available to 
public. 
Is there anyone who can inform me about the subject? Or are there any features 
on GraphX currently available for me to write such algorithms.

[1] Time-Evolving Graph Processing at Scale:
https://www.researchgate.net/publication/305661018_Time-evolving_graph_processing_at_scale

[2] Spark Summit 2017:
https://www.slideshare.net/SparkSummit/timeevolving-graph-processing-on-commodity-clusters-spark-summit-east-talk-by-anand-iyer

Best
Kaan

Re: Spark interrupts S3 request backoff

2020-04-14 Thread Gabor Somogyi
+1 on the previous guess and additionally I suggest to reproduce it with
vanilla Spark.
Amazon Spark contains modifications which not available in vanilla Spark
which makes problem hunting hard or impossible.
Such case amazon can help...
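
If you do try vanilla Spark with the s3a connector, its retry/backoff behaviour
is configurable there; a sketch (please double-check the property names and
availability against your hadoop-aws version):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("s3a-backoff")
  .config("spark.hadoop.fs.s3a.attempts.maximum", "20")  // SDK-level retries
  .config("spark.hadoop.fs.s3a.retry.limit", "10")       // S3A-level retries
  .config("spark.hadoop.fs.s3a.retry.interval", "1s")    // base retry interval
  .getOrCreate()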

On Tue, Apr 14, 2020 at 11:20 AM ZHANG Wei  wrote:

> I will make a guess, it's not interruptted, it's killed by the driver or
> the resource manager since the executor fallen into sleep for a long time.
>
> You may have to find the root cause in the driver and failed executor log
> contexts.
>
> --
> Cheers,
> -z
>
> 
> From: Lian Jiang 
> Sent: Monday, April 13, 2020 10:43
> To: user
> Subject: Spark interrupts S3 request backoff
>
> Hi,
>
> My Spark job failed when reading parquet files from S3 due to 503 slow
> down. According to
> https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html,
> I can use backoff to mitigate this issue. However, spark seems to interrupt
> the backoff sleeping (see "sleep interrupted"). Is there a way (e.g. some
> settings) to make spark not interrupt the backoff? Appreciate any hints.
>
>
>
> 20/04/12 20:15:37 WARN TaskSetManager: Lost task 3347.0 in stage 155.0
> (TID 128138, ip-100-101-44-35.us-west-2.compute.internal, executor 34):
> org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to
> download file path:
> s3://mybucket/myprefix/part-00178-d0a0d51f-f98e-4b9d-8d00-bb3b9acd9a47-c000.snappy.parquet,
> range: 0-19231, partition values: [empty row], isDataPresent: false
> at
> org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:248)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Suppressed:
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down;
> Request ID: CECE220993AE7F89; S3 Extended Request ID:
> UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=),
> S3 Extended Request ID:
> UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=
> at
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
> at
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
> at
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
> at
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
> at
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
> at
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
> at
> com.amazon.ws.emr.hadoop.fs.sha

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-14 Thread Gabor Somogyi
The simplest way is to take a thread dump, which doesn't require any fancy tool
(it's available in the Spark UI).
Without a thread dump it's hard to say anything...


On Tue, Apr 14, 2020 at 11:32 AM jane thorpe 
wrote:

> Here a is another tool I use Logic Analyser  7:55
> https://youtu.be/LnzuMJLZRdU
>
> you could take some suggestions for improving performance  queries.
> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
>
>
> Jane thorpe
> janethor...@aol.com
>
>
> -Original Message-
> From: jane thorpe 
> To: janethorpe1 ; mich.talebzadeh <
> mich.talebza...@gmail.com>; liruijing09 ; user <
> user@spark.apache.org>
> Sent: Mon, 13 Apr 2020 8:32
> Subject: Re: Spark hangs while reading from jdbc - does nothing Removing
> Guess work from trouble shooting
>
>
>
> This tool may be useful for you to trouble shoot your problems away.
>
>
> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html
>
>
> "APM tools typically use a waterfall-type view to show the blocking time
> of different components cascading through the control flow within an
> application.
> These types of visualizations are useful, and AppOptics has them, but they
> can be difficult to understand for those of us without a PhD."
>
> Especially  helpful if you want to understand through visualisation and
> you do not have a phD.
>
>
> Jane thorpe
> janethor...@aol.com
>
>
> -Original Message-
> From: jane thorpe 
> To: mich.talebzadeh ; liruijing09 <
> liruijin...@gmail.com>; user 
> CC: user 
> Sent: Sun, 12 Apr 2020 4:35
> Subject: Re: Spark hangs while reading from jdbc - does nothing
>
> You seem to be implying the error is intermittent.
> You seem to be implying data is being ingested  via JDBC. So the
> connection has proven itself to be working unless no data is arriving from
> the  JDBC channel at all.  If no data is arriving then one could say it
> could be  the JDBC.
> If the error is intermittent  then it is likely a resource involved in
> processing is filling to capacity.
> Try reducing the data ingestion volume and see if that completes, then
> increase the data ingested  incrementally.
> I assume you have  run the job on small amount of data so you have
> completed your prototype stage successfully.
>
> --
> On Saturday, 11 April 2020 Mich Talebzadeh 
> wrote:
> Hi,
>
> Have you checked your JDBC connections from Spark to Oracle. What is
> Oracle saying? Is it doing anything or hanging?
>
> set pagesize 
> set linesize 140
> set heading off
> select SUBSTR(name,1,8) || ' sessions as on '||TO_CHAR(CURRENT_DATE, 'MON
> DD  HH:MI AM') from v$database;
> set heading on
> column spid heading "OS PID" format a6
> column process format a13 heading "Client ProcID"
> column username  format a15
> column sid   format 999
> column serial#   format 9
> column STATUSformat a3 HEADING 'ACT'
> column last  format 9,999.99
> column TotGets   format 999,999,999,999 HEADING 'Logical I/O'
> column phyRdsformat 999,999,999 HEADING 'Physical I/O'
> column total_memory format 999,999,999 HEADING 'MEM/KB'
> --
> SELECT
>   substr(a.username,1,15) "LOGIN"
> , substr(a.sid,1,5) || ','||substr(a.serial#,1,5) AS "SID/serial#"
> , TO_CHAR(a.logon_time, 'DD/MM HH:MI') "LOGGED IN SINCE"
> , substr(a.machine,1,10) HOST
> , substr(p.username,1,8)||'/'||substr(p.spid,1,5) "OS PID"
> , substr(a.osuser,1,8)||'/'||substr(a.process,1,5) "Client PID"
> , substr(a.program,1,15) PROGRAM
> --,ROUND((CURRENT_DATE-a.logon_time)*24) AS "Logged/Hours"
> , (
> select round(sum(ss.value)/1024) from v$sesstat ss,
> v$statname sn
> where ss.sid = a.sid and
> sn.statistic# = ss.statistic# and
> -- sn.name in ('session pga memory')
> sn.name in ('session pga memory','session uga
> memory')
>   ) AS total_memory
> , (b.block_gets + b.consistent_gets) TotGets
> , b.physical_reads phyRds
> , decode(a.status, 'ACTIVE', 'Y','INACTIVE', 'N') STATUS
> , CASE WHEN a.sid in (select sid from v$mystat where rownum = 1)
> THEN '<-- YOU' ELSE ' ' END "INFO"
> FROM
>  v$process p
> ,v$session a
> ,v$sess_io b
> WHERE
> a.paddr = p.addr
> AND p.background IS NULL
> --AND  a.sid NOT IN (select sid from v$mystat where rownum = 1)
> AND a.sid = b.sid
> AND a.username is not null
> --AND (a.last_call_et < 3600 or a.status = 'ACTIVE')
> --AND CURRENT_DATE - logon_time > 0
> --AND a.sid NOT IN ( select sid from v$mystat where rownum=1)  -- exclude
> me
> --AND (b.block_gets + b.consistent_gets) > 0
> ORDER BY a.username;
> exit
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 

Re: Spark Streaming not working

2020-04-14 Thread Gerard Maas
Hi,

Could you share the code that you're using to configure the connection to
the Kafka broker?

This is a bread-and-butter feature. My first thought is that there's
something in your particular setup that prevents this from working.

kind regards, Gerard.
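
For reference, the usual shape of that configuration with the kafka-0-10 direct
stream looks roughly like this (broker, group and topic names below are
placeholders inferred from your error message, so adjust accordingly):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// ssc is assumed to be an existing StreamingContext.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",                // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "service-spark-ingestion",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("dice-ingestion"), kafkaParams)
)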

On Fri, Apr 10, 2020 at 7:34 PM Debabrata Ghosh 
wrote:

> Hi,
> I have a spark streaming application where Kafka is producing
> records but unfortunately spark streaming isn't able to consume those.
>
> I am hitting the following error:
>
> 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID 24)
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling for 
> 12
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>
>
> Would you please be able to help with a resolution.
>
> Thanks,
> Debu
>


Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
Sorry, hit send accidentally...

The symptom is simple: the broker is not responding within 120 seconds.
That's the reason why Debabrata was asked for the broker config.

What I can suggest is to check the previous printout which logs the Kafka
consumer settings.
With the mentioned settings you can start a console consumer on the exact
same host where the executor ran...
If that works you can open a Spark jira with driver and executor logs,
otherwise fix the connection issue.

BR,
G


On Tue, Apr 14, 2020 at 1:32 PM Gabor Somogyi 
wrote:

> The symptom is simple, the broker is not responding in 120 seconds.
> That's the reason why Debabrata asked the broker config.
>
> What I can suggest is to check the previous printout which logs the Kafka
> consumer settings.
> With
>
>
> On Tue, Apr 14, 2020 at 11:44 AM ZHANG Wei  wrote:
>
>> Here is the assertion error message format:
>>
>>s"Failed to get records for $groupId $topic $partition $offset after
>> polling for $timeout")
>>
>> You might have to check the kafka service with the error log:
>>
>> > 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0
>> (TID 24)
>> > java.lang.AssertionError: assertion failed: Failed to get records for
>> spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling
>> for 12
>>
>> Cheers,
>> -z
>>
>> 
>> From: Debabrata Ghosh 
>> Sent: Saturday, April 11, 2020 2:25
>> To: user
>> Subject: Re: Spark Streaming not working
>>
>> Any solution please ?
>>
>> On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh > > wrote:
>> Hi,
>> I have a spark streaming application where Kafka is producing
>> records but unfortunately spark streaming isn't able to consume those.
>>
>> I am hitting the following error:
>>
>> 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID
>> 24)
>> java.lang.AssertionError: assertion failed: Failed to get records for
>> spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling
>> for 12
>> at scala.Predef$.assert(Predef.scala:170)
>> at
>> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>> at
>> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
>> at
>> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>>
>> Would you please be able to help with a resolution.
>>
>> Thanks,
>> Debu
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
The symptom is simple, the broker is not responding in 120 seconds.
That's the reason why Debabrata asked the broker config.

What I can suggest is to check the previous printout which logs the Kafka
consumer settings.
With


On Tue, Apr 14, 2020 at 11:44 AM ZHANG Wei  wrote:

> Here is the assertion error message format:
>
>s"Failed to get records for $groupId $topic $partition $offset after
> polling for $timeout")
>
> You might have to check the kafka service with the error log:
>
> > 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0
> (TID 24)
> > java.lang.AssertionError: assertion failed: Failed to get records for
> spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling
> for 12
>
> Cheers,
> -z
>
> 
> From: Debabrata Ghosh 
> Sent: Saturday, April 11, 2020 2:25
> To: user
> Subject: Re: Spark Streaming not working
>
> Any solution please ?
>
> On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh  > wrote:
> Hi,
> I have a spark streaming application where Kafka is producing
> records but unfortunately spark streaming isn't able to consume those.
>
> I am hitting the following error:
>
> 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID
> 24)
> java.lang.AssertionError: assertion failed: Failed to get records for
> spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling
> for 12
> at scala.Predef$.assert(Predef.scala:170)
> at
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
> at
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
> at
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>
> Would you please be able to help with a resolution.
>
> Thanks,
> Debu
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[Spark Core]: Does an executor only cache the partitions it requires for its computations or always the full RDD?

2020-04-14 Thread zwithouta
Provided caching is activated for a RDD, does each executor of a cluster only
cache the partitions it requires for its computations or always the full
RDD?
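
A minimal sketch of how I could check this on a running cluster (toy RDD,
arbitrary partition count):

val rdd = spark.sparkContext.parallelize(1 to 1000000, 100).cache()
rdd.count()  // materializes the cache

// The Storage tab of the Spark UI -- or getRDDStorageInfo -- then shows how
// many partitions ended up cached and how they are spread over the executors.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}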



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming not working

2020-04-14 Thread ZHANG Wei
Here is the assertion error message format:

   s"Failed to get records for $groupId $topic $partition $offset after polling 
for $timeout")

You might have to check the kafka service with the error log:

> 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID 24)
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling for 
> 12
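
If the broker turns out to be healthy but just slow to respond, the poll
timeout used by the spark-streaming-kafka-0-10 integration can also be raised
(a sketch; the value is in milliseconds):

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.kafka.consumer.poll.ms", "300000")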

Cheers,
-z


From: Debabrata Ghosh 
Sent: Saturday, April 11, 2020 2:25
To: user
Subject: Re: Spark Streaming not working

Any solution please ?

On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh 
mailto:mailford...@gmail.com>> wrote:
Hi,
I have a spark streaming application where Kafka is producing records 
but unfortunately spark streaming isn't able to consume those.

I am hitting the following error:

20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID 24)
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling for 
12
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)

Would you please be able to help with a resolution.

Thanks,
Debu

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-14 Thread jane thorpe
Here is another tool I use, Logic Analyser (7:55):
https://youtu.be/LnzuMJLZRdU

You could take some suggestions for improving the performance of queries from:
https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1

Jane thorpe
janethor...@aol.com
 
 
-Original Message-
From: jane thorpe 
To: janethorpe1 ; mich.talebzadeh 
; liruijing09 ; user 

Sent: Mon, 13 Apr 2020 8:32
Subject: Re: Spark hangs while reading from jdbc - does nothing Removing Guess 
work from trouble shooting



This tool may be useful for you to troubleshoot your problems away.

https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html

"APM tools typically use a waterfall-type view to show the blocking time of 
different components cascading through the control flow within an application. 
These types of visualizations are useful, and AppOptics has them, but they can 
be difficult to understand for those of us without a PhD."
Especially helpful if you want to understand through visualisation and you do
not have a PhD.

 
Jane thorpe
janethor...@aol.com
 
 
-Original Message-
From: jane thorpe 
To: mich.talebzadeh ; liruijing09 
; user 
CC: user 
Sent: Sun, 12 Apr 2020 4:35
Subject: Re: Spark hangs while reading from jdbc - does nothing

You seem to be implying the error is intermittent.
You seem to be implying data is being ingested via JDBC. So the connection has
proven itself to be working unless no data is arriving from the JDBC channel at
all. If no data is arriving then one could say it could be the JDBC. If the
error is intermittent then it is likely a resource involved in processing is
filling to capacity. Try reducing the data ingestion volume and see if that
completes, then increase the data ingested incrementally. I assume you have run
the job on a small amount of data, so you have completed your prototype stage
successfully.

On Saturday, 11 April 2020 Mich Talebzadeh  wrote:
Hi,
Have you checked your JDBC connections from Spark to Oracle. What is Oracle 
saying? Is it doing anything or hanging?
set pagesize 
set linesize 140
set heading off
select SUBSTR(name,1,8) || ' sessions as on '||TO_CHAR(CURRENT_DATE, 'MON DD 
 HH:MI AM') from v$database;
set heading on
column spid heading "OS PID" format a6
column process format a13 heading "Client ProcID"
column username  format a15
column sid       format 999
column serial#   format 9
column STATUS    format a3 HEADING 'ACT'
column last      format 9,999.99
column TotGets   format 999,999,999,999 HEADING 'Logical I/O'
column phyRds    format 999,999,999 HEADING 'Physical I/O'
column total_memory format 999,999,999 HEADING 'MEM/KB'
--
SELECT
          substr(a.username,1,15) "LOGIN"
        , substr(a.sid,1,5) || ','||substr(a.serial#,1,5) AS "SID/serial#"
        , TO_CHAR(a.logon_time, 'DD/MM HH:MI') "LOGGED IN SINCE"
        , substr(a.machine,1,10) HOST
        , substr(p.username,1,8)||'/'||substr(p.spid,1,5) "OS PID"
        , substr(a.osuser,1,8)||'/'||substr(a.process,1,5) "Client PID"
        , substr(a.program,1,15) PROGRAM
        --,ROUND((CURRENT_DATE-a.logon_time)*24) AS "Logged/Hours"
        , (
                select round(sum(ss.value)/1024) from v$sesstat ss, v$statname 
sn
                where ss.sid = a.sid and
                        sn.statistic# = ss.statistic# and
                        -- sn.name in ('session pga memory')
                        sn.name in ('session pga memory','session uga memory')
          ) AS total_memory
        , (b.block_gets + b.consistent_gets) TotGets
        , b.physical_reads phyRds
        , decode(a.status, 'ACTIVE', 'Y','INACTIVE', 'N') STATUS
        , CASE WHEN a.sid in (select sid from v$mystat where rownum = 1) THEN 
'<-- YOU' ELSE ' ' END "INFO"
FROM
         v$process p
        ,v$session a
        ,v$sess_io b
WHERE
a.paddr = p.addr
AND p.background IS NULL
--AND  a.sid NOT IN (select sid from v$mystat where rownum = 1)
AND a.sid = b.sid
AND a.username is not null
--AND (a.last_call_et < 3600 or a.status = 'ACTIVE')
--AND CURRENT_DATE - logon_time > 0
--AND a.sid NOT IN ( select sid from v$mystat where rownum=1)  -- exclude me
--AND (b.block_gets + b.consistent_gets) > 0
ORDER BY a.username;
exit

HTH

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from
relying on this email's technical content is explicitly disclaimed. The author
will in no case be liable for any monetary damages arising from such loss,
damage or destruction.

On Fri, 10 Apr 2020 at 17:37, Ruijing Li  wrote:

Hi all,
I am on spark 2.4.4 and using scala 2.11.12, and running cluster mode on mesos. 
I am ingesting from an oracle database using spark.read.jdbc. I am seeing a 
strange issue where spark just hangs 

Re: Spark interrupts S3 request backoff

2020-04-14 Thread ZHANG Wei
I will make a guess: it's not interrupted, it's killed by the driver or the
resource manager since the executor has fallen asleep for a long time.

You may have to find the root cause in the driver and failed executor log 
contexts.

--
Cheers,
-z


From: Lian Jiang 
Sent: Monday, April 13, 2020 10:43
To: user
Subject: Spark interrupts S3 request backoff

Hi,

My Spark job failed when reading parquet files from S3 due to 503 slow down. 
According to 
https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html, I 
can use backoff to mitigate this issue. However, spark seems to interrupt the 
backoff sleeping (see "sleep interrupted"). Is there a way (e.g. some settings) 
to make spark not interrupt the backoff? Appreciate any hints.



20/04/12 20:15:37 WARN TaskSetManager: Lost task 3347.0 in stage 155.0 (TID 
128138, ip-100-101-44-35.us-west-2.compute.internal, executor 34): 
org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to 
download file path: 
s3://mybucket/myprefix/part-00178-d0a0d51f-f98e-4b9d-8d00-bb3b9acd9a47-c000.snappy.parquet,
 range: 0-19231, partition values: [empty row], isDataPresent: false
at 
org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:248)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; 
Request ID: CECE220993AE7F89; S3 Extended Request ID: 
UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=), 
S3 Extended Request ID: 
UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at 
com.amazon.ws.emr.hadoop.fs.shaded