Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Sanjeev Mishra
Can you reduce maxFilesPerTrigger further and see if the OOM still persists? If 
it does, then the problem may be somewhere else.
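
A minimal sketch of what a lower trigger batch size could look like (spark-shell
style; jsonSchema and the input path stand in for the ones used in the original
job, and 100 is just an arbitrary smaller value):

    // Sketch only: re-run with a smaller maxFilesPerTrigger and check whether the OOM recurs.
    val inputDS = spark.readStream
      .schema(jsonSchema)                    // keep the explicit schema, as in the original job
      .option("maxFilesPerTrigger", 100)     // try something well below the current 500
      .json("s3a://bucket/input/*")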

> On Jul 19, 2020, at 5:37 AM, Jungtaek Lim  
> wrote:
> 
> Please provide logs and dump file for the OOM case - otherwise no one could 
> say what's the cause.
> 
> Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath="...dir..."
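
A rough sketch of one way to wire those options in (the dump directory is a
placeholder; the driver-side option usually has to be supplied at submit time or
in spark-defaults.conf, since the driver JVM is already running by this point):

    import org.apache.spark.sql.SparkSession

    // Sketch: have executor JVMs write a heap dump when they hit an OutOfMemoryError.
    val spark = SparkSession.builder()
      .appName("oom-debug")
      .config("spark.executor.extraJavaOptions",
        "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdumps")
      .getOrCreate()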
> 
> On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava 
>  wrote:
> Issue: I am trying to periodically process 5000+ gzipped JSON files from S3 
> using Structured Streaming code. 
> 
> Here are the key steps:
> Read the JSON schema and broadcast it to the executors
> Read Stream
> 
> Dataset<Row> inputDS = sparkSession.readStream()
>     .format("text")
>     .option("inferSchema", "true")
>     .option("header", "true")
>     .option("multiLine", true)
>     .schema(jsonSchema)
>     .option("mode", "PERMISSIVE")
>     .json(inputPath + "/*");
> Process each file in a map:
> 
> Dataset<String> ds = inputDS.map(x -> { ... }, Encoders.STRING());
> Write output to S3
> 
> StreamingQuery query = ds.coalesce(1)
>     .writeStream()
>     .outputMode("append")
>     .format("csv")
>     ...
>     .start();
> maxFilesPerTrigger is set to 500, so I was hoping the stream would pick up only 
> that many files to process. Why are we getting OOM? If we have more than 
> 3500 files, the system crashes with OOM.
> 
> 



Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Let me share the IPython notebook.

On Tue, Jun 30, 2020 at 11:18 AM Gourav Sengupta 
wrote:

> Hi,
>
> I think that the notebook clearly demonstrates that setting the
> inferTimestamp option to False does not really help.
>
> Is it really impossible for you to show how your own data can be loaded?
> It should be simple: just open the notebook and see why the exact code you
> have given does not work and shows only 11 records.
>
>
> Regards,
> Gourav Sengupta
>
> On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra 
> wrote:
>
>> Hi Gourav,
>>
>> Please check the comments on the ticket; it looks like the performance
>> degradation is attributed to the inferTimestamp option, which is true by default
>> (I have no idea why) in Spark 3.0. This forces Spark to scan the entire text,
>> hence the poor performance.
>>
>> Regards
>> Sanjeev
>>
>> On Jun 30, 2020, at 8:12 AM, Gourav Sengupta 
>> wrote:
>>
>> Hi, Sanjeev,
>>
>> I think that I did precisely that. Can you please download my IPython
>> notebook, have a look, and let me know where I am going wrong? It's
>> attached to the JIRA ticket.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra 
>> wrote:
>>
>>> There are a total of 11 files in the tar. You will have to untar it to
>>> get to the actual files (.json.gz).
>>>
>>> No, I am getting
>>>
>>> Count: 33447
>>>
>>> spark.time(spark.read.json("/data/small-anon/"))
>>> Time taken: 431 ms
>>> res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ...
>>> 2 more fields]
>>>
>>> scala> res73.count()
>>> res74: Long = 33447
>>>
>>> ls -ltr
>>> total 7592
>>> -rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40
>>> part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40
>>> part-9-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40
>>> part-8-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40
>>> part-7-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40
>>> part-6-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40
>>> part-5-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40
>>> part-4-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40
>>> part-3-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40
>>> part-2-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40
>>> part-0-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40
>>> part-1-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
>>> -rw-r--r--  1 sanjeevmishra  staff   0 Jun 29 08:40 _SUCCESS
>>>
>>> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta 
>>> wrote:
>>>
>>> Hi Sanjeev,
>>> that just gives 11 records from the sample that you have loaded to the
>>> JIRA ticket; is that correct?
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra 
>>> wrote:
>>>
>>>> There is not much code, I am just using spark-shell and reading the
>>>> data like so
>>>>
>>>> spark.time(spark.read.json("/data/small-anon/"))
>>>>
>>>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta 
>>>> wrote:
>>>>
>>>> Hi Sanjeev,
>>>>
>>>> can you share the exact code that you are using to read the JSON files?
>>>> Currently I am getting only 11 records from the tar file that you have
>>>> attached with JIRA.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev 
>>>> wrote:
>>>>
>>>>> According to the spec, in addition to the line breaks, you should also
>>>>> put the nested object values in arrays instead of dictionaries.

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Hi Gourav,

Please check the comments on the ticket; it looks like the performance degradation 
is attributed to the inferTimestamp option, which is true by default (I have no idea 
why) in Spark 3.0. This forces Spark to scan the entire text, hence the poor 
performance.
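
For anyone trying the same thing, a sketch of the two workarounds being discussed
(whether inferTimestamp alone helps is exactly what is debated above; the path and
the two-field schema below are placeholders):

    // Option 1 (sketch): turn off the timestamp inference added to the Spark 3.0 JSON reader.
    val df1 = spark.read
      .option("inferTimestamp", "false")
      .json("/data/small-anon/")

    // Option 2 (sketch): skip schema inference entirely by supplying an explicit schema.
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
    val schema = StructType(Seq(
      StructField("created", LongType),
      StructField("id", StringType)      // placeholder: only two of the real fields are shown
    ))
    val df2 = spark.read.schema(schema).json("/data/small-anon/")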

Regards
Sanjeev

> On Jun 30, 2020, at 8:12 AM, Gourav Sengupta  
> wrote:
> 
> Hi, Sanjeev,
> 
> I think that I did precisely that. Can you please download my IPython 
> notebook, have a look, and let me know where I am going wrong? It's 
> attached to the JIRA ticket.
> 
> 
> Regards,
> Gourav Sengupta 
> 
> On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra  <mailto:sanjeev.mis...@gmail.com>> wrote:
> There are a total of 11 files in the tar. You will have to untar it to get to 
> the actual files (.json.gz).
> 
> No, I am getting
> 
> Count: 33447
> 
> spark.time(spark.read.json("/data/small-anon/"))
> Time taken: 431 ms
> res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 
> more fields]
> 
> scala> res73.count()
> res74: Long = 33447
> 
> ls -ltr
> total 7592
> -rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 
> part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 
> part-9-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 
> part-8-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 
> part-7-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 
> part-6-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 
> part-5-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 
> part-4-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 
> part-3-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 
> part-2-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 
> part-0-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 
> part-1-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
> -rw-r--r--  1 sanjeevmishra  staff   0 Jun 29 08:40 _SUCCESS
> 
>> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta > <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> Hi Sanjeev,
>> that just gives 11 records from the sample that you have loaded to the JIRA 
>> ticket; is that correct?
>> 
>> 
>> Regards,
>> Gourav Sengupta 
>> 
>> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra > <mailto:sanjeev.mis...@gmail.com>> wrote:
>> There is not much code, I am just using spark-shell and reading the data 
>> like so
>> 
>> spark.time(spark.read.json("/data/small-anon/"))
>> 
>>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta >> <mailto:gourav.sengu...@gmail.com>> wrote:
>>> 
>>> Hi Sanjeev,
>>> 
>>> can you share the exact code that you are using to read the JSON files? 
>>> Currently I am getting only 11 records from the tar file that you have 
>>> attached with JIRA.
>>> 
>>> Regards,
>>> Gourav Sengupta
>>> 
>>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev >> <mailto:arte...@dtechspace.com>> wrote:
>>> According to the spec, in addition to the line breaks, you should also put 
>>> the nested object values in arrays instead of dictionaries.  You may want 
>>> to give it a try and see if this gives you better performance.
>>> 
>>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  
>>> Hope the Databricks engineers will find an answer or bug fix soon.
>>> 
>>> -- ND
>>> 
>>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>>>> The tar file that I have attached has a bunch of .json.gz files, and these 
>>>> are the files being processed. Each line is self-contained JSON, as 
>>>> shown below
>>>> 
>>>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head 
>>>> -3
>>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>>>>  HNC IGD",

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There are a total of 11 files in the tar. You will have to untar it to get to 
the actual files (.json.gz).

No, I am getting

Count: 33447

spark.time(spark.read.json("/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more 
fields]

scala> res73.count()
res74: Long = 33447

ls -ltr
total 7592
-rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 
part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 
part-9-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 
part-8-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 
part-7-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 
part-6-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 
part-5-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 
part-4-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 
part-3-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 
part-2-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 
part-0-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 
part-1-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff   0 Jun 29 08:40 _SUCCESS

> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta  
> wrote:
> 
> Hi Sanjeev,
> that just gives 11 records from the sample that you have loaded to the JIRA 
> ticket; is that correct?
> 
> 
> Regards,
> Gourav Sengupta 
> 
> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra  <mailto:sanjeev.mis...@gmail.com>> wrote:
> There is not much code, I am just using spark-shell and reading the data like 
> so
> 
> spark.time(spark.read.json("/data/small-anon/"))
> 
>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta > <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> Hi Sanjeev,
>> 
>> can you share the exact code that you are using to read the JSON files? 
>> Currently I am getting only 11 records from the tar file that you have 
>> attached with JIRA.
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev > <mailto:arte...@dtechspace.com>> wrote:
>> According to the spec, in addition to the line breaks, you should also put 
>> the nested object values in arrays instead of dictionaries.  You may want to 
>> give it a try and see if this gives you better performance.
>> 
>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  
>> Hope the Databricks engineers will find an answer or bug fix soon.
>> 
>> -- ND
>> 
>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>>> The tar file that I have attached has a bunch of .json.gz files, and these 
>>> are the files being processed. Each line is self-contained JSON, as shown 
>>> below
>>> 
>>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head 
>>> -3
>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>>>  HNC IGD","Annex F 
>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>>>  Arris FastPath Speed 
>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA
>>>  
>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>  HNC IGD 
>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamo

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There is not much code; I am just using spark-shell and reading the data like so:

spark.time(spark.read.json("/data/small-anon/"))

> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta  
> wrote:
> 
> Hi Sanjeev,
> 
> can you share the exact code that you are using to read the JSON files? 
> Currently I am getting only 11 records from the tar file that you have 
> attached with JIRA.
> 
> Regards,
> Gourav Sengupta
> 
> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev  <mailto:arte...@dtechspace.com>> wrote:
> According to the spec, in addition to the line breaks, you should also put 
> the nested object values in arrays instead of dictionaries.  You may want to 
> give it a try and see if this gives you better performance.
> 
> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  
> Hope the Databricks engineers will find an answer or bug fix soon.
> 
> -- ND
> 
> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>> The tar file that I have attached has a bunch of .json.gz files, and these 
>> are the files being processed. Each line is self-contained JSON, as shown 
>> below
>> 
>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>>  HNC IGD","Annex F 
>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>>  Arris FastPath Speed 
>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA
>>  
>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>  HNC IGD 
>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
>> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdv

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
Done. https://issues.apache.org/jira/browse/SPARK-32130



On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk 
wrote:

> Hello Sanjeev,
>
> It is hard to troubleshoot the issue without the input files. Could you open
> a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and
> attach the JSON files there (or samples, or code which generates the JSON
> files)?
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra 
> wrote:
>
>> It has read everything. As you can see, the count timing is still
>> smaller in Spark 2.4.
>>
>> Spark 2.4
>>
>> scala> spark.time(spark.read.json("/data/20200528"))
>> Time taken: 19691 ms
>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ...
>> 5 more fields]
>>
>> scala> spark.time(res61.count())
>> Time taken: 7113 ms
>> res64: Long = 2605349
>>
>> Spark 3.0
>> scala> spark.time(spark.read.json("/data/20200528"))
>> 20/06/29 08:06:53 WARN package: Truncated the string representation of a
>> plan since it was too large. This behavior can be adjusted by setting
>> 'spark.sql.debug.maxToStringFields'.
>> Time taken: 849652 ms
>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5
>> more fields]
>>
>> scala> spark.time(res0.count())
>> Time taken: 8201 ms
>> res2: Long = 2605349
>>
>>
>>
>>
>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev 
>> wrote:
>>
>>> Could you share your code?  Are you sure your Spark 2.4 cluster had
>>> indeed read anything?  It looks like the Input Size field is empty under 2.4.
>>>
>>> -- ND
>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>>
>>>
>>> I have a large amount of JSON files that Spark 2.4 can read in 36 seconds, but
>>> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
>>> looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone
>>> have any idea what is going on? Is there any configuration problem with
>>> Spark 3.0?
>>>
>>> Here are the details:
>>>
>>> *Spark 2.4*
>>>
>>> Summary Metrics for 2203 Completed Tasks
>>> <http://10.0.0.8:4040/stages/stage/?id=0=0#tasksTitle>
>>> Metric   | Min    | 25th percentile | Median | 75th percentile | Max
>>> Duration | 0.0 ms | 0.0 ms          | 0.0 ms | 1.0 ms          | 62.0 ms
>>> GC Time  | 0.0 ms | 0.0 ms          | 0.0 ms | 0.0 ms          | 11.0 ms
>>>
>>> Aggregated Metrics by Executor
>>> Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted
>>> driver      | 10.0.0.8:49159 | 36 s      | 2203        | 0      | 0      | 2203      | false
>>>
>>>
>>> *Spark 3.0*
>>>
>>> Summary Metrics for 8 Completed Tasks
>>> <http://10.0.0.8:4040/stages/stage/?id=1=0=1=47#tasksTitle>
>>> Metric               | Min              | 25th percentile  | Median           | 75th percentile  | Max
>>> Duration             | 3.8 min          | 4.0 min          | 4.1 min          | 4.4 min          | 5.0 min
>>> GC Time              | 3 s              | 3 s              | 3 s              | 4 s              | 4 s
>>> Input Size / Records | 15.6 MiB / 51028 | 16.2 MiB / 53303 | 16.8 MiB / 55259 | 17.8 MiB / 58148 | 20.2 MiB / 71624
>>>
>>> Aggregated Metrics by Executor
>>> Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted | Input Size / Records
>>> driver      | 10.0.0.8:50224 | 33 min    | 8           | 0      | 0      | 8         | false       | 136.1 MiB / 451999
>>>
>>>
>>> The DAG is also different
>>> Spark 2.4 DAG
>>>
>>> [image: Screenshot 2020-06-27 16.30.26.png]
>>>
>>> Spark 3.0 DAG
>>>
>>> [image: Screenshot 2020-06-27 16.32.32.png]
>>>
>>>
>>>


Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
It has read everything. As you can see, the count timing is still smaller
in Spark 2.4.

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5
more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a
plan since it was too large. This behavior can be adjusted by setting
'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5
more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349
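
A quick way to check how much of that gap is schema inference (a sketch only;
res0 is the DataFrame from the Spark 3.0 read above) is to time a second read
that reuses the already-inferred schema:

    // Sketch: time a read that skips inference by reusing the schema Spark just inferred.
    val inferredSchema = res0.schema
    val noInfer = spark.time(spark.read.schema(inferredSchema).json("/data/20200528"))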




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev  wrote:

> Could you share your code?  Are you sure your Spark 2.4 cluster had indeed
> read anything?  It looks like the Input Size field is empty under 2.4.
>
> -- ND
> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>
>
> I have a large amount of JSON files that Spark 2.4 can read in 36 seconds, but
> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
> looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone
> have any idea what is going on? Is there any configuration problem with
> Spark 3.0?
>
> Here are the details:
>
> *Spark 2.4*
>
> Summary Metrics for 2203 Completed Tasks
> <http://10.0.0.8:4040/stages/stage/?id=0=0#tasksTitle>
> Metric   | Min    | 25th percentile | Median | 75th percentile | Max
> Duration | 0.0 ms | 0.0 ms          | 0.0 ms | 1.0 ms          | 62.0 ms
> GC Time  | 0.0 ms | 0.0 ms          | 0.0 ms | 0.0 ms          | 11.0 ms
>
> Aggregated Metrics by Executor
> Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted
> driver      | 10.0.0.8:49159 | 36 s      | 2203        | 0      | 0      | 2203      | false
>
>
> *Spark 3.0*
>
> Summary Metrics for 8 Completed Tasks
> <http://10.0.0.8:4040/stages/stage/?id=1=0=1=47#tasksTitle>
> Metric               | Min              | 25th percentile  | Median           | 75th percentile  | Max
> Duration             | 3.8 min          | 4.0 min          | 4.1 min          | 4.4 min          | 5.0 min
> GC Time              | 3 s              | 3 s              | 3 s              | 4 s              | 4 s
> Input Size / Records | 15.6 MiB / 51028 | 16.2 MiB / 53303 | 16.8 MiB / 55259 | 17.8 MiB / 58148 | 20.2 MiB / 71624
>
> Aggregated Metrics by Executor
> Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted | Input Size / Records
> driver      | 10.0.0.8:50224 | 33 min    | 8           | 0      | 0      | 8         | false       | 136.1 MiB / 451999
>
>
> The DAG is also different
> Spark 2.4 DAG
>
> [image: Screenshot 2020-06-27 16.30.26.png]
>
> Spark 3.0 DAG
>
> [image: Screenshot 2020-06-27 16.32.32.png]
>
>
>


Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
There is not much code; I am using the spark-shell provided by Spark 2.4 and
Spark 3.

val dp = spark.read.json("/Users//data/dailyparams/20200528")



On Mon, Jun 29, 2020 at 2:25 AM Gourav Sengupta 
wrote:

> Hi,
>
> can you please share the SPARK code?
>
>
>
> Regards,
> Gourav
>
> On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra 
> wrote:
>
>>
>> I have a large amount of JSON files that Spark 2.4 can read in 36 seconds, but
>> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
>> looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone
>> have any idea what is going on? Is there any configuration problem with
>> Spark 3.0?
>>
>> Here are the details:
>>
>> *Spark 2.4*
>>
>> Summary Metrics for 2203 Completed Tasks
>> <http://10.0.0.8:4040/stages/stage/?id=0=0#tasksTitle>
>> Metric   | Min    | 25th percentile | Median | 75th percentile | Max
>> Duration | 0.0 ms | 0.0 ms          | 0.0 ms | 1.0 ms          | 62.0 ms
>> GC Time  | 0.0 ms | 0.0 ms          | 0.0 ms | 0.0 ms          | 11.0 ms
>>
>> Aggregated Metrics by Executor
>> Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted
>> driver      | 10.0.0.8:49159 | 36 s      | 2203        | 0      | 0      | 2203      | false
>>
>>
>> *Spark 3.0*
>>
>> Summary Metrics for 8 Completed Tasks
>> <http://10.0.0.8:4040/stages/stage/?id=1=0=1=47#tasksTitle>
>> Metric               | Min              | 25th percentile  | Median           | 75th percentile  | Max
>> Duration             | 3.8 min          | 4.0 min          | 4.1 min          | 4.4 min          | 5.0 min
>> GC Time              | 3 s              | 3 s              | 3 s              | 4 s              | 4 s
>> Input Size / Records | 15.6 MiB / 51028 | 16.2 MiB / 53303 | 16.8 MiB / 55259 | 17.8 MiB / 58148 | 20.2 MiB / 71624
>>
>> Aggregated Metrics by Executor
>> Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted | Input Size / Records
>> driver      | 10.0.0.8:50224 | 33 min    | 8           | 0      | 0      | 8         | false       | 136.1 MiB / 451999
>>
>>
>> The DAG is also different
>> Spark 2.4 DAG
>>
>> [image: Screenshot 2020-06-27 16.30.26.png]
>>
>> Spark 3.0 DAG
>>
>> [image: Screenshot 2020-06-27 16.32.32.png]
>>
>>
>>


Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-27 Thread Sanjeev Mishra
I have a large amount of JSON files that Spark 2.4 can read in 36 seconds, but
Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone
have any idea what is going on? Is there any configuration problem with
Spark 3.0?

Here are the details:

*Spark 2.4*

Summary Metrics for 2203 Completed Tasks

Metric   | Min    | 25th percentile | Median | 75th percentile | Max
Duration | 0.0 ms | 0.0 ms          | 0.0 ms | 1.0 ms          | 62.0 ms
GC Time  | 0.0 ms | 0.0 ms          | 0.0 ms | 0.0 ms          | 11.0 ms

Aggregated Metrics by Executor
Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted
driver      | 10.0.0.8:49159 | 36 s      | 2203        | 0      | 0      | 2203      | false


*Spark 3.0*

Summary Metrics for 8 Completed Tasks

Metric               | Min              | 25th percentile  | Median           | 75th percentile  | Max
Duration             | 3.8 min          | 4.0 min          | 4.1 min          | 4.4 min          | 5.0 min
GC Time              | 3 s              | 3 s              | 3 s              | 4 s              | 4 s
Input Size / Records | 15.6 MiB / 51028 | 16.2 MiB / 53303 | 16.8 MiB / 55259 | 17.8 MiB / 58148 | 20.2 MiB / 71624

Aggregated Metrics by Executor
Executor ID | Address        | Task Time | Total Tasks | Failed | Killed | Succeeded | Blacklisted | Input Size / Records
driver      | 10.0.0.8:50224 | 33 min    | 8           | 0      | 0      | 8         | false       | 136.1 MiB / 451999


The DAG is also different
Spark 2.4 DAG

[image: Screenshot 2020-06-27 16.30.26.png]

Spark 3.0 DAG

[image: Screenshot 2020-06-27 16.32.32.png]


Spark 3.0.0 spark.read.json never completes

2020-06-27 Thread Sanjeev Mishra
Hi all,

I have a huge number of JSON files that Spark 2.4 can easily finish reading,
but Spark 3.0.0 never completes. I am running both Spark 2 and Spark 3 on a Mac.


Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sanjeev Mishra
You can use the catalog APIs; see the following:

https://stackoverflow.com/questions/54268845/how-to-check-the-number-of-partitions-of-a-spark-dataframe-without-incurring-the/54270537
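
If the goal is just to enumerate the day=/hour=/country= directories that were
written, without scanning the data, a sketch using the Hadoop FileSystem API
directly (shown in spark-shell / Scala style; the output path is a placeholder,
and the same API is reachable from PySpark through the JVM gateway):

    import org.apache.hadoop.fs.Path

    // Sketch: walk the three partition levels under the output path.
    val out = new Path("s3a://bucket/output")
    val fs  = out.getFileSystem(spark.sparkContext.hadoopConfiguration)

    val partitions = for {
      day     <- fs.listStatus(out).toSeq          if day.isDirectory
      hour    <- fs.listStatus(day.getPath).toSeq  if hour.isDirectory
      country <- fs.listStatus(hour.getPath).toSeq if country.isDirectory
    } yield s"${day.getPath.getName}/${hour.getPath.getName}/${country.getPath.getName}"

    partitions.foreach(println)   // e.g. day=2020-06-20/hour=1/country=US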

On Thu, Jun 25, 2020 at 6:19 AM Tzahi File  wrote:

> I don't want to query with a distinct on the partitioned columns; the df
> contains over 1 billion records.
> I just want to know the partitions that were created.
>
> On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke  wrote:
>
>> By doing a select on the df?
>>
>> Am 25.06.2020 um 14:52 schrieb Tzahi File :
>>
>> 
>> Hi,
>>
>> I'm using pyspark to write df to s3, using the following command:
>> "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)".
>>
>> Is there any way to get the partitions created?
>> e.g.
>> day=2020-06-20/hour=1/country=US
>> day=2020-06-20/hour=2/country=US
>> ..
>>
>> --
>> Tzahi File
>> Data Engineer
>>
>> email tzahi.f...@ironsrc.com
>> mobile +972-546864835
>> fax +972-77-5448273
>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>> ironsrc.com 
>>
>>
>
> --
> Tzahi File
> Data Engineer
>
> email tzahi.f...@ironsrc.com
> mobile +972-546864835
> fax +972-77-5448273
> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> ironsrc.com 
>