Kafka Zeppelin integration

2020-06-19 Thread silavala
Hi, here is my question: Spark code run on Zeppelin is unable to find the
Kafka source even though a dependency is specified. Is there any way to
fix this? The Zeppelin version is 0.9.0, the Spark version is 2.4.6, and
the Kafka version is 2.4.1. I have specified the dependency in the packages
and added a jar file that contains the Kafka 0.10 streaming connector.


thank you
Suraj Ilavala




Re: [pyspark 2.3+] read/write huge data with smaller block size (128MB per block)

2020-06-19 Thread Rishi Shah
Thanks Sean! To combat the skew I do have another column I partition by, and
that has worked well (like below). However, in the image I attached in my
original email it looks like 2 tasks processed nothing; am I reading the
Spark UI task table right? All 4 dates have data: 2 dates have ~200MB and
the other 2 have ~800MB. This was just a test run to check the behavior.
Shouldn't I see all 4 tasks with some output rows?

df.repartition('file_date',
'part_col').write.partitionBy('file_date').parquet(PATH)
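
A quick way to see how balanced those write-side tasks will be is to look at row counts per partition column; a minimal sketch using the df, file_date, and part_col names from the snippet above:

(df.groupBy('file_date', 'part_col')
   .count()                                 # rows per (file_date, part_col) combination
   .orderBy('count', ascending=False)
   .show(truncate=False))                   # strongly uneven counts mean uneven tasks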


On Fri, Jun 19, 2020 at 9:38 AM Sean Owen  wrote:

> Yes, you'll generally get 1 partition per block, and 1 task per partition.
> The amount of RAM isn't directly relevant; it's not loaded into memory.
> But you may nevertheless get some improvement with larger partitions /
> tasks, though typically only if your tasks are very small and very fast
> right now (completing in a few seconds).
> You can use minSplitSize to encourage some RDD APIs to choose larger
> partitions, but not in the DF API.
> Instead you can try coalescing to a smaller number of partitions, without
> a shuffle (the shuffle will probably negate any benefit).
>
> However what I see here is different still -- you have serious data skew
> because you partitioned by date, and I suppose some dates have lots of
> data, some have almost none.
>
>
> On Fri, Jun 19, 2020 at 12:17 AM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> I have about 10TB of parquet data on S3, where data files have 128MB
>> sized blocks. Spark by default picks up one block per task, even
>> though every task within an executor has at least 1.8GB of memory. Isn't
>> that wasteful? Is there any way to speed up this processing? Is there a way
>> to force tasks to pick up more files that sum up to a certain block size, or
>> will Spark always assign one block per task? Basically, is there an override
>> to make sure Spark tasks read larger block(s)?
>>
>> Also, as seen in the image here: while writing 4 files (partitionBy
>> file_date), one file per partition, somehow 4 threads are active but two
>> threads seem to be doing nothing, and the other 2 threads have taken over
>> the writing for all 4 files. Shouldn't all 4 tasks pick up one file each?
>>
>> For this example, assume df has 4 file_dates worth data.
>>
>> df.repartition('file_date').write.partitionBy('file_date').parquet(PATH)
>>
>> Screen Shot 2020-06-18 at 2.01.53 PM.png (126K)
>> 
>>
>> Any suggestions/feedback helps, appreciate it!
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>

-- 
Regards,

Rishi Shah


Re: [pyspark 2.3+] read/write huge data with smaller block size (128MB per block)

2020-06-19 Thread Sean Owen
Yes, you'll generally get 1 partition per block, and 1 task per partition.
The amount of RAM isn't directly relevant; it's not loaded into memory. But
you may nevertheless get some improvement with larger partitions / tasks,
though typically only if your tasks are very small and very fast right now
(completing in a few seconds).
You can use minSplitSize to encourage some RDD APIs to choose larger
partitions, but not in the DF API.
Instead you can try coalescing to a smaller number of partitions, without a
shuffle (the shuffle will probably negate any benefit).
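
A minimal PySpark sketch of the coalesce approach, with a placeholder path and partition count; the spark.sql.files.maxPartitionBytes setting shown is an extra DataFrame-reader knob not mentioned in the thread, and its value here is only illustrative:

# Optional: make file-based reads produce larger input partitions (bytes; 512 MB here).
# Must be set before the DataFrame is read.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.parquet("s3://bucket/path")   # placeholder path

# Shrink the partition count without a shuffle, as suggested above.
df_fewer = df.coalesce(500)                   # placeholder target count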

However what I see here is different still -- you have serious data skew
because you partitioned by date, and I suppose some dates have lots of
data, some have almost none.


On Fri, Jun 19, 2020 at 12:17 AM Rishi Shah 
wrote:

> Hi All,
>
> I have about 10TB of parquet data on S3, where data files have 128MB sized
> blocks. Spark by default picks up one block per task, even though
> every task within an executor has at least 1.8GB of memory. Isn't that wasteful?
> Is there any way to speed up this processing? Is there a way to force tasks
> to pick up more files that sum up to a certain block size, or will Spark
> always assign one block per task? Basically, is there an override to make
> sure Spark tasks read larger block(s)?
>
> Also, as seen in the image here: while writing 4 files (partitionBy
> file_date), one file per partition, somehow 4 threads are active but two
> threads seem to be doing nothing, and the other 2 threads have taken over
> the writing for all 4 files. Shouldn't all 4 tasks pick up one file each?
>
> For this example, assume df has 4 file_dates worth data.
>
> df.repartition('file_date').write.partitionBy('file_date').parquet(PATH)
>
> Screen Shot 2020-06-18 at 2.01.53 PM.png (126K)
> 
>
> Any suggestions/feedback helps, appreciate it!
> --
> Regards,
>
> Rishi Shah
>


Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Thanks. You mean in a for loop? Could you please share pseudocode in Spark?
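
A minimal PySpark sketch of the JSON Lines approach suggested below, assuming the 50 GB file has been rewritten as one JSON object per line; all paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON Lines (one object per line) can be split across many tasks, so neither
# the driver nor any single task has to hold the whole 50 GB.
df = spark.read.json("s3://bucket/transactions.jsonl")                    # hypothetical path

# By contrast, multiLine=True treats each file as one unsplittable unit:
# df = spark.read.option("multiLine", True).json("s3://bucket/transactions.json")

# Persist to HDFS for the next transformation, as asked in the original mail.
df.write.mode("overwrite").parquet("hdfs:///data/transactions_parquet")   # hypothetical path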

On Fri, Jun 19, 2020 at 8:39 AM Jörn Franke  wrote:

> Make every JSON object a line and then read it as JSON Lines, not as multiline.
>
> On 19.06.2020 at 14:37, Chetan Khatri wrote:
>
> 
> All transactions are in JSON; it is not a single array.
>
> On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner 
> wrote:
>
>> It's an interesting problem. What is the structure of the file? One big
>> array? One hash with many key-value pairs?
>>
>> Stephan
>>
>> On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hi Spark Users,
>>>
>>> I have a 50 GB JSON file that I would like to read and persist to HDFS so
>>> it can be taken into the next transformation. I am trying to read it as
>>> spark.read.json(path), but this gives an out-of-memory error on the driver.
>>> Obviously, I can't afford 50 GB of driver memory. In general, what
>>> is the best practice for reading a large JSON file like this 50 GB one?
>>>
>>> Thanks
>>>
>>
>>
>> --
>> Stephan Wehner, Ph.D.
>> The Buckmaster Institute, Inc.
>> 2150 Adanac Street
>> Vancouver BC V5L 2E7
>> Canada
>> Cell (604) 767-7415
>> Fax (888) 808-4655
>>
>> Sign up for our free email course
>> http://buckmaster.ca/small_business_website_mistakes.html
>>
>> http://www.buckmaster.ca
>> http://answer4img.com
>> http://loggingit.com
>> http://clocklist.com
>> http://stephansmap.org
>> http://benchology.com
>> http://www.trafficlife.com
>> http://stephan.sugarmotor.org (Personal Blog)
>> @stephanwehner (Personal Account)
>> VA7WSK (Personal call sign)
>>
>


Re: Reading TB of JSON file

2020-06-19 Thread Jörn Franke
Make every JSON object a line and then read it as JSON Lines, not as multiline.

> On 19.06.2020 at 14:37, Chetan Khatri wrote:
> 
> 
> All transactions are in JSON; it is not a single array.
> 
>> On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner  
>> wrote:
>> It's an interesting problem. What is the structure of the file? One big
>> array? One hash with many key-value pairs?
>> 
>> Stephan
>> 
>>> On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri  
>>> wrote:
>>> Hi Spark Users,
>>> 
>>> I have a 50 GB JSON file that I would like to read and persist to HDFS so it
>>> can be taken into the next transformation. I am trying to read it as
>>> spark.read.json(path), but this gives an out-of-memory error on the driver.
>>> Obviously, I can't afford 50 GB of driver memory. In general, what
>>> is the best practice for reading a large JSON file like this 50 GB one?
>>> 
>>> Thanks
>> 
>> 


Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
All transactions are in JSON; it is not a single array.

On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner 
wrote:

> It's an interesting problem. What is the structure of the file? One big
> array? One hash with many key-value pairs?
>
> Stephan
>
> On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri 
> wrote:
>
>> Hi Spark Users,
>>
>> I have a 50 GB JSON file that I would like to read and persist to HDFS so
>> it can be taken into the next transformation. I am trying to read it as
>> spark.read.json(path), but this gives an out-of-memory error on the driver.
>> Obviously, I can't afford 50 GB of driver memory. In general, what
>> is the best practice for reading a large JSON file like this 50 GB one?
>>
>> Thanks
>>
>
>
>


Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Yes

On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta 
wrote:

> Hi,
> So you have a single JSON record in multiple lines?
> And all the 50 GB is in one file?
>
> Regards,
> Gourav
>
> On Thu, 18 Jun 2020, 14:34 Chetan Khatri, 
> wrote:
>
>> It is dynamically generated and written to an S3 bucket, not historical data,
>> so I guess it doesn't have the JSON Lines format.
>>
>> On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke  wrote:
>>
>>> Depends on the data types you use.
>>>
>>> Do you have it in JSON Lines format? Then the amount of memory plays much
>>> less of a role.
>>>
>>> Otherwise if it is one large object or array I would not recommend it.
>>>
>>> > On 18.06.2020 at 15:12, Chetan Khatri  wrote:
>>> >
>>> > 
>>> > Hi Spark Users,
>>> >
>>> > I have a 50 GB JSON file that I would like to read and persist to HDFS
>>> > so it can be taken into the next transformation. I am trying to read it as
>>> > spark.read.json(path), but this gives an out-of-memory error on the driver.
>>> > Obviously, I can't afford 50 GB of driver memory. In general, what
>>> > is the best practice for reading a large JSON file like this 50 GB one?
>>> >
>>> > Thanks
>>>
>>


Re: Hey good looking toPandas ()

2020-06-19 Thread Anwar AliKhan
I got an illegal argument error with 2.4.6.

I then pointed my Jupyter notebook to the 3.0 version and it worked as
expected, using the same .ipynb file.

I was following this machine learning example.
“Your First Apache Spark ML Model” by Favio Vázquez
https://towardsdatascience.com/your-first-apache-spark-ml-model-d2bb82b599dd


In the example he is using version 3.0, so I assumed I got the error because
I am using a different version (2.4.6).
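
A small sketch of the pattern in question, assuming df is an existing Spark DataFrame in the notebook session; the 1000-row cap is an arbitrary illustration:

# toPandas() collects every row to the driver, so cap the size first when the
# DataFrame may be large. The resulting pandas DataFrame renders as an HTML
# table in Jupyter, which is why it displays better than df.show().
pdf = df.limit(1000).toPandas()
pdf.head()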



On Fri, 19 Jun 2020, 08:06 Stephen Boesch,  wrote:

> afaik It has been there since  Spark 2.0 in 2015.   Not certain about
> Spark 1.5/1.6
>
> On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan 
> wrote:
>
>> I first ran the command
>> df.show()
>>
>> for a sanity check of my DataFrame.
>>
>> I wasn't impressed with the display.
>>
>> I then ran
>> df.toPandas() in a Jupyter notebook.
>>
>> Now the display is really good looking.
>>
>> Is toPandas() a new function which became available in Spark 3.0?
>>
>>
>>
>>
>>
>>


Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
AFAIK it has been there since Spark 2.0 in 2015. Not certain about Spark
1.5/1.6.

On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan 
wrote:

> I first ran the command
> df.show()
>
> for a sanity check of my DataFrame.
>
> I wasn't impressed with the display.
>
> I then ran
> df.toPandas() in a Jupyter notebook.
>
> Now the display is really good looking.
>
> Is toPandas() a new function which became available in Spark 3.0?
>
>
>
>
>
>