Re: reading each JSON file from dataframe...

2022-07-11 Thread Enrico Minack
All you need to do is implement a method readJson that reads a single
file given its path. Then you map the values of column file_path to the
respective JSON content as a string. This can be done via a UDF or
simply Dataset.map:


// Assumes a SparkSession named `spark`; this import is pre-loaded in spark-shell.
import spark.implicits._

case class RowWithJsonUri(entity_id: String, file_path: String, other_useful_id: String)
case class RowWithJsonContent(entity_id: String, json_content: String, other_useful_id: String)

val ds = Seq(
  RowWithJsonUri("id-01f7pqqbxddb3b1an6ntyqx6mg", "gs://bucket1/path/to/id-01g4he5cb4xqn6s1999k6y1vbd/file_result.json", "id-2-01g4he5cb4xqn6s1999k6y1vbd"),
  RowWithJsonUri("id-01f7pqgbwms4ajmdtdedtwa3mf", "gs://bucket1/path/to/id-01g4he5cbh52che104rwy603sr/file_result.json", "id-2-01g4he5cbh52che104rwy603sr"),
  RowWithJsonUri("id-01f7pqqbxejt3ef4ap9qcs78m5", "gs://bucket1/path/to/id-01g4he5cbqmdv7dnx46sebs0gt/file_result.json", "id-2-01g4he5cbqmdv7dnx46sebs0gt"),
  RowWithJsonUri("id-01f7pqqbynh895ptpjjfxvk6dc", "gs://bucket1/path/to/id-01g4he5cbx1kwhgvdme1s560dw/file_result.json", "id-2-01g4he5cbx1kwhgvdme1s560dw")
).toDS()

ds.show(false)
+-----------------------------+-------------------------------------------------------------------+-------------------------------+
|entity_id                    |file_path                                                          |other_useful_id                |
+-----------------------------+-------------------------------------------------------------------+-------------------------------+
|id-01f7pqqbxddb3b1an6ntyqx6mg|gs://bucket1/path/to/id-01g4he5cb4xqn6s1999k6y1vbd/file_result.json|id-2-01g4he5cb4xqn6s1999k6y1vbd|
|id-01f7pqgbwms4ajmdtdedtwa3mf|gs://bucket1/path/to/id-01g4he5cbh52che104rwy603sr/file_result.json|id-2-01g4he5cbh52che104rwy603sr|
|id-01f7pqqbxejt3ef4ap9qcs78m5|gs://bucket1/path/to/id-01g4he5cbqmdv7dnx46sebs0gt/file_result.json|id-2-01g4he5cbqmdv7dnx46sebs0gt|
|id-01f7pqqbynh895ptpjjfxvk6dc|gs://bucket1/path/to/id-01g4he5cbx1kwhgvdme1s560dw/file_result.json|id-2-01g4he5cbx1kwhgvdme1s560dw|
+-----------------------------+-------------------------------------------------------------------+-------------------------------+


// Placeholder that just echoes the URI; a sketch of a real implementation follows the output below.
def readJson(uri: String): String = { s"content of $uri" }

ds.map { row =>
  RowWithJsonContent(row.entity_id, readJson(row.file_path), row.other_useful_id)
}.show(false)

+-----------------------------+------------------------------------------------------------------------------+-------------------------------+
|entity_id                    |json_content                                                                  |other_useful_id                |
+-----------------------------+------------------------------------------------------------------------------+-------------------------------+
|id-01f7pqqbxddb3b1an6ntyqx6mg|content of gs://bucket1/path/to/id-01g4he5cb4xqn6s1999k6y1vbd/file_result.json|id-2-01g4he5cb4xqn6s1999k6y1vbd|
|id-01f7pqgbwms4ajmdtdedtwa3mf|content of gs://bucket1/path/to/id-01g4he5cbh52che104rwy603sr/file_result.json|id-2-01g4he5cbh52che104rwy603sr|
|id-01f7pqqbxejt3ef4ap9qcs78m5|content of gs://bucket1/path/to/id-01g4he5cbqmdv7dnx46sebs0gt/file_result.json|id-2-01g4he5cbqmdv7dnx46sebs0gt|
|id-01f7pqqbynh895ptpjjfxvk6dc|content of gs://bucket1/path/to/id-01g4he5cbx1kwhgvdme1s560dw/file_result.json|id-2-01g4he5cbx1kwhgvdme1s560dw|
+-----------------------------+------------------------------------------------------------------------------+-------------------------------+
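
Since the required filesystem libraries are already on the classpath (per the
question below), a real readJson could go through the Hadoop FileSystem API.
This is a minimal sketch, not a tested implementation: the fresh Configuration
per call and the lack of error handling are simplifications, and it assumes
the GCS connector resolves gs:// paths.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Resolves the filesystem from the URI scheme (gs://, hdfs://, s3a://, ...)
// and reads the whole file as a UTF-8 string. The Configuration is created
// inside the function so it runs on executors without any driver-side state.
def readJson(uri: String): String = {
  val path = new Path(uri)
  val fs = path.getFileSystem(new Configuration())
  val in = fs.open(path)
  try scala.io.Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()
}

If you prefer the UDF route mentioned above over Dataset.map, the same
function can be wrapped like this (readJsonUdf is just an illustrative name):

import org.apache.spark.sql.functions.{col, udf}

val readJsonUdf = udf(readJson _)
ds.withColumn("json_content", readJsonUdf(col("file_path"))).show(false)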

Cheers,
Enrico




On 10.07.22 at 09:11, Muthu Jayakumar wrote:

Hello there,

I have a dataframe with the following...

+-----------------------------+-------------------------------------------------------------------+-------------------------------+
|entity_id                    |file_path                                                          |other_useful_id                |
+-----------------------------+-------------------------------------------------------------------+-------------------------------+
|id-01f7pqqbxddb3b1an6ntyqx6mg|gs://bucket1/path/to/id-01g4he5cb4xqn6s1999k6y1vbd/file_result.json|id-2-01g4he5cb4xqn6s1999k6y1vbd|
|id-01f7pqgbwms4ajmdtdedtwa3mf|gs://bucket1/path/to/id-01g4he5cbh52che104rwy603sr/file_result.json|id-2-01g4he5cbh52che104rwy603sr|
|id-01f7pqqbxejt3ef4ap9qcs78m5|gs://bucket1/path/to/id-01g4he5cbqmdv7dnx46sebs0gt/file_result.json|id-2-01g4he5cbqmdv7dnx46sebs0gt|
|id-01f7pqqbynh895ptpjjfxvk6dc|gs://bucket1/path/to/id-01g4he5cbx1kwhgvdme1s560dw/file_result.json|id-2-01g4he5cbx1kwhgvdme1s560dw|
+-----------------------------+-------------------------------------------------------------------+-------------------------------+

I would like to read the JSON file referenced by `file_path` in each row
and write the result to another dataframe containing `entity_id`,
`other_useful_id`, `json_content`, `file_path`.
Assume that I already have the required HDFS URL libraries in my
classpath.


Please advise,
Muthu




Re: about cpu cores

2022-07-11 Thread Gourav Sengupta
Hi,
Please see Sean's answer, and please read about parallelism in Spark.

Regards,
Gourav Sengupta

On Mon, Jul 11, 2022 at 10:12 AM Tufan Rakshit  wrote:

> So on average, for every 4 cores you get back 3.6 cores in Yarn, but you
> can use only 3.
> In Kubernetes you get back 3.6 and can also use the full 3.6.
>
> Best
> Tufan
>
> On Mon, 11 Jul 2022 at 11:02, Yong Walt  wrote:
>
>> We were using Yarn. thanks.
>>
>> On Sun, Jul 10, 2022 at 9:02 PM Tufan Rakshit  wrote:
>>
>>> Mainly depends on what your cluster manager is, Yarn or Kubernetes?
>>> Best
>>> Tufan
>>>
>>> On Sun, 10 Jul 2022 at 14:38, Sean Owen  wrote:
>>>
>>>> Jobs consist of tasks, each of which consumes a core (can be set to >1
>>>> too, but that's a different story). If there are more tasks ready to
>>>> execute than available cores, some tasks simply wait.
>>>>
>>>> On Sun, Jul 10, 2022 at 3:31 AM Yong Walt  wrote:
>>>>
>>>>> Given that my Spark cluster has 128 cores in total, what will happen
>>>>> if the jobs I submit to the cluster (each job assigned only one core)
>>>>> number more than 128?
>>>>>
>>>>> Thank you.
>>>>>


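To make Sean's point concrete: an application's concurrent task slots are its
total cores divided by spark.task.cpus, and anything beyond that simply
queues. A minimal sketch with illustrative values for a 128-core cluster (the
property names are standard Spark configuration; the numbers are made up):

import org.apache.spark.sql.SparkSession

// Illustrative values only: 32 executors x 4 cores = 128 cores in total.
// With spark.task.cpus = 1 (the default), at most 128 tasks run at once;
// additional tasks and jobs wait for a free core rather than being rejected.
val spark = SparkSession.builder()
  .appName("core-accounting-sketch")
  .config("spark.executor.instances", "32") // executors requested from YARN
  .config("spark.executor.cores", "4")      // cores per executor
  .config("spark.task.cpus", "1")           // cores reserved by each task
  .getOrCreate()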
