Unsubscribe

2023-08-21 Thread Dipayan Dev
-- 



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop*
M.Tech (AI), IISc, Bangalore


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
Sorry for being so dense, and thank you for your help.

I was using this version
phoenix-spark-5.0.0-HBase-2.0.jar

Because it was the latest in this repo
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark
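
For anyone hitting the same thing: the NoSuchMethodError above is the classic
symptom of a connector compiled against a different Scala line than the Spark
runtime. A minimal sketch for checking which Scala version a Spark build was
compiled with, assuming a stock PySpark session where the py4j gateway is
reachable as shown (the app name is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scala-version-check").getOrCreate()
sc = spark.sparkContext

# Spark version of the running build
print(sc.version)
# Scala version of the JVM side, reached through py4j (e.g. "version 2.12.17")
print(sc._jvm.scala.util.Properties.versionString())

If the connector jar was built for a different Scala line than what this
prints, the save will fail with exactly this kind of NoSuchMethodError.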


On Mon, Aug 21, 2023 at 5:07 PM Sean Owen  wrote:

> It is. But you have a third party library in here which seems to require a
> different version.
>
> On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:
>
>> OK, it was my impression that Scala was packaged with Spark to avoid a
>> mismatch:
>> https://spark.apache.org/downloads.html
>>
>> It looks like Spark 3.4.1 (my version) uses Scala 2.12.
>> How do I specify the Scala version?
>>
>> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>>
>>> That's a mismatch between the version of Scala that your library uses and
>>> the version Spark uses.
>>>
>>> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>>>
 I am having a hard time figuring out what I am doing wrong here.
 I am not sure if I have an incompatible version of something installed
 or something else.
 I cannot find anything relevant on Google to figure out what I am doing
 wrong.
 I am using *Spark 3.4.1* and *Python 3.10*.

 This is my code to save my dataframe
 urls = []
 pull_sitemap_xml(robot, urls)
 df = spark.createDataFrame(data=urls, schema=schema)
 df.write.format("org.apache.phoenix.spark") \
 .mode("overwrite") \
 .option("table", "property") \
 .option("zkUrl", "192.168.1.162:2181") \
 .save()

 urls is an array of maps, containing a "url" and a "last_mod" field.

 Here is the error that I am getting

 Traceback (most recent call last):

   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
 main

 .save()

   File
 "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
 line 1396, in save

 self._jwrite.save()

   File
 "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
 line 1322, in __call__

 return_value = get_return_value(

   File
 "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
 line 169, in deco

 return f(*a, **kw)

   File
 "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
 line 326, in get_return_value

 raise Py4JJavaError(

 py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.

 : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
 scala.Predef$.refArrayOps(java.lang.Object[])'

 at
 org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)

 at
 org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)

 at
 org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)

 at
 org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)

 at
 org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)

 at
 org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)

 at
 org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)

>>>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
It is. But you have a third party library in here which seems to require a
different version.
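
You don't pick a Scala version per job; a given Spark 3.4.1 build is compiled
against one Scala line (2.12 for the default downloads), so the library has to
match it. A rough sketch of pointing the session at a connector jar built for
the matching Scala line; the jar path below is a placeholder, not a verified
artifact name:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("phoenix-write")
    # placeholder path: substitute a Phoenix connector built for Scala 2.12 / Spark 3
    .config("spark.jars", "/path/to/phoenix-connector-built-for-scala-2.12.jar")
    .getOrCreate()
)

The same idea works via spark.jars.packages with Maven coordinates, as long as
the chosen artifact targets the Scala version your Spark build reports.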

On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:

> OK, it was my impression that Scala was packaged with Spark to avoid a
> mismatch:
> https://spark.apache.org/downloads.html
>
> It looks like Spark 3.4.1 (my version) uses Scala 2.12.
> How do I specify the Scala version?
>
> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>
>> That's a mismatch between the version of Scala that your library uses and
>> the version Spark uses.
>>
>> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>>
>>> I am having a hard time figuring out what I am doing wrong here.
>>> I am not sure if I have an incompatible version of something installed
>>> or something else.
>>> I cannot find anything relevant on Google to figure out what I am doing
>>> wrong.
>>> I am using *Spark 3.4.1* and *Python 3.10*.
>>>
>>> This is my code to save my dataframe
>>> urls = []
>>> pull_sitemap_xml(robot, urls)
>>> df = spark.createDataFrame(data=urls, schema=schema)
>>> df.write.format("org.apache.phoenix.spark") \
>>> .mode("overwrite") \
>>> .option("table", "property") \
>>> .option("zkUrl", "192.168.1.162:2181") \
>>> .save()
>>>
>>> urls is an array of maps, containing a "url" and a "last_mod" field.
>>>
>>> Here is the error that I am getting
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
>>> main
>>>
>>> .save()
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>> line 1396, in save
>>>
>>> self._jwrite.save()
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>>> line 1322, in __call__
>>>
>>> return_value = get_return_value(
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
>>> line 169, in deco
>>>
>>> return f(*a, **kw)
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
>>> line 326, in get_return_value
>>>
>>> raise Py4JJavaError(
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>>>
>>> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
>>> scala.Predef$.refArrayOps(java.lang.Object[])'
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>>>
>>> at
>>> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>>>
>>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
OK, it was my impression that Scala was packaged with Spark to avoid a
mismatch:
https://spark.apache.org/downloads.html

It looks like Spark 3.4.1 (my version) uses Scala 2.12.
How do I specify the Scala version?

On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:

> That's a mismatch between the version of Scala that your library uses and
> the version Spark uses.
>
> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>
>> I am having a hard time figuring out what I am doing wrong here.
>> I am not sure if I have an incompatible version of something installed or
>> something else.
>> I cannot find anything relevant on Google to figure out what I am doing
>> wrong.
>> I am using *Spark 3.4.1* and *Python 3.10*.
>>
>> This is my code to save my dataframe
>> urls = []
>> pull_sitemap_xml(robot, urls)
>> df = spark.createDataFrame(data=urls, schema=schema)
>> df.write.format("org.apache.phoenix.spark") \
>> .mode("overwrite") \
>> .option("table", "property") \
>> .option("zkUrl", "192.168.1.162:2181") \
>> .save()
>>
>> urls is an array of maps, containing a "url" and a "last_mod" field.
>>
>> Here is the error that I am getting
>>
>> Traceback (most recent call last):
>>
>>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
>> main
>>
>> .save()
>>
>>   File
>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>> line 1396, in save
>>
>> self._jwrite.save()
>>
>>   File
>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>> line 1322, in __call__
>>
>> return_value = get_return_value(
>>
>>   File
>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
>> line 169, in deco
>>
>> return f(*a, **kw)
>>
>>   File
>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
>> line 326, in get_return_value
>>
>> raise Py4JJavaError(
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>>
>> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
>> scala.Predef$.refArrayOps(java.lang.Object[])'
>>
>> at
>> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>>
>> at
>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>>
>> at
>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>>
>> at
>> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>>
>> at
>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>>
>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch between the version of Scala that your library uses and
the version Spark uses.

On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:

> I am having a hard time figuring out what I am doing wrong here.
> I am not sure if I have an incompatible version of something installed or
> something else.
> I cannot find anything relevant on Google to figure out what I am doing
> wrong.
> I am using *Spark 3.4.1* and *Python 3.10*.
>
> This is my code to save my dataframe
> urls = []
> pull_sitemap_xml(robot, urls)
> df = spark.createDataFrame(data=urls, schema=schema)
> df.write.format("org.apache.phoenix.spark") \
> .mode("overwrite") \
> .option("table", "property") \
> .option("zkUrl", "192.168.1.162:2181") \
> .save()
>
> urls is an array of maps, containing a "url" and a "last_mod" field.
>
> Here is the error that I am getting
>
> Traceback (most recent call last):
>
>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
> main
>
> .save()
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 1396, in save
>
> self._jwrite.save()
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
> line 1322, in __call__
>
> return_value = get_return_value(
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
> line 169, in deco
>
> return f(*a, **kw)
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
> line 326, in get_return_value
>
> raise Py4JJavaError(
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>
> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
> scala.Predef$.refArrayOps(java.lang.Object[])'
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>
> at
> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>
> at
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>


error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
I am having a hard time figuring out what I am doing wrong here.
I am not sure if I have an incompatible version of something installed or
something else.
I cannot find anything relevant on Google to figure out what I am doing
wrong.
I am using *Spark 3.4.1* and *Python 3.10*.

This is my code to save my dataframe
urls = []
pull_sitemap_xml(robot, urls)
df = spark.createDataFrame(data=urls, schema=schema)
df.write.format("org.apache.phoenix.spark") \
.mode("overwrite") \
.option("table", "property") \
.option("zkUrl", "192.168.1.162:2181") \
.save()

urls is an array of maps, containing a "url" and a "last_mod" field.
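
For completeness, a minimal sketch of what the schema and rows could look like
given the two fields named above; the field types and the sample values are
assumptions for illustration, not taken from the real job:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("url", StringType(), nullable=False),
    StructField("last_mod", StringType(), nullable=True),
])

# hypothetical sample rows; each dict becomes one DataFrame row
urls = [
    {"url": "https://example.com/listing/1", "last_mod": "2023-08-20"},
    {"url": "https://example.com/listing/2", "last_mod": "2023-08-21"},
]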

Here is the error that I am getting

Traceback (most recent call last):

  File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in main

.save()

  File
"/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
line 1396, in save

self._jwrite.save()

  File
"/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
line 1322, in __call__

return_value = get_return_value(

  File
"/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
line 169, in deco

return f(*a, **kw)

  File
"/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
line 326, in get_return_value

raise Py4JJavaError(

py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.

: java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
scala.Predef$.refArrayOps(java.lang.Object[])'

at
org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)

at
org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)

at
org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)

at
org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)

at
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)


DataFrame cache keeps growing

2023-08-21 Thread Varun .N
Hi Team,

While trying to understand a problem where the size of a cached DataFrame
keeps growing, I realized that a similar question was asked a couple of
years ago.

Need your help in resolving this.
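
For context, a minimal sketch of the explicit-release pattern in question,
under the assumption that DataFrames are being persisted repeatedly (for
example in a loop) and never unpersisted; the row count here is made up:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-check").getOrCreate()

df = spark.range(1_000_000)              # stand-in for the real DataFrame
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                               # materializes the cached blocks

# ... work with df ...

df.unpersist(blocking=True)              # frees this DataFrame's cached blocks
spark.catalog.clearCache()               # or drop everything cached in the session

The Storage tab of the Spark UI shows whether old cached DataFrames are
actually being released.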

Regards,
Varun


Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Bjørn Jørgensen
In your file /home/spark/real-estate/pullhttp/pull_apartments.py, replace
import org.apache.spark.SparkContext with from pyspark import SparkContext.
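
In other words, something along these lines; a minimal sketch in which the app
name and the small RDD are only there to show the import being exercised:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pull_apartments")
sc = SparkContext.getOrCreate(conf)

# quick sanity check that the Python-side SparkContext works
print(sc.parallelize([1, 2, 3]).sum())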

On Mon, Aug 21, 2023 at 3:13 PM Kal Stevens  wrote:

> I am getting a class not found error
> import org.apache.spark.SparkContext
>
> It sounds like this is because pyspark is not installed, but as far as I
> can tell it is.
> Pyspark is installed in the correct Python version
>
>
> root@namenode:/home/spark/# pip3.10 install pyspark
> Requirement already satisfied: pyspark in
> /usr/local/lib/python3.10/dist-packages (3.4.1)
> Requirement already satisfied: py4j==0.10.9.7 in
> /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
>
>
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.1
>   /_/
>
> Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
> Spark context Web UI available at http://namenode:4040
> Spark context available as 'sc' (master = yarn, app id =
> application_1692452853354_0008).
> SparkSession available as 'spark'.
> Traceback (most recent call last):
>   File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in
> 
> import org.apache.spark.SparkContext
> ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
> 2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
> is stopping with exitCode 0.
> 2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
> Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
> at http://namenode:4040
> 2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
> cluster.YarnClientSchedulerBackend: Interrupting monitor thread
> 2023-08-20T19:45:19,260 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: Shutting down all executors
> 2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
> cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
> shut down
> 2023-08-20T19:45:19,263 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
> 2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
> spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
> stopped!
> 2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
> cleared
> 2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager:
> BlockManager stopped
> 2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
> BlockManagerMaster stopped
> 2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
> stopped SparkContext
> 2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Shutdown hook called
> 2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory
> /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
> 2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
> 2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
> 2023-08-20T19:45:19,691 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
I am getting a class not found error
import org.apache.spark.SparkContext

It sounds like this is because pyspark is not installed, but as far as I
can tell it is.
Pyspark is installed in the correct Python version


root@namenode:/home/spark/# pip3.10 install pyspark
Requirement already satisfied: pyspark in
/usr/local/lib/python3.10/dist-packages (3.4.1)
Requirement already satisfied: py4j==0.10.9.7 in
/usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)


    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
  /_/

Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
Spark context Web UI available at http://namenode:4040
Spark context available as 'sc' (master = yarn, app id =
application_1692452853354_0008).
SparkSession available as 'spark'.
Traceback (most recent call last):
  File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in

import org.apache.spark.SparkContext
ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
is stopping with exitCode 0.
2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
at http://namenode:4040
2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
cluster.YarnClientSchedulerBackend: Interrupting monitor thread
2023-08-20T19:45:19,260 INFO  [Thread-5]
cluster.YarnClientSchedulerBackend: Shutting down all executors
2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
shut down
2023-08-20T19:45:19,263 INFO  [Thread-5]
cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
stopped!
2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
cleared
2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager: BlockManager
stopped
2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
BlockManagerMaster stopped
2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
stopped SparkContext
2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Shutdown hook called
2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory
/tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
2023-08-20T19:45:19,691 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64


Spark doesn’t create SUCCESS file when external path is passed

2023-08-21 Thread Dipayan Dev
Hi Team,

I need some help: could someone replicate the issue at their end, or let me
know if I am doing anything wrong?

https://issues.apache.org/jira/browse/SPARK-44884

We have recently upgraded to Spark 3.3.0 in our production Dataproc.
We have a lot of downstream applications that rely on the SUCCESS file.

Please let me know if this is a bug or if I need any additional
configuration to fix this in Spark 3.3.0.

Happy to contribute if you can suggest an approach.
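
For reference, one configuration that seems related (an assumption on my part,
not a confirmed cause of SPARK-44884): with the default Hadoop
FileOutputCommitter, the _SUCCESS marker is controlled by
mapreduce.fileoutputcommitter.marksuccessfuljobs, which can be passed through
Spark's spark.hadoop.* prefix. A sketch, with a placeholder output path:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("success-marker-check")
    # ask the Hadoop committer to write the _SUCCESS marker explicitly
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "true")
    .getOrCreate()
)

# placeholder external path, for illustration only
spark.range(10).write.mode("overwrite").parquet("gs://some-bucket/some-external-path")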
-- 



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop*
M.Tech (AI), IISc, Bangalore


Unsubscribe

2023-08-21 Thread Umesh Bansal



Re: k8s+ YARN Spark

2023-08-21 Thread Mich Talebzadeh
Interesting.

Spark supports the following cluster managers:

   - Standalone: a simple cluster manager, limited in features, shipped with
   Spark.
   - Apache Hadoop YARN: the most widely used resource manager, not just for
   Spark but for other workloads as well. On-premise, YARN is used
   extensively; in the cloud it is also widely used in managed services such
   as Google Dataproc.
   - Kubernetes (k8s): Spark has run natively on Kubernetes since Spark 2.3.
   - Apache Mesos: an open-source cluster manager that was once popular but
   is now in decline.

Now, as I understand it, you are utilising both Spark standalone and k8s.
What is perhaps missing is an architecture diagram of your setup. Do you
have one, or can you create such a diagram?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 21 Aug 2023 at 08:31, Крюков Виталий Семенович
 wrote:

>
> Good afternoon.
> Perhaps you will be discouraged by what I write below, but nevertheless I
> ask for help in solving my problem. Perhaps the architecture of our
> solution will not seem correct to you.
> There are backend services that communicate with a service that implements
> the spark-driver. When that service starts, spark-submit occurs and the
> session lives until the service stops. The service runs constantly.
> We ran into problems when we began to deploy our solution in k8s. The
> services were located inside the k8s cluster and the Spark standalone
> cluster outside it. When the spark-driver service starts, spark-submit is
> executed, and the application shows up on the Spark UI. But on the workers
> we get an error that the worker could not connect to a random port on the
> spark-driver. We have learned to override and specify these ports, but they
> must be accessible from outside the k8s cluster. We found a solution in
> which we open a NodePort on the workers - it works, BUT this is not
> suitable for most customers due to internal regulations. We never found a
> way to resolve the issue through an ingress.
>
> *with best regards,*
>
> *Vitaly Kryukov*
> [image: 1c330227-6767-4cc2-bd95-69fd1fe6b3e7]
>


Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Mich Talebzadeh
This should work.

Check your path. It should pick up pyspark from:

which pyspark
/opt/spark/bin/pyspark

And your installation should contain

cd $SPARK_HOME
/opt/spark> ls
LICENSE  NOTICE  R  README.md  RELEASE  bin  conf  data  examples  jars
kubernetes  licenses  logs  python  sbin  yarn

You should use

from pyspark import SparkConf, SparkContext

And this is your problem

Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
  /_/

Using Python version 3.9.16 (main, Apr 22 2023 14:16:13)
Spark context Web UI available at http://rhes76:4040
Spark context available as 'sc' (master = local[*], app id =
local-1692606989942).
SparkSession available as 'spark'.
>>> import org.apache.spark.SparkContext
Traceback (most recent call last):
  File "", line 1, in 
*ModuleNotFoundError: No module named 'org'*
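
A small sketch for confirming which interpreter and which pyspark installation
the driver actually picks up; the paths it prints will of course differ on
your machine:

import sys
import pyspark

print(sys.executable)        # the Python interpreter in use
print(pyspark.__version__)   # should report 3.4.1 in this case
print(pyspark.__file__)      # where that pyspark package lives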

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 21 Aug 2023 at 07:12, Kal Stevens  wrote:

> Are there installation instructions for Spark 3.4.1?
>
> I defined SPARK_HOME as it describes here
>
> https://spark.apache.org/docs/latest/api/python/getting_started/install.html
>
> ls $SPARK_HOME/python/lib
> py4j-0.10.9.7-src.zip  PY4J_LICENSE.txt  pyspark.zip
>
>
> I am getting a class not found error
> import org.apache.spark.SparkContext
>
> I also unzipped those files just in case but that gives the same error.
>
>
> It sounds like this is because pyspark is not installed, but as far as I
> can tell it is.
> Pyspark is installed in the correct Python version
>
>
> root@namenode:/home/spark/# pip3.10 install pyspark
> Requirement already satisfied: pyspark in
> /usr/local/lib/python3.10/dist-packages (3.4.1)
> Requirement already satisfied: py4j==0.10.9.7 in
> /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
>
>
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.1
>   /_/
>
> Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
> Spark context Web UI available at http://namenode:4040
> Spark context available as 'sc' (master = yarn, app id =
> application_1692452853354_0008).
> SparkSession available as 'spark'.
> Traceback (most recent call last):
>   File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in
> 
> import org.apache.spark.SparkContext
> ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
> 2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
> is stopping with exitCode 0.
> 2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
> Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
> at http://namenode:4040
> 2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
> cluster.YarnClientSchedulerBackend: Interrupting monitor thread
> 2023-08-20T19:45:19,260 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: Shutting down all executors
> 2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
> cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
> shut down
> 2023-08-20T19:45:19,263 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
> 2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
> spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
> stopped!
> 2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
> cleared
> 2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager:
> BlockManager stopped
> 2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
> BlockManagerMaster stopped
> 2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
> stopped SparkContext
> 2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Shutdown hook called
> 2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory
> /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
> 2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
> 2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] 

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
Never mind, I was doing something dumb.

On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens  wrote:

> Are there installation instructions for Spark 3.4.1?
>
> I defined SPARK_HOME as it describes here
>
> https://spark.apache.org/docs/latest/api/python/getting_started/install.html
>
> ls $SPARK_HOME/python/lib
> py4j-0.10.9.7-src.zip  PY4J_LICENSE.txt  pyspark.zip
>
>
> I am getting a class not found error
> import org.apache.spark.SparkContext
>
> I also unzipped those files just in case but that gives the same error.
>
>
> It sounds like this is because pyspark is not installed, but as far as I
> can tell it is.
> Pyspark is installed in the correct Python version
>
>
> root@namenode:/home/spark/# pip3.10 install pyspark
> Requirement already satisfied: pyspark in
> /usr/local/lib/python3.10/dist-packages (3.4.1)
> Requirement already satisfied: py4j==0.10.9.7 in
> /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
>
>
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.1
>   /_/
>
> Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
> Spark context Web UI available at http://namenode:4040
> Spark context available as 'sc' (master = yarn, app id =
> application_1692452853354_0008).
> SparkSession available as 'spark'.
> Traceback (most recent call last):
>   File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in
> 
> import org.apache.spark.SparkContext
> ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
> 2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
> is stopping with exitCode 0.
> 2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
> Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
> at http://namenode:4040
> 2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
> cluster.YarnClientSchedulerBackend: Interrupting monitor thread
> 2023-08-20T19:45:19,260 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: Shutting down all executors
> 2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
> cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
> shut down
> 2023-08-20T19:45:19,263 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
> 2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
> spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
> stopped!
> 2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
> cleared
> 2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager:
> BlockManager stopped
> 2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
> BlockManagerMaster stopped
> 2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
> stopped SparkContext
> 2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Shutdown hook called
> 2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory
> /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
> 2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
> 2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
> 2023-08-20T19:45:19,691 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64
>
>


k8s+ YARN Spark

2023-08-21 Thread Крюков Виталий Семенович

Good afternoon.
Perhaps you will be discouraged by what I write below, but nevertheless I ask
for help in solving my problem. Perhaps the architecture of our solution will
not seem correct to you.
There are backend services that communicate with a service that implements the
spark-driver. When that service starts, spark-submit occurs and the session
lives until the service stops. The service runs constantly.
We ran into problems when we began to deploy our solution in k8s. The services
were located inside the k8s cluster and the Spark standalone cluster outside
it. When the spark-driver service starts, spark-submit is executed, and the
application shows up on the Spark UI. But on the workers we get an error that
the worker could not connect to a random port on the spark-driver. We have
learned to override and specify these ports, but they must be accessible from
outside the k8s cluster. We found a solution in which we open a NodePort on
the workers - it works, BUT this is not suitable for most customers due to
internal regulations. We never found a way to resolve the issue through an
ingress.
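
For reference, pinning the driver-side ports and address looks roughly like
this; the property names are standard Spark settings, while the master URL,
hostname and port numbers below are placeholders. The ports still have to be
routable from the standalone workers back to the pod, which is the part
blocked by our customers' regulations, so this only removes the randomness:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://standalone-master:7077")            # placeholder master URL
    .config("spark.driver.bindAddress", "0.0.0.0")        # bind inside the pod
    .config("spark.driver.host", "driver.example.com")    # address workers dial back to
    .config("spark.driver.port", "40000")                 # fixed driver RPC port
    .config("spark.driver.blockManager.port", "40001")    # fixed block manager port
    .getOrCreate()
)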

with best regards,

Vitaly Kryukov