Re: Spark 3.5.0 bug - Writing a small parquet dataframe to storage using spark 3.5.0 taking too long

2024-08-06 Thread Bijoy Deb
Hi Spark community,

Any resolution would be highly appreciated.

A few additional observations from my side:

The lag in writing parquet exists in Spark 3.5.0, but there is no lag in Spark
3.1.2 or 2.4.5.

Also, I found that the WholeStageCodegen(1) --> ColumnarToRow task is the one
taking the most time (almost 3 minutes for a simple 3 MB file) in Spark 3.5.0.
The input batch size of this stage is 10, and the output record count is
30,000. The same ColumnarToRow task in Spark 3.1.2 finishes in 10 seconds.
Further, with Spark 3.5.0, if I cache the dataframe, materialise it using
df.count(), and then write the df to a parquet file, ColumnarToRow gets called
twice: the first call takes 10 seconds and the second one 3 minutes.
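
A minimal sketch of the comparison I would run next, assuming the standard
spark.sql.parquet.enableVectorizedReader switch (turning it off removes the
columnar scan and hence the ColumnarToRow conversion), to confirm whether that
stage is really the bottleneck; the paths are the same placeholders as in the
code below:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar_to_row_check").getOrCreate()

for vectorized in ("true", "false"):
    # spark.sql.parquet.enableVectorizedReader is a standard Spark SQL config;
    # the timings printed here are only for comparison.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", vectorized)
    df = spark.read.parquet("gs://input_folder/input1/key=20240610")
    start = time.time()
    df.write.mode("overwrite").parquet("gs://output_folder/output/check_" + vectorized)
    print(f"vectorized={vectorized}: write took {time.time() - start:.1f}s")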

On Wed, 31 Jul, 2024, 10:14 PM Bijoy Deb,  wrote:

> Hi,
>
> We are using Spark on-premise to simply read a parquet file from
> GCS (Google Cloud Storage) into a DataFrame and write the DataFrame into
> another folder in parquet format in GCS, using the code below:
>
> 
>
> import time
> from pyspark.sql import SparkSession
>
> DFS_BLOCKSIZE = 512 * 1024 * 1024
>
> spark = SparkSession.builder \
>     .appName("test_app_parquet_load") \
>     .config("spark.master", "spark://spark-master-svc:7077") \
>     .config("spark.driver.maxResultSize", '1g') \
>     .config("spark.driver.memory", '1g') \
>     .config("spark.executor.cores", 4) \
>     .config("spark.sql.shuffle.partitions", 16) \
>     .config("spark.sql.files.maxPartitionBytes", DFS_BLOCKSIZE) \
>     .getOrCreate()
>
>
> folder="gs://input_folder/input1/key=20240610"
> print(f"reading parquet from {folder}")
>
> start_time1 = time.time()
>
> data_df = spark.read.parquet(folder)
>
> end_time1 = time.time()
> print(f"Time duration for reading parquet t1: {end_time1 - start_time1}")
>
>
> start_time2 = time.time()
>
> data_df.write.mode("overwrite").parquet("gs://output_folder/output/key=20240610")
>
> end_time2 = time.time()
> print(f"Time duration for writing parquet t3: {end_time2 - start_time2}")
>
> spark.stop()
>
>
>
>
>
>
>
>
>
>
> However, we observed a drastic time difference between Spark 2.4.5 and
> 3.5.0 in the writing process. Even with the local filesystem instead of
> GCS, Spark 3.5.0 takes a long time.
>
> In Spark 2.4.5, the above code takes about 10 seconds to read the parquet
> and 20 seconds to write it, while in Spark 3.5.0 the read takes a similar
> time but the write takes nearly 3 minutes. The size of the file is just 3 MB.
> Further, we have noticed that if we read a CSV file instead of parquet into
> the DataFrame and write it to another folder in parquet format, Spark 3.5.0
> takes relatively less time to write, about 30-40 seconds.
>
> So, it looks like only the combination of reading a parquet file into a
> dataframe and writing that dataframe to another parquet file is taking too
> long in Spark 3.5.0.
>
> We are also seeing no slowness with Spark 3.1.2. So it seems that the issue
> of the Spark job taking too long to write a parquet-based dataframe into
> another parquet file (in GCS or the local filesystem) is specific to Spark
> 3.5.0. It looks to be either a potential bug in Spark 3.5.0 or some
> parquet-related configuration that is not clearly documented.
> Any help in this regard would be highly appreciated.
>
>
> Thanks,
>
> Bijoy
>


problem using spark 3.4 with spot instances

2024-07-18 Thread wafa gabouj
Hello,

I am working with Spark on Dataiku, using spot instances on S3 (not on-demand
instances). I had no problem until I moved from Spark 3.3 to Spark 3.4! I
always get this failure and could not understand what configuration in the
new version of Spark leads to it:

*org.apache.spark.SparkException: Job aborted due to stage failure:
Authorized committer (attemptNumber=0, stage=2, partition=357) failed; but
task commit success, data duplication may happen.*
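
For reference, below is a minimal sketch of the graceful-decommissioning
settings I understand are meant to help when spot executors are reclaimed
(these are the standard spark.decommission.* options; whether they avoid this
particular "authorized committer" failure is an assumption I have not
verified):

# Hedged sketch: let Spark migrate blocks off executors that are about to be
# reclaimed (spot interruptions) instead of losing them abruptly.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spot_decommission_sketch")
    .config("spark.decommission.enabled", "true")
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate()
)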
Here is the full log

[09:46:42] [INFO] [dku.utils]  - [2024/07/17-09:46:42.680] [Thread-6]
[ERROR] [org.apache.spark.sql.execution.datasources.FileFormatWriter]
- Aborting job 2271622d-848a-4719-ad86-81d951235dbb.
[09:46:42] [INFO] [dku.utils]  - org.apache.spark.SparkException: Job
aborted due to stage failure: Authorized committer (attemptNumber=0,
stage=2, partition=715) failed; but task commit success, data
duplication may happen. reason=ExecutorLostFailure(1,false,Some(The
executor with id 1 was deleted by a user or the framework.))
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
~[scala-library-2.12.17.jar:?]
[09:46:42] [INFO] [dku.utils]  -at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
~[scala-library-2.12.17.jar:?]
[09:46:42] [INFO] [dku.utils]  -at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
~[scala-library-2.12.17.jar:?]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
~[spark-core_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
~[spark-sql_2.12-3.4.1.jar:3.4.1]
[09:46:42] [INFO] [dku.utils]  -at
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
~[s

running snowflake query using spark connect on a standalone cluster

2024-07-07 Thread Prabodh Agarwal
I have configured a spark standalone cluster as follows:

```
# start spark master
$SPARK_HOME/sbin/start-master.sh

# start 2 spark workers
SPARK_WORKER_INSTANCES=2 $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077

# start spark connect
$SPARK_HOME/sbin/start-connect-server.sh --properties-file ./connect.properties --master spark://localhost:7077
```

My properties file is defined as follows:

```
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.plugins=io.dataflint.spark.SparkDataflintPlugin
spark.jars.packages=org.apache.spark:spark-connect_2.12:3.5.1,org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hudi:hudi-aws:0.15.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0,org.apache.spark:spark-avro_2.12:3.5.1,software.amazon.awssdk:sso:2.18.40,io.dataflint:spark_2.12:0.2.2,net.snowflake:spark-snowflake_2.12:2.16.0-spark_3.4,net.snowflake:snowflake-jdbc:3.16.1
spark.driver.extraJavaOptions=-verbose:class
spark.executor.extraJavaOptions=-verbose:class
```

Now I start my pyspark job, which connects to this remote instance and then
tries to query the table. The Snowflake query is fired correctly (it shows up
in my Snowflake query history), but then I start getting failures.
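
For reference, the client side is essentially the following minimal sketch
(the Spark Connect endpoint and all the sf* option values are placeholders,
not my real configuration):

```
from pyspark.sql import SparkSession

# Connect to the standalone cluster through the Spark Connect server
# (the default Connect port 15002 is assumed here).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .option("sfURL", "<account>.snowflakecomputing.com")
    .option("sfUser", "<user>")
    .option("sfPassword", "<password>")
    .option("sfDatabase", "<database>")
    .option("sfSchema", "<schema>")
    .option("sfWarehouse", "<warehouse>")
    .option("dbtable", "<table>")
    .load()
)
df.show()
```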

```
24/07/07 22:37:26 INFO ErrorUtils: Spark Connect error during: execute.
UserId: pbd. SessionId: 462444eb-82d3-475b-8dd0-ce35d5047405.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 3) (192.168.29.6 executor 1): java.lang.ClassNotFoundException:
net.snowflake.spark.snowflake.io.SnowflakeResultSetPartition
at
org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:124)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:594)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:71)
at
java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2003)
at
java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1870)
at
java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2201)
at
java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
at
java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
at
java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
at
java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
at
java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
at
java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
at
java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:579)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException:
net.snowflake.spark.snowflake.io.SnowflakeResultSetPartition
at java.base/java.lang.ClassLoader.findClass(ClassLoader.java:724)
at
org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:594)
at
org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
at
org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:109)
... 21 more

Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
at scala.Option.foreach(Option.scala:407)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
at
org.

ML using Spark Connect

2023-12-01 Thread Faiz Halde
Hello,

Is it possible to run SparkML using Spark Connect 3.5.0? So far I've had no
success setting up a Connect client that uses the ML package.

The ML package uses spark core/sql, AFAIK, which seems to be shadowing the
Spark Connect client classes.

Do I have to exclude any dependencies from the mllib dependency to make it
runnable as a Connect client? I thought the ML package was entirely
dependent on the DataFrame API.

I appreciate your help

Thanks
Faiz


Re: Error using SPARK with Rapid GPU

2022-11-30 Thread Alessandro Bellina
Vajiha filed a spark-rapids discussion here
https://github.com/NVIDIA/spark-rapids/discussions/7205, so if you are
interested please follow there.

On Wed, Nov 30, 2022 at 7:17 AM Vajiha Begum S A <
vajihabegu...@maestrowiz.com> wrote:

> Hi,
> I'm using an Ubuntu system with an NVIDIA Quadro K1200 with 20 GB of GPU
> memory.
> Installed: the CUDF 22.10.0 jar file, the rapids-4-spark_2.12-22.10.0 jar
> file, CUDA Toolkit 11.8.0 (Linux version), and Java 8.
> I'm running only a single server; the master is localhost.
>
> I'm trying to run pyspark code through spark-submit and Python IDLE, and I'm
> getting errors. Kindly help me to resolve this error and give suggestions
> on where I have made mistakes.
>
> *Error when running code through spark-submit:*
>spark-submit /home/mwadmin/Documents/test.py
> 22/11/30 14:59:32 WARN Utils: Your hostname, mwadmin-HP-Z440-Workstation
> resolves to a loopback address: 127.0.1.1; using ***.***.**.** instead (on
> interface eno1)
> 22/11/30 14:59:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
> another address
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 22/11/30 14:59:32 INFO SparkContext: Running Spark version 3.2.2
> 22/11/30 14:59:32 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 22/11/30 14:59:33 INFO ResourceUtils:
> ==
> 22/11/30 14:59:33 INFO ResourceUtils: No custom resources configured for
> spark.driver.
> 22/11/30 14:59:33 INFO ResourceUtils:
> ==
> 22/11/30 14:59:33 INFO SparkContext: Submitted application: Spark.com
> 22/11/30 14:59:33 INFO ResourceProfile: Default ResourceProfile created,
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor:
> , memory -> name: memory, amount: 1024, script: , vendor: , offHeap ->
> name: offHeap, amount: 0, script: , vendor: , gpu -> name: gpu, amount: 1,
> script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0,
> gpu -> name: gpu, amount: 0.5)
> 22/11/30 14:59:33 INFO ResourceProfile: Limiting resource is cpus at 1
> tasks per executor
> 22/11/30 14:59:33 WARN ResourceUtils: The configuration of resource: gpu
> (exec = 1, task = 0.5/2, runnable tasks = 2) will result in wasted
> resources due to resource cpus limiting the number of runnable tasks per
> executor to: 1. Please adjust your configuration.
> 22/11/30 14:59:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
> 22/11/30 14:59:33 INFO SecurityManager: Changing view acls to: mwadmin
> 22/11/30 14:59:33 INFO SecurityManager: Changing modify acls to: mwadmin
> 22/11/30 14:59:33 INFO SecurityManager: Changing view acls groups to:
> 22/11/30 14:59:33 INFO SecurityManager: Changing modify acls groups to:
> 22/11/30 14:59:33 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(mwadmin);
> groups with view permissions: Set(); users  with modify permissions:
> Set(mwadmin); groups with modify permissions: Set()
> 22/11/30 14:59:33 INFO Utils: Successfully started service 'sparkDriver'
> on port 45883.
> 22/11/30 14:59:33 INFO SparkEnv: Registering MapOutputTracker
> 22/11/30 14:59:33 INFO SparkEnv: Registering BlockManagerMaster
> 22/11/30 14:59:33 INFO BlockManagerMasterEndpoint: Using
> org.apache.spark.storage.DefaultTopologyMapper for getting topology
> information
> 22/11/30 14:59:33 INFO BlockManagerMasterEndpoint:
> BlockManagerMasterEndpoint up
> 22/11/30 14:59:33 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 22/11/30 14:59:33 INFO DiskBlockManager: Created local directory at
> /tmp/blockmgr-647d2c2a-72e4-402d-aeff-d7460726eb6d
> 22/11/30 14:59:33 INFO MemoryStore: MemoryStore started with capacity
> 366.3 MiB
> 22/11/30 14:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
> 22/11/30 14:59:33 INFO Utils: Successfully started service 'SparkUI' on
> port 4040.
> 22/11/30 14:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> htttp://localhost:4040
> 22/11/30 14:59:33 INFO ShimLoader: Loading shim for Spark version: 3.2.2
> 22/11/30 14:59:33 INFO ShimLoader: Complete Spark build info: 3.2.2,
> https://github.com/apache/spark, HEAD,
> 78a5825fe266c0884d2dd18cbca9625fa258d7f7, 2022-07-11T15:44:21Z
> 22/11/30 14:59:33 INFO ShimLoader: findURLClassLoader found a
> URLClassLoader org.apache.spark.util.MutableURLClassLoader@1530c739
> 22/11/30 14:59:33 INFO ShimLoader: Updating spark classloader
> org.apache.spark.util.MutableURLClassLoader@1530c739 with the URLs:
> jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark3xx-common/,
> jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark322/
> 22/11/30 14:59:33 INFO ShimLoader: Spark classLoader
> org.apache.spark.util.MutableURLClassLoader@1530c739 updated successfully
> 22/11/30 14:5

Error using SPARK with Rapid GPU

2022-11-30 Thread Vajiha Begum S A
Hi,
I'm using an Ubuntu system with an NVIDIA Quadro K1200 with 20 GB of GPU memory.
Installed: the CUDF 22.10.0 jar file, the rapids-4-spark_2.12-22.10.0 jar file,
CUDA Toolkit 11.8.0 (Linux version), and Java 8.
I'm running only a single server; the master is localhost.

I'm trying to run pyspark code through spark-submit and Python IDLE, and I'm
getting errors. Kindly help me to resolve this error and give suggestions on
where I have made mistakes.
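
For reference, here is a minimal sketch of how I understand the CPU/GPU task
slots are supposed to line up for the RAPIDS plugin (the warning in the log
below points at my current split; the discovery-script path and the amounts
here are assumptions, not my verified settings):

# Hedged sketch: align CPU and GPU task slots so that
# executor.cores / task.cpus == executor gpu amount / task gpu amount.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids_resource_sketch")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.executor.cores", "2")
    .config("spark.task.cpus", "1")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.5")
    # the path to Spark's GPU discovery script is an assumption
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/sparkRapidsPlugin/getGpusResources.sh")
    .getOrCreate()
)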

*Error when running code through spark-submit:*
   spark-submit /home/mwadmin/Documents/test.py
22/11/30 14:59:32 WARN Utils: Your hostname, mwadmin-HP-Z440-Workstation
resolves to a loopback address: 127.0.1.1; using ***.***.**.** instead (on
interface eno1)
22/11/30 14:59:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
22/11/30 14:59:32 INFO SparkContext: Running Spark version 3.2.2
22/11/30 14:59:32 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
22/11/30 14:59:33 INFO ResourceUtils:
==
22/11/30 14:59:33 INFO ResourceUtils: No custom resources configured for
spark.driver.
22/11/30 14:59:33 INFO ResourceUtils:
==
22/11/30 14:59:33 INFO SparkContext: Submitted application: Spark.com
22/11/30 14:59:33 INFO ResourceProfile: Default ResourceProfile created,
executor resources: Map(cores -> name: cores, amount: 1, script: , vendor:
, memory -> name: memory, amount: 1024, script: , vendor: , offHeap ->
name: offHeap, amount: 0, script: , vendor: , gpu -> name: gpu, amount: 1,
script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0,
gpu -> name: gpu, amount: 0.5)
22/11/30 14:59:33 INFO ResourceProfile: Limiting resource is cpus at 1
tasks per executor
22/11/30 14:59:33 WARN ResourceUtils: The configuration of resource: gpu
(exec = 1, task = 0.5/2, runnable tasks = 2) will result in wasted
resources due to resource cpus limiting the number of runnable tasks per
executor to: 1. Please adjust your configuration.
22/11/30 14:59:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/30 14:59:33 INFO SecurityManager: Changing view acls to: mwadmin
22/11/30 14:59:33 INFO SecurityManager: Changing modify acls to: mwadmin
22/11/30 14:59:33 INFO SecurityManager: Changing view acls groups to:
22/11/30 14:59:33 INFO SecurityManager: Changing modify acls groups to:
22/11/30 14:59:33 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(mwadmin);
groups with view permissions: Set(); users  with modify permissions:
Set(mwadmin); groups with modify permissions: Set()
22/11/30 14:59:33 INFO Utils: Successfully started service 'sparkDriver' on
port 45883.
22/11/30 14:59:33 INFO SparkEnv: Registering MapOutputTracker
22/11/30 14:59:33 INFO SparkEnv: Registering BlockManagerMaster
22/11/30 14:59:33 INFO BlockManagerMasterEndpoint: Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology
information
22/11/30 14:59:33 INFO BlockManagerMasterEndpoint:
BlockManagerMasterEndpoint up
22/11/30 14:59:33 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/30 14:59:33 INFO DiskBlockManager: Created local directory at
/tmp/blockmgr-647d2c2a-72e4-402d-aeff-d7460726eb6d
22/11/30 14:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3
MiB
22/11/30 14:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/30 14:59:33 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
22/11/30 14:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
htttp://localhost:4040
22/11/30 14:59:33 INFO ShimLoader: Loading shim for Spark version: 3.2.2
22/11/30 14:59:33 INFO ShimLoader: Complete Spark build info: 3.2.2,
https://github.com/apache/spark, HEAD,
78a5825fe266c0884d2dd18cbca9625fa258d7f7, 2022-07-11T15:44:21Z
22/11/30 14:59:33 INFO ShimLoader: findURLClassLoader found a
URLClassLoader org.apache.spark.util.MutableURLClassLoader@1530c739
22/11/30 14:59:33 INFO ShimLoader: Updating spark classloader
org.apache.spark.util.MutableURLClassLoader@1530c739 with the URLs:
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark3xx-common/,
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark322/
22/11/30 14:59:33 INFO ShimLoader: Spark classLoader
org.apache.spark.util.MutableURLClassLoader@1530c739 updated successfully
22/11/30 14:59:33 INFO ShimLoader: Updating spark classloader
org.apache.spark.util.MutableURLClassLoader@1530c739 with the URLs:
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark3xx-common/,
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark322/
22/11/30 14:59:33 INFO ShimLoader: Spark classLoader
org.apache.spark.util.Mutable

Error - using Spark with GPU

2022-11-30 Thread Vajiha Begum S A
 spark-submit /home/mwadmin/Documents/test.py
22/11/30 14:59:32 WARN Utils: Your hostname, mwadmin-HP-Z440-Workstation
resolves to a loopback address: 127.0.1.1; using ***.***.**.** instead (on
interface eno1)
22/11/30 14:59:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
22/11/30 14:59:32 INFO SparkContext: Running Spark version 3.2.2
22/11/30 14:59:32 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
22/11/30 14:59:33 INFO ResourceUtils:
==
22/11/30 14:59:33 INFO ResourceUtils: No custom resources configured for
spark.driver.
22/11/30 14:59:33 INFO ResourceUtils:
==
22/11/30 14:59:33 INFO SparkContext: Submitted application: Spark.com
22/11/30 14:59:33 INFO ResourceProfile: Default ResourceProfile created,
executor resources: Map(cores -> name: cores, amount: 1, script: , vendor:
, memory -> name: memory, amount: 1024, script: , vendor: , offHeap ->
name: offHeap, amount: 0, script: , vendor: , gpu -> name: gpu, amount: 1,
script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0,
gpu -> name: gpu, amount: 0.5)
22/11/30 14:59:33 INFO ResourceProfile: Limiting resource is cpus at 1
tasks per executor
22/11/30 14:59:33 WARN ResourceUtils: The configuration of resource: gpu
(exec = 1, task = 0.5/2, runnable tasks = 2) will result in wasted
resources due to resource cpus limiting the number of runnable tasks per
executor to: 1. Please adjust your configuration.
22/11/30 14:59:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/30 14:59:33 INFO SecurityManager: Changing view acls to: mwadmin
22/11/30 14:59:33 INFO SecurityManager: Changing modify acls to: mwadmin
22/11/30 14:59:33 INFO SecurityManager: Changing view acls groups to:
22/11/30 14:59:33 INFO SecurityManager: Changing modify acls groups to:
22/11/30 14:59:33 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(mwadmin);
groups with view permissions: Set(); users  with modify permissions:
Set(mwadmin); groups with modify permissions: Set()
22/11/30 14:59:33 INFO Utils: Successfully started service 'sparkDriver' on
port 45883.
22/11/30 14:59:33 INFO SparkEnv: Registering MapOutputTracker
22/11/30 14:59:33 INFO SparkEnv: Registering BlockManagerMaster
22/11/30 14:59:33 INFO BlockManagerMasterEndpoint: Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology
information
22/11/30 14:59:33 INFO BlockManagerMasterEndpoint:
BlockManagerMasterEndpoint up
22/11/30 14:59:33 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/30 14:59:33 INFO DiskBlockManager: Created local directory at
/tmp/blockmgr-647d2c2a-72e4-402d-aeff-d7460726eb6d
22/11/30 14:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3
MiB
22/11/30 14:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/30 14:59:33 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
22/11/30 14:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
htttp://localhost:4040
22/11/30 14:59:33 INFO ShimLoader: Loading shim for Spark version: 3.2.2
22/11/30 14:59:33 INFO ShimLoader: Complete Spark build info: 3.2.2,
https://github.com/apache/spark, HEAD,
78a5825fe266c0884d2dd18cbca9625fa258d7f7, 2022-07-11T15:44:21Z
22/11/30 14:59:33 INFO ShimLoader: findURLClassLoader found a
URLClassLoader org.apache.spark.util.MutableURLClassLoader@1530c739
22/11/30 14:59:33 INFO ShimLoader: Updating spark classloader
org.apache.spark.util.MutableURLClassLoader@1530c739 with the URLs:
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark3xx-common/,
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark322/
22/11/30 14:59:33 INFO ShimLoader: Spark classLoader
org.apache.spark.util.MutableURLClassLoader@1530c739 updated successfully
22/11/30 14:59:33 INFO ShimLoader: Updating spark classloader
org.apache.spark.util.MutableURLClassLoader@1530c739 with the URLs:
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark3xx-common/,
jar:file:/home/mwadmin/spark-3.2.2-bin-hadoop3.2/jars/rapids-4-spark_2.12-22.10.0.jar!/spark322/
22/11/30 14:59:33 INFO ShimLoader: Spark classLoader
org.apache.spark.util.MutableURLClassLoader@1530c739 updated successfully
22/11/30 14:59:33 INFO RapidsPluginUtils: RAPIDS Accelerator build:
{version=22.10.0, user=, url=https://github.com/NVIDIA/spark-rapids.git,
date=2022-10-17T11:25:41Z,
revision=c75a2eafc9ce9fb3e6ab75c6677d97bf681bff50, cudf_version=22.10.0,
branch=HEAD}
22/11/30 14:59:33 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build:
{version=22.10.0, user=, url=https://github.com/NVIDIA/spark-rapids-jni.git,
date=2022-10-14T05:19:41Z,
re

Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread ckgppl_yan
Oh, I got it. I thought Spark could pick up the local Scala version.
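
For anyone else checking, a quick way to confirm which Scala copy a prebuilt
Spark distribution ships (a minimal sketch, assuming SPARK_HOME points at the
Spark installation directory):

import glob
import os

# The bundled scala-library jar shows the Scala version Spark was built with,
# independent of whatever Scala is installed on the OS.
spark_home = os.environ["SPARK_HOME"]
print(glob.glob(os.path.join(spark_home, "jars", "scala-library-*.jar")))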
- Original Message -
From: Sean Owen 
To: ckgppl_...@sina.cn
Cc: user 
Subject: Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2
Date: 2022-08-26 21:08

Spark is built with and ships with a copy of Scala. It doesn't use your local 
version.
On Fri, Aug 26, 2022 at 2:55 AM  wrote:
Hi all,
I found a strange thing. I have run Spark 3.2.1 (prebuilt) in local mode. My OS
Scala version is 2.13.7. But when I run spark-submit and then check the Spark
UI, the web page shows that my Scala version is 2.13.5. I used spark-shell, and
it also showed that my Scala version is 2.13.5. Then I tried Spark 3.2.2, and it
also showed 2.13.5. I checked the code, and it seems that SparkEnv gets the
Scala version from "scala.util.Properties.versionString". Not sure why it shows
a different Scala version. Is it a bug or not?
Thanks
Liang

Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread pengyh

good answer. nice to know too.

Sean Owen wrote:

Spark is built with and ships with a copy of Scala. It doesn't use your
local version.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread Sean Owen
Spark is built with and ships with a copy of Scala. It doesn't use your
local version.

On Fri, Aug 26, 2022 at 2:55 AM  wrote:

> Hi all,
>
> I found a strange thing. I have run Spark 3.2.1 (prebuilt) in local mode. My
> OS Scala version is 2.13.7.
> But when I run spark-submit and then check the Spark UI, the web page shows
> that my Scala version is 2.13.5.
> I used spark-shell, and it also showed that my Scala version is 2.13.5.
> Then I tried Spark 3.2.2, and it also showed that my Scala version is 2.13.5.
> I checked the code, and it seems that SparkEnv gets the Scala version from
> "scala.util.Properties.versionString".
> Not sure why it shows a different Scala version. Is it a bug or not?
>
> Thanks
>
> Liang
>


Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread ckgppl_yan
Hi all,
I found a strange thing. I have run Spark 3.2.1 (prebuilt) in local mode. My OS
Scala version is 2.13.7. But when I run spark-submit and then check the Spark
UI, the web page shows that my Scala version is 2.13.5. I used spark-shell, and
it also showed that my Scala version is 2.13.5. Then I tried Spark 3.2.2, and it
also showed 2.13.5. I checked the code, and it seems that SparkEnv gets the
Scala version from "scala.util.Properties.versionString". Not sure why it shows
a different Scala version. Is it a bug or not?
Thanks
Liang

Re: [Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-18 Thread Gavin Ray
Following up on this in case anyone runs across it in the archives in the
future.
From reading through the config docs and trying various combinations, I've
discovered that:

- You don't want to disable codegen. In basic testing, this roughly doubled
the time to perform simple, few-column/few-row queries
  - You can test this by setting an internal property after setting
"spark.testing" to "true" in the system properties:


> System.setProperty("spark.testing", "true")
> val spark = SparkSession.builder()
>   .config("spark.sql.codegen.wholeStage", "false")
>   .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")
>   .getOrCreate()

-  The following gave the best performance. I don't know if enabling CBO
did much.

> val spark = SparkSession.builder()
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.kryo.unsafe", "true")
>   .config("spark.sql.adaptive.enabled", "true")
>   .config("spark.sql.cbo.enabled", "true")
>   .config("spark.sql.cbo.joinReorder.dp.star.filter", "true")
>   .config("spark.sql.cbo.joinReorder.enabled", "true")
>   .config("spark.sql.cbo.planStats.enabled", "true")
>   .config("spark.sql.cbo.starSchemaDetection", "true")
>   .getOrCreate()


If you're running on more recent JDKs, you'll need to set "--add-opens"
flags for a few namespaces for "kryo.unsafe" to work.
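
A sketch of the kind of flags involved (the exact set depends on your JDK and
Spark build, so treat these namespaces as assumptions):

# Hedged sketch: typical --add-opens flags for Kryo's Unsafe-based serializer
# on JDK 11/17. Executors are launched as fresh JVMs, so options set here take
# effect there; for the driver, pass the same flags on the spark-submit
# command line (e.g. --driver-java-options).
from pyspark.sql import SparkSession

ADD_OPENS = " ".join([
    "--add-opens=java.base/java.lang=ALL-UNNAMED",
    "--add-opens=java.base/java.nio=ALL-UNNAMED",
    "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
    "--add-opens=java.base/java.util=ALL-UNNAMED",
])

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", ADD_OPENS)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.unsafe", "true")
    .getOrCreate()
)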



On Mon, May 16, 2022 at 12:55 PM Gavin Ray  wrote:

> Hi all,
>
> I've not got much experience with Spark, but have been reading the
> Catalyst and
> Datasources V2 code/tests to try to get a basic understanding.
>
> I'm interested in trying Catalyst's query planner + optimizer for queries
> spanning one-or-more JDBC sources.
>
> Somewhat unusually, I'd like to do this with as minimal latency as
> possible to
> see what the experience for standard line-of-business apps is like (~90/10
> read/write ratio).
> Few rows would be returned in the reads (something on the order of
> 1-to-1,000).
>
> My question is: What configuration settings would you want to use for
> something
> like this?
>
> I imagine that doing codegen/JIT compilation of the query plan might not be
> worth the cost, so maybe you'd want to disable that and do interpretation?
>
> And possibly you'd want to use query plan config/rules that reduce the time
> spent in planning, trading efficiency for latency?
>
> Does anyone know how you'd configure Spark to test something like this?
>
> Would greatly appreciate any input (even if it's "This is a bad idea and
> will
> never work well").
>
> Thank you =)
>


[Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-16 Thread Gavin Ray
Hi all,

I've not got much experience with Spark, but have been reading the Catalyst
and
Datasources V2 code/tests to try to get a basic understanding.

I'm interested in trying Catalyst's query planner + optimizer for queries
spanning one-or-more JDBC sources.

Somewhat unusually, I'd like to do this with as minimal latency as possible
to
see what the experience for standard line-of-business apps is like (~90/10
read/write ratio).
Few rows would be returned in the reads (something on the order of
1-to-1,000).

My question is: What configuration settings would you want to use for
something
like this?

I imagine that doing codegen/JIT compilation of the query plan might not be
worth the cost, so maybe you'd want to disable that and do interpretation?

And possibly you'd want to use query plan config/rules that reduce the time
spent in planning, trading efficiency for latency?

Does anyone know how you'd configure Spark to test something like this?

Would greatly appreciate any input (even if it's "This is a bad idea and
will
never work well").

Thank you =)


trouble using spark in kubernetes

2022-05-03 Thread Andreas Klos

Hello everyone,

I am trying to run a minimal example in my k8s cluster.

First, I cloned the petastorm GitHub repo: https://github.com/uber/petastorm

Second, I created a Docker image as follows:

FROM ubuntu:20.04
RUN apt-get update -qq
RUN apt-get install -qq -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get update -qq

RUN apt-get -qq install -y \
  build-essential \
  cmake \
  openjdk-8-jre-headless \
  git \
  python \
  python3-pip \
  python3.9 \
  python3.9-dev \
  python3.9-venv \
  virtualenv \
  wget \
  && rm -rf /var/lib/apt/lists/*
RUN wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2 -P /data/mnist/
RUN mkdir /petastorm
ADD setup.py /petastorm/
ADD README.rst /petastorm/
ADD petastorm /petastorm/petastorm
RUN python3.9 -m pip install pip --upgrade
RUN python3.9 -m pip install wheel
RUN python3.9 -m venv /petastorm_venv3.9
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache scikit-build
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache -e /petastorm/[test,tf,torch,docs,opencv] --only-binary pyarrow --only-binary opencv-python
RUN /petastorm_venv3.9/bin/pip3.9 install -U pyarrow==3.0.0 numpy==1.19.3 tensorflow==2.5.0 pyspark==3.0.0
RUN /petastorm_venv3.9/bin/pip3.9 install opencv-python-headless
RUN rm -r /petastorm
ADD docker/run_in_venv.sh /

Afterwards, I create a namespace called spark in my k8s cluster, a
ServiceAccount (spark-driver), and a RoleBinding for the service account,
as follows:


kubectl create ns spark
kubectl create serviceaccount spark-driver
kubectl create rolebinding spark-driver-rb --clusterrole=cluster-admin --serviceaccount=spark:spark-driver


Finally I create a pod in the spark namespace as follows:

apiVersion: v1
kind: Pod
metadata:
  name: "petastorm-ds-creator"
  namespace: spark
  labels:
    app: "petastorm-ds-creator"
spec:
  serviceAccount: spark-driver
  containers:
  - name: petastorm-ds-creator
    image: "imagename"
    command:
    - "/bin/bash"
    - "-c"
    - "--"
    args:
    - "while true; do sleep 30; done;"
    resources:
      limits:
        cpu: 2000m
        memory: 5000Mi
      requests:
        cpu: 2000m
        memory: 5000Mi
    ports:
    - containerPort: 80
      name: http
    - containerPort: 443
      name: https
    - containerPort: 20022
      name: exposed
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: spark-geodata-nfs-pvc-20220503
  restartPolicy: Always

I expose port 20022 of the pod with a headless service

kubectl expose pod petastorm-ds-creator --port=20022 --type=ClusterIP --cluster-ip=None -n spark


Finally, I run the following code in the created container/pod:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
spark_conf.setMaster("k8s://https://kubernetes.default:443")
spark_conf.setAppName("PetastormDsCreator")
spark_conf.set(
    "spark.kubernetes.namespace",
    "spark"
)
spark_conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName",
    "spark-driver"
)
spark_conf.set(
    "spark.kubernetes.authenticate.caCertFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
)
spark_conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/token"
)
spark_conf.set(
    "spark.executor.instances",
    "2"
)
spark_conf.set(
    "spark.driver.host",
    "petastorm-ds-creator"
)
spark_conf.set(
    "spark.driver.port",
    "20022"
)
spark_conf.set(
    "spark.kubernetes.container.image",
    "imagename"
)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)

Unfortunately, it does not work.

With kubectl describe po podname-exec-1 I get the following error message:

Error: failed to start container "spark-kubernetes-executor": Error 
response from daemon: OCI runtime create failed: container_linux.go:349: 
starting container process caused "exec: \"executor\": executable file 
not found in $PATH": unknown


Could somebody give me a hint about what I am doing wrong? Is my SparkSession
configuration incorrect?
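
One guess I have not yet verified: the executor pods are started from the same
petastorm image, which does not contain a Spark distribution or Spark's
entrypoint script, and that might be what the "executable file not found"
error is about. A minimal sketch of what I would try (the image name
"my-repo/spark-py:3.0.0" is a placeholder for an image built from the official
Spark distribution):

# keep the petastorm image for the driver pod, but give executors a Spark image
spark_conf.set("spark.kubernetes.driver.container.image", "imagename")
spark_conf.set("spark.kubernetes.executor.container.image", "my-repo/spark-py:3.0.0")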


Best regards

Andreas


Re: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

2022-02-14 Thread Saurabh Gulati
Hey Karan,
you can get the jar from 
here<https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#non-dataproc_clusters>
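
A minimal sketch of wiring it up once you have the jar (the jar path, the key
file, and the bucket below are placeholders; these are the standard GCS
connector settings, not a verified config for your cluster):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs_connector_sketch")
    # shaded gcs-connector jar downloaded from the link above (path is a placeholder)
    .config("spark.jars", "/path/to/gcs-connector-hadoop3-latest.jar")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account-key.json")
    .getOrCreate()
)

df = spark.read.parquet("gs://some-bucket/some/path")  # placeholder bucket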

From: karan alang 
Sent: 13 February 2022 20:08
To: Gourav Sengupta 
Cc: Holden Karau ; Mich Talebzadeh 
; user @spark 
Subject: [EXTERNAL] Re: Unable to access Google buckets using spark-submit


Hi Gaurav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster .. This 
is more for dev/testing.
I can run a -- 'gcloud dataproc jobs submit' command as well, which is what 
will be done in Production.

Hope that clarifies.

regds,
Karan Alang


On Sat, Feb 12, 2022 at 10:31 PM Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>> wrote:
Hi,

agree with Holden, have faced quite a few issues with FUSE.

Also trying to understand "spark-submit from local" . Are you submitting your 
SPARK jobs from a local laptop or in local mode from a GCP dataproc / system?

If you are submitting the job from your local laptop, there will be performance 
bottlenecks I guess based on the internet bandwidth and volume of data.

Regards,
Gourav


On Sat, Feb 12, 2022 at 7:12 PM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
You can also put the GS access jar with your Spark jars — that’s what the class 
not found exception is pointing you towards.

On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
BTW I also answered you in in stackoverflow :

https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit<https://urldefense.com/v3/__https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4BmPwNAw$>


HTH


 
   view my Linkedin 
profile<https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4pVNnS44$>


 
https://en.everybodywiki.com/Mich_Talebzadeh<https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4hPaytxY$>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
You are trying to access a Google storage bucket gs:// from your local host.

It does not see it because spark-submit assumes that it is a local file system 
on the host which is not.

You need to mount gs:// bucket as a local file system.

You can use the tool called gcsfuse 
https://cloud.google.com/storage/docs/gcs-fuse<https://urldefense.com/v3/__https://cloud.google.com/storage/docs/gcs-fuse__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4fYDEO3c$>
 . Cloud Storage FUSE is an open source 
FUSE<https://urldefense.com/v3/__http://fuse.sourceforge.net/__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4H2bW-18$>
 adapter that allows you to mount Cloud Storage buckets as file systems on 
Linux or macOS systems. You can download gcsfuse from 
here<https://urldefense.com/v3/__https://github.com/GoogleCloudPlatform/gcsfuse__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4Y3qR8x0$>


Pretty simple.


It will be installed as /usr/bin/gcsfuse and you can mount it by creating a 
local mount file like /mnt/gs as root and give permission to others to use it.


As a normal user that needs to access gs:// bucket (not as root), use gcsfuse 
to mount it. For example I am mounting a gcs bucket called spark-jars-karan here


Just use the bucket name itself


gcsfuse spark-jars-karan /mnt/gs


Then you can refer to it as /mnt/gs in spark-submit from on-premise host

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 --jars 
/mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar

HTH

 
   view my Linkedin 
profile<https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!AhGNFqKB8wRZstQ!U0djZv5WOzsCdTjYEf-0_JLfIAAZN--SIjd7qnXKMviwDhKwI9JEVrG7aPr4pVNnS44$>



Dis

Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Gaurav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster ..
This is more for dev/testing.
I can run a -- 'gcloud dataproc jobs submit' command as well, which is what
will be done in Production.

Hope that clarifies.

regds,
Karan Alang


On Sat, Feb 12, 2022 at 10:31 PM Gourav Sengupta 
wrote:

> Hi,
>
> agree with Holden, have faced quite a few issues with FUSE.
>
> Also trying to understand "spark-submit from local" . Are you submitting
> your SPARK jobs from a local laptop or in local mode from a GCP dataproc /
> system?
>
> If you are submitting the job from your local laptop, there will be
> performance bottlenecks I guess based on the internet bandwidth and volume
> of data.
>
> Regards,
> Gourav
>
>
> On Sat, Feb 12, 2022 at 7:12 PM Holden Karau  wrote:
>
>> You can also put the GS access jar with your Spark jars — that’s what the
>> class not found exception is pointing you towards.
>>
>> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> BTW I also answered you in in stackoverflow :
>>>
>>>
>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>>> wrote:
>>>
>>>> You are trying to access a Google storage bucket gs:// from your local
>>>> host.
>>>>
>>>> It does not see it because spark-submit assumes that it is a local file
>>>> system on the host which is not.
>>>>
>>>> You need to mount gs:// bucket as a local file system.
>>>>
>>>> You can use the tool called gcsfuse
>>>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>>>> an open source FUSE <http://fuse.sourceforge.net/> adapter that allows
>>>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>>>> systems. You can download gcsfuse from here
>>>> <https://github.com/GoogleCloudPlatform/gcsfuse>
>>>>
>>>>
>>>> Pretty simple.
>>>>
>>>>
>>>> It will be installed as /usr/bin/gcsfuse and you can mount it by
>>>> creating a local mount file like /mnt/gs as root and give permission to
>>>> others to use it.
>>>>
>>>>
>>>> As a normal user that needs to access gs:// bucket (not as root), use
>>>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>>>> spark-jars-karan here
>>>>
>>>>
>>>> Just use the bucket name itself
>>>>
>>>>
>>>> gcsfuse spark-jars-karan /mnt/gs
>>>>
>>>>
>>>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>>>
>>>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>>>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>>>
>>>> HTH
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, 12 Feb 2022 at 04:31, karan alang 
>>>> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I'm trying to access gcp buckets while running spark-submit from
>>>>> local, and running into issues.
>>>>>
>>>>> I'm getting error :
>>>>> ```
>>>>>
>>>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>>>> library for your platform... using builtin-java classes where applicable
>>>>> Exception in thread "main" 
>>>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>>>> scheme "gs"
>>>>>
>>>>> ```
>>>>> I tried adding the --conf
>>>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>>>
>>>>> to the spark-submit command, but getting ClassNotFoundException
>>>>>
>>>>> Details are in stackoverflow :
>>>>>
>>>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>>>
>>>>> Any ideas on how to fix this ?
>>>>> tia !
>>>>>
>>>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Holden,

when you mention - GS Access jar -  which jar is this ?
Can you pls clarify ?

thanks,
Karan Alang

On Sat, Feb 12, 2022 at 11:10 AM Holden Karau  wrote:

> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you towards.
>
> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> BTW I also answered you in in stackoverflow :
>>
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>> wrote:
>>
>>> You are trying to access a Google storage bucket gs:// from your local
>>> host.
>>>
>>> It does not see it because spark-submit assumes that it is a local file
>>> system on the host which is not.
>>>
>>> You need to mount gs:// bucket as a local file system.
>>>
>>> You can use the tool called gcsfuse
>>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>>> an open source FUSE <http://fuse.sourceforge.net/> adapter that allows
>>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>>> systems. You can download gcsfuse from here
>>> <https://github.com/GoogleCloudPlatform/gcsfuse>
>>>
>>>
>>> Pretty simple.
>>>
>>>
>>> It will be installed as /usr/bin/gcsfuse and you can mount it by
>>> creating a local mount file like /mnt/gs as root and give permission to
>>> others to use it.
>>>
>>>
>>> As a normal user that needs to access gs:// bucket (not as root), use
>>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>>> spark-jars-karan here
>>>
>>>
>>> Just use the bucket name itself
>>>
>>>
>>> gcsfuse spark-jars-karan /mnt/gs
>>>
>>>
>>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>>
>>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>>
>>> HTH
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>>
>>>> Hello All,
>>>>
>>>> I'm trying to access gcp buckets while running spark-submit from local,
>>>> and running into issues.
>>>>
>>>> I'm getting error :
>>>> ```
>>>>
>>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>>> library for your platform... using builtin-java classes where applicable
>>>> Exception in thread "main" 
>>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>>> scheme "gs"
>>>>
>>>> ```
>>>> I tried adding the --conf
>>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>>
>>>> to the spark-submit command, but getting ClassNotFoundException
>>>>
>>>> Details are in stackoverflow :
>>>>
>>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>>
>>>> Any ideas on how to fix this ?
>>>> tia !
>>>>
>>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Thanks, Mich - will check this and update.

regds,
Karan Alang

On Sat, Feb 12, 2022 at 1:57 AM Mich Talebzadeh 
wrote:

> BTW I also answered you in in stackoverflow :
>
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
> wrote:
>
>> You are trying to access a Google storage bucket gs:// from your local
>> host.
>>
>> It does not see it because spark-submit assumes that it is a local file
>> system on the host which is not.
>>
>> You need to mount gs:// bucket as a local file system.
>>
>> You can use the tool called gcsfuse
>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>> an open source FUSE <http://fuse.sourceforge.net/> adapter that allows
>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>> systems. You can download gcsfuse from here
>> <https://github.com/GoogleCloudPlatform/gcsfuse>
>>
>>
>> Pretty simple.
>>
>>
>> It will be installed as /usr/bin/gcsfuse and you can mount it by creating
>> a local mount file like /mnt/gs as root and give permission to others to
>> use it.
>>
>>
>> As a normal user that needs to access gs:// bucket (not as root), use
>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>> spark-jars-karan here
>>
>>
>> Just use the bucket name itself
>>
>>
>> gcsfuse spark-jars-karan /mnt/gs
>>
>>
>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>
>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>
>> HTH
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>
>>> Hello All,
>>>
>>> I'm trying to access gcp buckets while running spark-submit from local,
>>> and running into issues.
>>>
>>> I'm getting error :
>>> ```
>>>
>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>> library for your platform... using builtin-java classes where applicable
>>> Exception in thread "main" 
>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>> scheme "gs"
>>>
>>> ```
>>> I tried adding the --conf
>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>
>>> to the spark-submit command, but getting ClassNotFoundException
>>>
>>> Details are in stackoverflow :
>>>
>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>
>>> Any ideas on how to fix this ?
>>> tia !
>>>
>>>


Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread Mich Talebzadeh
 Putting the GS access jar with the Spark jars may technically resolve the
spark-submit issue, but creating a local copy of jar files is not a
recommended practice.

The approach that the thread owner adopted by putting the files in a Google
Cloud bucket is correct. Indeed this is what he states, and I quote from
Stack Overflow: "I'm trying to access google buckets, when using spark-submit
and running into issues. What needs to be done to debug/fix this".

Hence the approach adopted is correct. He has created a bucket in GCP
called gs://spark-jars-karan/ and wants to access it. I presume he wants
to test it *locally* (on prem, I assume), so he just needs to be able to
access the bucket in GCP remotely. The recommendation of using gcsfuse to
resolve this issue is sound.

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Feb 2022 at 19:10, Holden Karau  wrote:

> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you towards.
>
> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> BTW I also answered you in stackoverflow :
>>
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>> wrote:
>>
>>> You are trying to access a Google storage bucket gs:// from your local
>>> host.
>>>
>>> It does not see it because spark-submit assumes that it is a local file
>>> system on the host which is not.
>>>
>>> You need to mount gs:// bucket as a local file system.
>>>
>>> You can use the tool called gcsfuse
>>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>>> an open source FUSE <http://fuse.sourceforge.net/> adapter that allows
>>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>>> systems. You can download gcsfuse from here
>>> <https://github.com/GoogleCloudPlatform/gcsfuse>
>>>
>>>
>>> Pretty simple.
>>>
>>>
>>> It will be installed as /usr/bin/gcsfuse and you can mount it by
>>> creating a local mount file like /mnt/gs as root and give permission to
>>> others to use it.
>>>
>>>
>>> As a normal user that needs to access gs:// bucket (not as root), use
>>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>>> spark-jars-karan here
>>>
>>>
>>> Just use the bucket name itself
>>>
>>>
>>> gcsfuse spark-jars-karan /mnt/gs
>>>
>>>
>>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>>
>>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>>
>>> HTH
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>>
>>>> Hello All,
>>>>

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Gourav Sengupta
Hi,

agree with Holden, have faced quite a few issues with FUSE.

Also trying to understand "spark-submit from local". Are you submitting
your SPARK jobs from a local laptop, or in local mode from a GCP dataproc /
system?

If you are submitting the job from your local laptop, there will be
performance bottlenecks I guess based on the internet bandwidth and volume
of data.

Regards,
Gourav


On Sat, Feb 12, 2022 at 7:12 PM Holden Karau  wrote:

> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you towards.
>
> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> BTW I also answered you in stackoverflow :
>>
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>> wrote:
>>
>>> You are trying to access a Google storage bucket gs:// from your local
>>> host.
>>>
>>> It does not see it because spark-submit assumes that it is a local file
>>> system on the host which is not.
>>>
>>> You need to mount gs:// bucket as a local file system.
>>>
>>> You can use the tool called gcsfuse
>>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>>> an open source FUSE <http://fuse.sourceforge.net/> adapter that allows
>>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>>> systems. You can download gcsfuse from here
>>> <https://github.com/GoogleCloudPlatform/gcsfuse>
>>>
>>>
>>> Pretty simple.
>>>
>>>
>>> It will be installed as /usr/bin/gcsfuse and you can mount it by
>>> creating a local mount file like /mnt/gs as root and give permission to
>>> others to use it.
>>>
>>>
>>> As a normal user that needs to access gs:// bucket (not as root), use
>>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>>> spark-jars-karan here
>>>
>>>
>>> Just use the bucket name itself
>>>
>>>
>>> gcsfuse spark-jars-karan /mnt/gs
>>>
>>>
>>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>>
>>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>>
>>> HTH
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>>
>>>> Hello All,
>>>>
>>>> I'm trying to access gcp buckets while running spark-submit from local,
>>>> and running into issues.
>>>>
>>>> I'm getting error :
>>>> ```
>>>>
>>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>>> library for your platform... using builtin-java classes where applicable
>>>> Exception in thread "main" 
>>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>>> scheme "gs"
>>>>
>>>> ```
>>>> I tried adding the --conf
>>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>>
>>>> to the spark-submit command, but getting ClassNotFoundException
>>>>
>>>> Details are in stackoverflow :
>>>>
>>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>>
>>>> Any ideas on how to fix this ?
>>>> tia !
>>>>
>>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the
class not found exception is pointing you towards.
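
For illustration, a rough sketch of that route: either copy the GCS connector jar in with the Spark jars, or ship it per job. The jar name, paths and key file below are placeholders (not taken from this thread); the Hadoop property names follow the GCS connector documentation.

```
# Option 1: drop the connector next to the other Spark jars
cp gcs-connector-hadoop3-latest.jar $SPARK_HOME/jars/

# Option 2: ship it per job and point the gs:// scheme at it
spark-submit \
  --jars /path/to/gcs-connector-hadoop3-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  your_job.py
```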

On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh 
wrote:

> BTW I also answered you in stackoverflow :
>
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
> wrote:
>
>> You are trying to access a Google storage bucket gs:// from your local
>> host.
>>
>> It does not see it because spark-submit assumes that it is a local file
>> system on the host which is not.
>>
>> You need to mount gs:// bucket as a local file system.
>>
>> You can use the tool called gcsfuse
>> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is
>> an open source FUSE <http://fuse.sourceforge.net/> adapter that allows
>> you to mount Cloud Storage buckets as file systems on Linux or macOS
>> systems. You can download gcsfuse from here
>> <https://github.com/GoogleCloudPlatform/gcsfuse>
>>
>>
>> Pretty simple.
>>
>>
>> It will be installed as /usr/bin/gcsfuse and you can mount it by creating
>> a local mount file like /mnt/gs as root and give permission to others to
>> use it.
>>
>>
>> As a normal user that needs to access gs:// bucket (not as root), use
>> gcsfuse to mount it. For example I am mounting a gcs bucket called
>> spark-jars-karan here
>>
>>
>> Just use the bucket name itself
>>
>>
>> gcsfuse spark-jars-karan /mnt/gs
>>
>>
>> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>>
>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>
>> HTH
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>
>>> Hello All,
>>>
>>> I'm trying to access gcp buckets while running spark-submit from local,
>>> and running into issues.
>>>
>>> I'm getting error :
>>> ```
>>>
>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>> library for your platform... using builtin-java classes where applicable
>>> Exception in thread "main" 
>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>> scheme "gs"
>>>
>>> ```
>>> I tried adding the --conf
>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>
>>> to the spark-submit command, but getting ClassNotFoundException
>>>
>>> Details are in stackoverflow :
>>>
>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>
>>> Any ideas on how to fix this ?
>>> tia !
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Mich Talebzadeh
BTW I also answered you in stackoverflow :

https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
wrote:

> You are trying to access a Google storage bucket gs:// from your local
> host.
>
> It does not see it because spark-submit assumes that it is a local file
> system on the host which is not.
>
> You need to mount gs:// bucket as a local file system.
>
> You can use the tool called gcsfuse
> https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is an
> open source FUSE <http://fuse.sourceforge.net/> adapter that allows you
> to mount Cloud Storage buckets as file systems on Linux or macOS systems.
> You can download gcsfuse from here
> <https://github.com/GoogleCloudPlatform/gcsfuse>
>
>
> Pretty simple.
>
>
> It will be installed as /usr/bin/gcsfuse and you can mount it by creating
> a local mount file like /mnt/gs as root and give permission to others to
> use it.
>
>
> As a normal user that needs to access gs:// bucket (not as root), use
> gcsfuse to mount it. For example I am mounting a gcs bucket called
> spark-jars-karan here
>
>
> Just use the bucket name itself
>
>
> gcsfuse spark-jars-karan /mnt/gs
>
>
> Then you can refer to it as /mnt/gs in spark-submit from on-premise host
>
> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>
> HTH
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>
>> Hello All,
>>
>> I'm trying to access gcp buckets while running spark-submit from local,
>> and running into issues.
>>
>> I'm getting error :
>> ```
>>
>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> Exception in thread "main" 
>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>> scheme "gs"
>>
>> ```
>> I tried adding the --conf
>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>
>> to the spark-submit command, but getting ClassNotFoundException
>>
>> Details are in stackoverflow :
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> Any ideas on how to fix this ?
>> tia !
>>
>>


Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Mich Talebzadeh
You are trying to access a Google storage bucket gs:// from your local host.

It does not see it because spark-submit assumes that it is a local file
system on the host, which it is not.

You need to mount gs:// bucket as a local file system.

You can use the tool called gcsfuse
https://cloud.google.com/storage/docs/gcs-fuse . Cloud Storage FUSE is an
open source FUSE <http://fuse.sourceforge.net/> adapter that allows you to
mount Cloud Storage buckets as file systems on Linux or macOS systems. You
can download gcsfuse from here
<https://github.com/GoogleCloudPlatform/gcsfuse>


Pretty simple.


It will be installed as /usr/bin/gcsfuse and you can mount it by creating a
local mount point like /mnt/gs as root and giving permission to others to use
it.
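
For example, a minimal sketch of preparing that mount point (the path and permission mode are illustrative):

```
sudo mkdir -p /mnt/gs
sudo chmod a+rwx /mnt/gs   # or restrict this to the group that runs Spark
```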


As a normal user that needs to access gs:// bucket (not as root), use
gcsfuse to mount it. For example I am mounting a gcs bucket called
spark-jars-karan here


Just use the bucket name itself


gcsfuse spark-jars-karan /mnt/gs


Then you can refer to it as /mnt/gs in spark-submit from on-premise host

spark-submit --packages
org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 --jars
/mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar

HTH

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:

> Hello All,
>
> I'm trying to access gcp buckets while running spark-submit from local,
> and running into issues.
>
> I'm getting error :
> ```
>
> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" 
> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme 
> "gs"
>
> ```
> I tried adding the --conf
> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>
> to the spark-submit command, but getting ClassNotFoundException
>
> Details are in stackoverflow :
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> Any ideas on how to fix this ?
> tia !
>
>


Unable to access Google buckets using spark-submit

2022-02-11 Thread karan alang
Hello All,

I'm trying to access gcp buckets while running spark-submit from local, and
running into issues.

I'm getting error :
```

22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
Exception in thread "main"
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for
scheme "gs"

```
I tried adding the --conf
spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

to the spark-submit command, but getting ClassNotFoundException

Details are in stackoverflow :
https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

Any ideas on how to fix this ?
tia !


Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Mich Talebzadeh
I am not sure about that. However, with Kubernetes and docker image for
PySpark, I build the packages into the image itself as below in the
dockerfile

RUN pip install pyyaml numpy cx_Oracle

and that will add those packages that you can reference in your py script

import yaml
import cx_Oracle

HTH







   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 24 Nov 2021 at 17:44, Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:

> Can we add Python dependencies as we can do for mvn coordinates? So that
> we run sth like pip install  or download from pypi index?
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Mittwoch, 24. November 2021 18:28
> *Cc:* user@spark.apache.org
> *Subject:* Re: [issue] not able to add external libs to pyspark job while
> using spark-submit
>
>
>
> The easiest way to set this up is to create dependencies.zip file.
>
>
>
> Assuming that you have a virtual environment already set-up, where there
> is directory called site-packages, go to that directory and just create a
> minimal a shell script  say package_and_zip_dependencies.sh to do it for
> you
>
>
>
> Example:
>
>
>
> cat package_and_zip_dependencies.sh
>
>
>
> #!/bin/bash
>
> # https://blog.danielcorin.com/posts/2015-11-09-pyspark/
>
> zip -r ../dependencies.zip .
>
> ls -l ../dependencies.zip
>
> exit 0
>
>
>
> Once created, create an environment variable called DEPENDENCIES
>
>
>
> export DEPENDENCIES="/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip"
>
>
>
> Then in spark-submit you can do this
>
>
>
> spark-submit --master yarn --deploy-mode client --driver-memory xG
> --executor-memory yG --num-executors m --executor-cores n --py-files
> $DEPENDENCIES --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar
>
>
>
> Also check this link as well
> https://blog.danielcorin.com/posts/2015-11-09-pyspark/
>
>
>
> HTH
>
>
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 24 Nov 2021 at 14:03, Atheer Alabdullatif 
> wrote:
>
> Dear Spark team,
>
> hope my email finds you well
>
>
>
>
>
> I am using pyspark 3.0 and facing an issue with adding external library
> [configparser] while running the job using [spark-submit] & [yarn]
>
> issue:
>
>
>
> import configparser
>
> ImportError: No module named configparser
>
> 21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called
>
> solutions I tried:
>
> 1- installing library src files and adding it to the session using
> [addPyFile]:
>
>- files structu

RE: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Bode, Meikel, NMA-CFD
Can we add Python dependencies as we can do for mvn coordinates? So that we run 
sth like pip install  or download from pypi index?

From: Mich Talebzadeh 
Sent: Mittwoch, 24. November 2021 18:28
Cc: user@spark.apache.org
Subject: Re: [issue] not able to add external libs to pyspark job while using 
spark-submit

The easiest way to set this up is to create dependencies.zip file.

Assuming that you have a virtual environment already set-up, where there is 
directory called site-packages, go to that directory and just create a minimal 
a shell script  say package_and_zip_dependencies.sh to do it for you

Example:

cat package_and_zip_dependencies.sh

#!/bin/bash
# 
https://blog.danielcorin.com/posts/2015-11-09-pyspark/
zip -r ../dependencies.zip .
ls -l ../dependencies.zip
exit 0

Once created, create an environment variable called DEPENDENCIES

export DEPENDENCIES="/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip"

Then in spark-submit you can do this

spark-submit --master yarn --deploy-mode client --driver-memory xG 
--executor-memory yG --num-executors m --executor-cores n --py-files 
$DEPENDENCIES --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar

Also check this link as well  
https://blog.danielcorin.com/posts/2015-11-09-pyspark/

HTH



 
   view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Wed, 24 Nov 2021 at 14:03, Atheer Alabdullatif 
mailto:a.alabdulla...@lean.sa>> wrote:
Dear Spark team,
hope my email finds you well



I am using pyspark 3.0 and facing an issue with adding external library 
[configparser] while running the job using [spark-submit] & [yarn]

issue:



import configparser

ImportError: No module named configparser

21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called

solutions I tried:

1- installing library src files and adding it to the session using [addPyFile]:

  *   files structure:

-- main dir

   -- subdir

  -- libs

 -- configparser-5.1.0

-- src

   -- configparser.py

 -- configparser.zip

  -- sparkjob.py

1.a zip file:

spark = SparkSession.builder.appName(jobname + '_' + table).config(

"spark.mongodb.input.uri", uri +

"." +

table +

"").config(

"spark.mongodb.input.sampleSize",

990).getOrCreate()



spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')

df = spark.read.format("mongo").load()

1.b python file

spark = SparkSession.builder.appName(jobname + '_' + table).config(

"spark.mongodb.input.uri", uri +

"." +

table +

"").config(

"spark.mongodb.input.sampleSize",

990).getOrCreate()



spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')

df = spark.read.format("mongo").load()



2- using os library

def install_libs():

'''

this function used to install external python libs in yarn

'''

os.system("pip3 install configparser")




Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Mich Talebzadeh
The easiest way to set this up is to create dependencies.zip file.

Assuming that you have a virtual environment already set up, where there is a
directory called site-packages, go to that directory and just create a
minimal shell script, say package_and_zip_dependencies.sh, to do it for you

Example:

cat package_and_zip_dependencies.sh

#!/bin/bash
# https://blog.danielcorin.com/posts/2015-11-09-pyspark/
zip -r ../dependencies.zip .
ls -l ../dependencies.zip
exit 0

Once created, create an environment variable called DEPENDENCIES

export DEPENDENCIES="/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip"

Then in spark-submit you can do this

spark-submit --master yarn --deploy-mode client --driver-memory xG
--executor-memory yG --num-executors m --executor-cores n --py-files
$DEPENDENCIES --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar

Also check this link as well
https://blog.danielcorin.com/posts/2015-11-09-pyspark/
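
As a quick sanity check, here is a minimal sketch of a job that imports a package shipped inside dependencies.zip (the yaml package is only an example of something assumed to be present in that site-packages directory):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deps_check").getOrCreate()

def uses_dependency(x):
    import yaml  # resolved from dependencies.zip shipped via --py-files
    return yaml.safe_dump({"value": x})

print(spark.sparkContext.parallelize([1, 2, 3]).map(uses_dependency).collect())
spark.stop()
```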

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 24 Nov 2021 at 14:03, Atheer Alabdullatif 
wrote:

> Dear Spark team,
> hope my email finds you well
>
>
> I am using pyspark 3.0 and facing an issue with adding external library
> [configparser] while running the job using [spark-submit] & [yarn]
>
> issue:
>
>
> import configparser
> ImportError: No module named configparser21/11/24 08:54:38 INFO 
> util.ShutdownHookManager: Shutdown hook called
>
> solutions I tried:
>
> 1- installing library src files and adding it to the session using
> [addPyFile]:
>
>
>- files structure:
>
> -- main dir
>-- subdir
>   -- libs
>  -- configparser-5.1.0
> -- src
>-- configparser.py
>  -- configparser.zip
>   -- sparkjob.py
>
> 1.a zip file:
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
> df = spark.read.format("mongo").load()
>
> 1.b python file
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
> df = spark.read.format("mongo").load()
>
>
> 2- using os library
>
> def install_libs():
> '''
> this function used to install external python libs in yarn
> '''
> os.system("pip3 install configparser")
> if __name__ == "__main__":
>
> # install libs
> install_libs()
>
>
> we value your support
>
> best,
>
> Atheer Alabdullatif
>
>
>
>
>
>
> إشعار السرية وإخلاء المسؤولية
> هذه الرسالة ومرفقاتها معدة لاستخدام المُرسل إليه المقصود بالرسالة فقط وقد
> تحتوي على معلومات سرية أو محمية قانونياً، إن لم تكن الشخص المقصود فنرجو
> إخطار المُرسل فوراً عن طريق الرد على هذا البريد الإلكتروني وحذف الرسالة من
> البريد الإلكتروني، وعدم إبقاء نسخ منه،  لا يجوز استخدام أو عرض أو نشر
> المحتوى سواء بشكل مباشر أو غير مباشر دون موافقة خطية مسبقة، لا تتحمل شركة
> لين مسؤولية الأضرار الناتجة عن أي فيروسات قد تحملها هذه الرسالة.
>
>
>
> **Confidentiality & Disclaimer Notice**
> This e-mail message, including any attachments, is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information or otherwise protected by law. If you are not the intended
> recipient, please immediately notify the sender, delete the e-mail, and do
> not retain any copies of it. It is prohibited to use, disseminate or
> distribute the content of this e-mail, directly or indirectly, without
> prior written consent. Lean accepts no liability for damage caused by any
> virus that may be transmitted by this Email.
>
>
>
>
>


Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Atheer Alabdullatif
Hello Owen,
Thank you for your prompt reply!
We will check it out.

best,
Atheer Alabdullatif

From: Sean Owen 
Sent: Wednesday, November 24, 2021 5:06 PM
To: Atheer Alabdullatif 
Cc: user@spark.apache.org ; Data Engineering 

Subject: Re: [issue] not able to add external libs to pyspark job while using 
spark-submit

That's not how you add a library. From the docs: 
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html

On Wed, Nov 24, 2021 at 8:02 AM Atheer Alabdullatif 
mailto:a.alabdulla...@lean.sa>> wrote:
Dear Spark team,
hope my email finds you well



I am using pyspark 3.0 and facing an issue with adding external library 
[configparser] while running the job using [spark-submit] & [yarn]

issue:


import configparser
ImportError: No module named configparser
21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called

solutions I tried:

1- installing library src files and adding it to the session using [addPyFile]:

  *   files structure:

-- main dir
   -- subdir
  -- libs
 -- configparser-5.1.0
-- src
   -- configparser.py
 -- configparser.zip
  -- sparkjob.py

1.a zip file:

spark = SparkSession.builder.appName(jobname + '_' + table).config(
"spark.mongodb.input.uri", uri +
"." +
table +
"").config(
"spark.mongodb.input.sampleSize",
990).getOrCreate()

spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
df = spark.read.format("mongo").load()

1.b python file

spark = SparkSession.builder.appName(jobname + '_' + table).config(
"spark.mongodb.input.uri", uri +
"." +
table +
"").config(
"spark.mongodb.input.sampleSize",
990).getOrCreate()

spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
df = spark.read.format("mongo").load()


2- using os library

def install_libs():
'''
this function used to install external python libs in yarn
'''
os.system("pip3 install configparser")

if __name__ == "__main__":

# install libs
install_libs()


we value your support

best,

Atheer Alabdullatif






*إشعار السرية وإخلاء المسؤولية*
هذه الرسالة ومرفقاتها معدة لاستخدام المُرسل إليه المقصود بالرسالة فقط وقد تحتوي 
على معلومات سرية أو محمية قانونياً، إن لم تكن الشخص المقصود فنرجو إخطار المُرسل 
فوراً عن طريق الرد على هذا البريد الإلكتروني وحذف الرسالة من  البريد 
الإلكتروني، وعدم إبقاء نسخ منه،  لا يجوز استخدام أو عرض أو نشر المحتوى سواء 
بشكل مباشر أو غير مباشر دون موافقة خطية مسبقة، لا تتحمل شركة لين مسؤولية 
الأضرار الناتجة عن أي فيروسات قد تحملها هذه الرسالة.



*Confidentiality & Disclaimer Notice*
This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and may contain confidential and privileged information 
or otherwise protected by law. If you are not the intended recipient, please 
immediately notify the sender, delete the e-mail, and do not retain any copies 
of it. It is prohibited to use, disseminate or distribute the content of this 
e-mail, directly or indirectly, without prior written consent. Lean accepts no 
liability for damage caused by any virus that may be transmitted by this Email.






Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Sean Owen
That's not how you add a library. From the docs:
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
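
For reference, a condensed sketch of the virtualenv approach described on that page (paths and the extra venv-pack install are illustrative):

```
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install configparser venv-pack
venv-pack -o pyspark_venv.tar.gz

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --master yarn --archives pyspark_venv.tar.gz#environment sparkjob.py
```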

On Wed, Nov 24, 2021 at 8:02 AM Atheer Alabdullatif 
wrote:

> Dear Spark team,
> hope my email finds you well
>
>
> I am using pyspark 3.0 and facing an issue with adding external library
> [configparser] while running the job using [spark-submit] & [yarn]
>
> issue:
>
>
> import configparser
> ImportError: No module named configparser21/11/24 08:54:38 INFO 
> util.ShutdownHookManager: Shutdown hook called
>
> solutions I tried:
>
> 1- installing library src files and adding it to the session using
> [addPyFile]:
>
>
>- files structure:
>
> -- main dir
>-- subdir
>   -- libs
>  -- configparser-5.1.0
> -- src
>-- configparser.py
>  -- configparser.zip
>   -- sparkjob.py
>
> 1.a zip file:
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
> df = spark.read.format("mongo").load()
>
> 1.b python file
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
> df = spark.read.format("mongo").load()
>
>
> 2- using os library
>
> def install_libs():
> '''
> this function used to install external python libs in yarn
> '''
> os.system("pip3 install configparser")
> if __name__ == "__main__":
>
> # install libs
> install_libs()
>
>
> we value your support
>
> best,
>
> Atheer Alabdullatif
>
>
>
>
>
>
> إشعار السرية وإخلاء المسؤولية
> هذه الرسالة ومرفقاتها معدة لاستخدام المُرسل إليه المقصود بالرسالة فقط وقد
> تحتوي على معلومات سرية أو محمية قانونياً، إن لم تكن الشخص المقصود فنرجو
> إخطار المُرسل فوراً عن طريق الرد على هذا البريد الإلكتروني وحذف الرسالة من
> البريد الإلكتروني، وعدم إبقاء نسخ منه،  لا يجوز استخدام أو عرض أو نشر
> المحتوى سواء بشكل مباشر أو غير مباشر دون موافقة خطية مسبقة، لا تتحمل شركة
> لين مسؤولية الأضرار الناتجة عن أي فيروسات قد تحملها هذه الرسالة.
>
>
>
> **Confidentiality & Disclaimer Notice**
> This e-mail message, including any attachments, is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information or otherwise protected by law. If you are not the intended
> recipient, please immediately notify the sender, delete the e-mail, and do
> not retain any copies of it. It is prohibited to use, disseminate or
> distribute the content of this e-mail, directly or indirectly, without
> prior written consent. Lean accepts no liability for damage caused by any
> virus that may be transmitted by this Email.
>
>
>
>
>


[issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Atheer Alabdullatif
Dear Spark team,
hope my email finds you well



I am using pyspark 3.0 and facing an issue with adding external library 
[configparser] while running the job using [spark-submit] & [yarn]

issue:


import configparser
ImportError: No module named configparser
21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called

solutions I tried:

1- installing library src files and adding it to the session using [addPyFile]:

  *   files structure:

-- main dir
   -- subdir
  -- libs
 -- configparser-5.1.0
-- src
   -- configparser.py
 -- configparser.zip
  -- sparkjob.py

1.a zip file:

spark = SparkSession.builder.appName(jobname + '_' + table).config(
"spark.mongodb.input.uri", uri +
"." +
table +
"").config(
"spark.mongodb.input.sampleSize",
990).getOrCreate()

spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
df = spark.read.format("mongo").load()

1.b python file

spark = SparkSession.builder.appName(jobname + '_' + table).config(
"spark.mongodb.input.uri", uri +
"." +
table +
"").config(
"spark.mongodb.input.sampleSize",
990).getOrCreate()

spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
df = spark.read.format("mongo").load()


2- using os library

def install_libs():
'''
this function used to install external python libs in yarn
'''
os.system("pip3 install configparser")

if __name__ == "__main__":

# install libs
install_libs()


we value your support

best,

Atheer Alabdullatif




*Confidentiality & Disclaimer Notice*
This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and may contain confidential and privileged information 
or otherwise protected by law. If you are not the intended recipient, please 
immediately notify the sender, delete the e-mail, and do not retain any copies 
of it. It is prohibited to use, disseminate or distribute the content of this 
e-mail, directly or indirectly, without prior written consent. Lean accepts no 
liability for damage caused by any virus that may be transmitted by this Email.




Accessing a kerberized HDFS using Spark on Openshift

2021-10-13 Thread Gal Shinder
Hi,

I have a pod on openshift 4.6 running a jupyter notebook with spark 3.1.1 and 
python 3.7 (based on open data hub, tweaked the dockerfile because I wanted 
this specific python version).

I'm trying to run spark in client mode using the image of google's spark
operator (gcr.io/spark-operator/spark-py:v3.1.1). Spark runs fine, but I'm
unable to connect to a kerberized cloudera hdfs. I've tried the examples
outlined in the security documentation
(https://github.com/apache/spark/blob/master/docs/security.md#secure-interaction-with-kubernetes)
and numerous other combinations, but nothing seems to work.

I managed to authenticate with kerberos by passing additional java parameters
to the driver and executors (-Djava.security.krb5.conf), and by passing the
kerberos config to the executors using the configmap auto-generated from the
folder which SPARK_CONF points to on the driver. I'll try to pass the hadoop
configuration files like that as well and set the hadoop home just to test the
connection.

I don't want to use that solution in prod.
`spark.kubernetes.kerberos.krb5.configMapName` and
`spark.kubernetes.hadoop.configMapName` don't seem to do anything; the pod spec
of the executors doesn't have those volumes. I'm using
`spark.kubernetes.authenticate.oauthToken` to authenticate with k8s, and I'm
using a user who is a cluster admin.

I also don't want to get a delegation token; I figured I can just use the
keytab, even though the examples in the security documentation don't mention
using a keytab with the configmaps.

The configuration I'm trying to use:
spark.kubernetes.authenticate.oauthToken with the oauth token of a cluster 
admin.

spark.kubernetes.hadoop.configMapName pointing to a configmap containing the 
core-site.xml and hdfs-site.xml I got from the cloudera manager

spark.kubernetes.kerberos.krb5.configMapName pointing to a configmap containing 
a krb5.conf

spark.kerberos.keytab 

spark.kerberos.principal
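
In spark-submit form, the configuration above looks roughly like this (all values are placeholders, and this is only a restatement of the settings listed, not something that works for me yet):

```
spark-submit \
  --master k8s://https://<api-server>:<port> \
  --deploy-mode client \
  --conf spark.kubernetes.authenticate.oauthToken=<token> \
  --conf spark.kubernetes.hadoop.configMapName=<hadoop-conf-configmap> \
  --conf spark.kubernetes.kerberos.krb5.configMapName=<krb5-configmap> \
  --conf spark.kerberos.keytab=/path/to/user.keytab \
  --conf spark.kerberos.principal=<principal@REALM> \
  app.py
```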


Thanks, 
Gal

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to process S3 data in Scalable Manner Using Spark API (wholeTextFile VERY SLOW and NOT scalable)

2021-10-02 Thread Alchemist
Issue: We are using the wholeTextFile() API to read files from S3, but this API
is extremely slow for the reasons mentioned below. The question is how to fix
this issue?

Here is our analysis so far:

We are using Spark's wholeTextFile API to read S3 files. The API works in two
steps: first the driver/master lists all the S3 files, then it splits the list
of files and distributes them to the worker nodes and executors to process.

STEP 1. List all the S3 files in the given paths (we pass this path when we run
every single gw/device/app step). The issue is that every single batch of every
single report first lists the files. The main problem is that we are using S3,
where listing files in a bucket is single threaded: the S3 API for listing the
keys in a bucket only returns keys in chunks of 1000 per call, so for a million
files we are looking at 1000 single-threaded S3 API calls.

STEP 2. Control the number of splits, which depends on the number of input
partitions, and distribute the load to the worker nodes to process.
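
For illustration, a rough sketch of one possible workaround: list the keys up front with the S3 SDK and hand the resulting paths to the DataFrame reader. Bucket and prefix names are placeholders, and this is a sketch rather than validated code:

```
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_whole_files").getOrCreate()

# Build the path list ourselves (still paginated at 1000 keys per call,
# but done once, outside wholeTextFile's split planning).
s3 = boto3.client("s3")
paths = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="input/"):
    for obj in page.get("Contents", []):
        paths.append("s3a://my-bucket/" + obj["Key"])

# wholetext=True keeps one file per row, similar to wholeTextFile's value column.
df = spark.read.text(paths, wholetext=True)
print(df.count())
```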



Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Mich Talebzadeh
BTW, why the assumption that the thread owner is writing to the
cluster? The thrift server is running locally on localhost:1. I concur
that JDBC to a remote Hive is needed. However, this is not the impression I
get here.

df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://localhost:1/foundation;AuthMech=2;
UseNativeQuery=0")

There is some confusion somewhere!





   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Jul 2021 at 17:34, Daniel de Oliveira Mantovani <
daniel.oliveira.mantov...@gmail.com> wrote:

> From the Cloudera Documentation:
>
> https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf
>
> UseNativeQuery
>  1: The driver does not transform the queries emitted by applications, so
> the native query is used.
>  0: The driver transforms the queries emitted by applications and converts
> them into an equivalent form in HiveQL.
>
>
> Try to change the "NativeQuery" parameter and see if it works :)
>
> On Tue, Jul 20, 2021 at 1:26 PM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Insert mode is "overwrite", it shouldn't matter whether the table
>> already exists or not. The JDBC driver should be based on the Cloudera Hive
>> version, we can't know the CDH version he's using.
>>
>> On Tue, Jul 20, 2021 at 1:21 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> The driver is fine and latest and  it should work.
>>>
>>> I have asked the thread owner to send the DDL of the table and how the
>>> table is created. In this case JDBC from Spark expects the table to be
>>> there.
>>>
>>> The error below
>>>
>>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>>> processing query/statement. Error Code: 4, SQL state:
>>> TStatus(statusCode:ERROR_STATUS,
>>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>>> foreign key:28:27,
>>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329
>>>
>>> Sounds like a mismatch between the columns through Spark Dataframe and
>>> the underlying table.
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani <
>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>
 Badrinath is trying to write to a Hive in a cluster where he doesn't
 have permission to submit spark jobs, he doesn't have Hive/Spark metadata
 access.
 The only way to communicate with this third-party Hive cluster is
 through JDBC protocol.

 [ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]

 Who's creating this table is "Spark" because he's using "overwrite" in
 order to test it.

  df.write
   .format("jdbc")
   .option("url",
 "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
   .option("dbtable", "test.test")
   .option("user", "admin")
   .option("password", "admin")
   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
 *  .mode("overwrite")*
   .save

 This error is weird, looks like the third-party Hive server isn't able
 to recognize the SQL dialect coming from  [Spark Standalone] server
 JDBC driver.

 1) I would try to execute the create statement manually in this server
 2) if works try to run again with "append" option

 I would open a case with Cloudera and ask which driver you should use.

 Thanks



 On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
 wrote:

> As Mich mentioned, no need to use jdbc API, using the
> DataFrameWriter's saveAsTable method is the way to go.   JDBC Driver is 
> for
> a JDBC client (a Java client for instance) to access the Hive tables in
> Spark via the Thrift server interface.
>
> -- ND
>
> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>
> I have trying to create table in hive from spark itself,
>
> And usi

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
>From the Cloudera Documentation:
https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf

UseNativeQuery
 1: The driver does not transform the queries emitted by applications, so
the native query is used.
 0: The driver transforms the queries emitted by applications and converts
them into an equivalent form in HiveQL.


Try to change the "NativeQuery" parameter and see if it works :)
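
Since the failure is in the generated CREATE TABLE DDL, another knob that may be worth trying is the JDBC writer's createTableColumnTypes option, which overrides the default column types Spark emits. A sketch only, with illustrative host/port and column list, not a verified fix:

```
df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://<host>:<port>/foundation;AuthMech=2;UseNativeQuery=0")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .option("createTableColumnTypes", "first_name STRING, last_name STRING")
  .mode("overwrite")
  .save
```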

On Tue, Jul 20, 2021 at 1:26 PM Daniel de Oliveira Mantovani <
daniel.oliveira.mantov...@gmail.com> wrote:

> Insert mode is "overwrite", it shouldn't matter whether the table
> already exists or not. The JDBC driver should be based on the Cloudera Hive
> version, we can't know the CDH version he's using.
>
> On Tue, Jul 20, 2021 at 1:21 PM Mich Talebzadeh 
> wrote:
>
>> The driver is fine and latest and  it should work.
>>
>> I have asked the thread owner to send the DDL of the table and how the
>> table is created. In this case JDBC from Spark expects the table to be
>> there.
>>
>> The error below
>>
>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>> processing query/statement. Error Code: 4, SQL state:
>> TStatus(statusCode:ERROR_STATUS,
>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>> foreign key:28:27,
>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329
>>
>> Sounds like a mismatch between the columns through Spark Dataframe and
>> the underlying table.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani <
>> daniel.oliveira.mantov...@gmail.com> wrote:
>>
>>> Badrinath is trying to write to a Hive in a cluster where he doesn't
>>> have permission to submit spark jobs, he doesn't have Hive/Spark metadata
>>> access.
>>> The only way to communicate with this third-party Hive cluster is
>>> through JDBC protocol.
>>>
>>> [ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]
>>>
>>> Who's creating this table is "Spark" because he's using "overwrite" in
>>> order to test it.
>>>
>>>  df.write
>>>   .format("jdbc")
>>>   .option("url",
>>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>>   .option("dbtable", "test.test")
>>>   .option("user", "admin")
>>>   .option("password", "admin")
>>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>> *  .mode("overwrite")*
>>>   .save
>>>
>>> This error is weird, looks like the third-party Hive server isn't able
>>> to recognize the SQL dialect coming from  [Spark Standalone] server
>>> JDBC driver.
>>>
>>> 1) I would try to execute the create statement manually in this server
>>> 2) if works try to run again with "append" option
>>>
>>> I would open a case with Cloudera and ask which driver you should use.
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
>>> wrote:
>>>
 As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
 saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
 Java client for instance) to access the Hive tables in Spark via the Thrift
 server interface.

 -- ND

 On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:

 I have trying to create table in hive from spark itself,

 And using local mode it will work what I am trying here is from spark
 standalone I want to create the manage table in hive (another spark cluster
 basically CDH) using jdbc mode.

 When I try that below are the error I am facing.

 On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, <
 mich.talebza...@gmail.com> wrote:

> Have you created that table in Hive or are you trying to create it
> from Spark itself.
>
> You Hive is local. In this case you don't need a JDBC connection. Have
> you tried:
>
> df2.write.mode("overwrite").saveAsTable("mydb.mytable")
>
> HTH
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
Insert mode is "overwrite", it shouldn't matter whether the table
already exists or not. The JDBC driver should be based on the Cloudera Hive
version, we can't know the CDH version he's using.

On Tue, Jul 20, 2021 at 1:21 PM Mich Talebzadeh 
wrote:

> The driver is fine and latest and  it should work.
>
> I have asked the thread owner to send the DDL of the table and how the
> table is created. In this case JDBC from Spark expects the table to be
> there.
>
> The error below
>
> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
> processing query/statement. Error Code: 4, SQL state:
> TStatus(statusCode:ERROR_STATUS,
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
> compiling statement: FAILED: ParseException line 1:39 cannot recognize
> input near '"first_name"' 'TEXT' ',' in column name or primary key or
> foreign key:28:27,
> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329
>
> Sounds like a mismatch between the columns through Spark Dataframe and the
> underlying table.
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Badrinath is trying to write to a Hive in a cluster where he doesn't have
>> permission to submit spark jobs, he doesn't have Hive/Spark metadata
>> access.
>> The only way to communicate with this third-party Hive cluster is through
>> JDBC protocol.
>>
>> [ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]
>>
>> Who's creating this table is "Spark" because he's using "overwrite" in
>> order to test it.
>>
>>  df.write
>>   .format("jdbc")
>>   .option("url",
>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>   .option("dbtable", "test.test")
>>   .option("user", "admin")
>>   .option("password", "admin")
>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>> *  .mode("overwrite")*
>>   .save
>>
>> This error is weird, looks like the third-party Hive server isn't able to
>> recognize the SQL dialect coming from  [Spark Standalone] server JDBC
>> driver.
>>
>> 1) I would try to execute the create statement manually in this server
>> 2) if works try to run again with "append" option
>>
>> I would open a case with Cloudera and ask which driver you should use.
>>
>> Thanks
>>
>>
>>
>> On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
>> wrote:
>>
>>> As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
>>> saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
>>> Java client for instance) to access the Hive tables in Spark via the Thrift
>>> server interface.
>>>
>>> -- ND
>>>
>>> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>>>
>>> I have trying to create table in hive from spark itself,
>>>
>>> And using local mode it will work what I am trying here is from spark
>>> standalone I want to create the manage table in hive (another spark cluster
>>> basically CDH) using jdbc mode.
>>>
>>> When I try that below are the error I am facing.
>>>
>>> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Have you created that table in Hive or are you trying to create it from
 Spark itself.

 You Hive is local. In this case you don't need a JDBC connection. Have
 you tried:

 df2.write.mode("overwrite").saveAsTable(mydb.mytable)

 HTH




view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
 pbadrinath1...@gmail.com> wrote:

> Hi,
>
> Trying to write data in spark to the hive as JDBC mode below  is the
> sample code:
>
> spark standalone 2.4.7 version
>
> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes 
> where
> applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
> Spark context Web UI available at http://localh

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Mich Talebzadeh
The driver is fine and up to date, and it should work.

I have asked the thread owner to send the DDL of the table and how the
table is created. In this case JDBC from Spark expects the table to be
there.

The error below

java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR processing
query/statement. Error Code: 4, SQL state:
TStatus(statusCode:ERROR_STATUS,
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
compiling statement: FAILED: ParseException line 1:39 cannot recognize
input near '"first_name"' 'TEXT' ',' in column name or primary key or
foreign key:28:27,
org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329

Sounds like a mismatch between the columns of the Spark DataFrame and those of
the underlying table.
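
One quick way to check that from the Spark side, assuming the table already
exists on the server as this message presumes, is to read the existing table's
schema back through the same JDBC connection and compare it with the DataFrame
being written. A hedged Scala sketch for the spark-shell session quoted below;
the URL and credentials are placeholders, not the thread's real settings:

import spark.implicits._   // spark-shell session

// Stand-in for the DataFrame from the quoted snippet further down.
val df = Seq(("John", "Smith", "London"))
  .toDF("first_name", "last_name", "country")

// Schema of the existing table as reported through the Cloudera driver.
val existing = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://<host>:<port>/foundation;AuthMech=2;UseNativeQuery=0")  // placeholder
  .option("dbtable", "test.test")
  .option("user", "admin")        // placeholder
  .option("password", "admin")    // placeholder
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .load()

existing.printSchema()   // what the server says the table looks like
df.printSchema()         // what Spark is about to write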

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani <
daniel.oliveira.mantov...@gmail.com> wrote:

> Badrinath is trying to write to a Hive in a cluster where he doesn't have
> permission to submit spark jobs, he doesn't have Hive/Spark metadata
> access.
> The only way to communicate with this third-party Hive cluster is through
> JDBC protocol.
>
> [ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]
>
> Who's creating this table is "Spark" because he's using "overwrite" in
> order to test it.
>
>  df.write
>   .format("jdbc")
>   .option("url",
> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>   .option("dbtable", "test.test")
>   .option("user", "admin")
>   .option("password", "admin")
>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
> *  .mode("overwrite")*
>   .save
>
> This error is weird, looks like the third-party Hive server isn't able to
> recognize the SQL dialect coming from  [Spark Standalone] server JDBC
> driver.
>
> 1) I would try to execute the create statement manually in this server
> 2) if works try to run again with "append" option
>
> I would open a case with Cloudera and ask which driver you should use.
>
> Thanks
>
>
>
> On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
> wrote:
>
>> As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
>> saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
>> Java client for instance) to access the Hive tables in Spark via the Thrift
>> server interface.
>>
>> -- ND
>>
>> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>>
>> I have trying to create table in hive from spark itself,
>>
>> And using local mode it will work what I am trying here is from spark
>> standalone I want to create the manage table in hive (another spark cluster
>> basically CDH) using jdbc mode.
>>
>> When I try that below are the error I am facing.
>>
>> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
>> wrote:
>>
>>> Have you created that table in Hive or are you trying to create it from
>>> Spark itself.
>>>
>>> You Hive is local. In this case you don't need a JDBC connection. Have
>>> you tried:
>>>
>>> df2.write.mode("overwrite").saveAsTable(mydb.mytable)
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
>>> pbadrinath1...@gmail.com> wrote:
>>>
 Hi,

 Trying to write data in spark to the hive as JDBC mode below  is the
 sample code:

 spark standalone 2.4.7 version

 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
 setLogLevel(newLevel).
 Spark context Web UI available at http://localhost:4040
 Spark context available as 'sc' (master = spark://localhost:7077, app
 id = app-20210715080414-0817).
 Spark session available as 'spark'.
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
   /_/

 Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
 Type in expressions to have th

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
Badrinath is trying to write to Hive in a cluster where he doesn't have
permission to submit Spark jobs and has no Hive/Spark metadata access.
The only way to communicate with this third-party Hive cluster is through the
JDBC protocol.

[ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]

It is Spark that creates this table, because he's using "overwrite" in order
to test it.

 df.write
  .format("jdbc")
  .option("url",
"jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
*  .mode("overwrite")*
  .save

This error is odd; it looks like the third-party Hive server isn't able to
recognize the SQL dialect coming from the JDBC driver on the [Spark
Standalone] side.

1) I would try to execute the create statement manually on this server
2) if that works, try to run again with the "append" option
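
A minimal Scala sketch of that two-step workaround, for the spark-shell
session shown further down in the thread. The URL, credentials and DDL are
placeholders and would need to be adapted, and whether the Cloudera driver
accepts the resulting INSERTs still needs to be verified:

// Sketch only: create test.test once over plain JDBC with Hive-friendly DDL
// (no quoted identifiers, STRING instead of TEXT), then let Spark append
// rows instead of generating its own CREATE TABLE via "overwrite".
import java.sql.DriverManager
import spark.implicits._   // spark-shell session

val url  = "jdbc:hive2://<host>:<port>/foundation;AuthMech=2;UseNativeQuery=0"  // placeholder
val user = "admin"   // placeholder
val pass = "admin"   // placeholder

val df = Seq(("John", "Smith", "London"), ("David", "Jones", "India"))
  .toDF("first_name", "last_name", "country")

// Step 1: create the target table manually, once.
Class.forName("com.cloudera.hive.jdbc41.HS2Driver")
val conn = DriverManager.getConnection(url, user, pass)
try {
  conn.createStatement().execute(
    "CREATE TABLE IF NOT EXISTS test.test (first_name STRING, last_name STRING, country STRING)")
} finally {
  conn.close()
}

// Step 2: append from Spark, so the JDBC data source only issues INSERTs.
df.write
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "test.test")
  .option("user", user)
  .option("password", pass)
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .mode("append")
  .save()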

I would open a case with Cloudera and ask which driver you should use.

Thanks



On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
wrote:

> As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
> saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
> Java client for instance) to access the Hive tables in Spark via the Thrift
> server interface.
>
> -- ND
>
> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>
> I have trying to create table in hive from spark itself,
>
> And using local mode it will work what I am trying here is from spark
> standalone I want to create the manage table in hive (another spark cluster
> basically CDH) using jdbc mode.
>
> When I try that below are the error I am facing.
>
> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
> wrote:
>
>> Have you created that table in Hive or are you trying to create it from
>> Spark itself.
>>
>> You Hive is local. In this case you don't need a JDBC connection. Have
>> you tried:
>>
>> df2.write.mode("overwrite").saveAsTable(mydb.mytable)
>>
>> HTH
>>
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
>> pbadrinath1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Trying to write data in spark to the hive as JDBC mode below  is the
>>> sample code:
>>>
>>> spark standalone 2.4.7 version
>>>
>>> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>> Spark context Web UI available at http://localhost:4040
>>> Spark context available as 'sc' (master = spark://localhost:7077, app id
>>> = app-20210715080414-0817).
>>> Spark session available as 'spark'.
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>>>   /_/
>>>
>>> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>>
>>> scala> :paste
>>> // Entering paste mode (ctrl-D to finish)
>>>
>>> val df = Seq(
>>> ("John", "Smith", "London"),
>>> ("David", "Jones", "India"),
>>> ("Michael", "Johnson", "Indonesia"),
>>> ("Chris", "Lee", "Brazil"),
>>> ("Mike", "Brown", "Russia")
>>>   ).toDF("first_name", "last_name", "country")
>>>
>>>
>>>  df.write
>>>   .format("jdbc")
>>>   .option("url",
>>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>>   .option("dbtable", "test.test")
>>>   .option("user", "admin")
>>>   .option("password", "admin")
>>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>>   .mode("overwrite")
>>>   .save
>>>
>>>
>>> // Exiting paste mode, now interpreting.
>>>
>>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>>> processing query/statement. Error Code: 4, SQL state:
>>> TStatus(statusCode:ERROR_STATUS,
>>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>>> foreign key:28:27,
>>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
>>> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
>>> org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-19 Thread Artemis User
As Mich mentioned, there is no need to use the JDBC API; the DataFrameWriter's
saveAsTable method is the way to go. The JDBC driver is for a JDBC client (a
Java client, for instance) to access the Hive tables in Spark via the Thrift
server interface.
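
For reference, a minimal Scala sketch of that saveAsTable route. It assumes
the Spark application itself is built with Hive support and can reach the
target metastore (which, per the rest of the thread, is exactly what the
original poster cannot do), and the database/table name is only illustrative:

import org.apache.spark.sql.SparkSession

// Sketch: a session with Hive support writes managed tables through the
// metastore directly; no Hive JDBC driver is involved.
val spark = SparkSession.builder()
  .appName("saveAsTableSketch")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

val df = Seq(("John", "Smith", "London"), ("David", "Jones", "India"))
  .toDF("first_name", "last_name", "country")

// Assumes the database `test` already exists in the metastore.
df.write.mode("overwrite").saveAsTable("test.test")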


-- ND

On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:

I have trying to create table in hive from spark itself,

And using local mode it will work what I am trying here is from spark 
standalone I want to create the manage table in hive (another spark 
cluster basically CDH) using jdbc mode.


When I try that below are the error I am facing.

On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
mailto:mich.talebza...@gmail.com>> wrote:


Have you created that table in Hive or are you trying to create it
from Spark itself.

You Hive is local. In this case you don't need a JDBC connection.
Have you tried:

df2.write.mode("overwrite").saveAsTable(mydb.mytable)

HTH




**view my Linkedin profile


*Disclaimer:* Use it at your own risk.Any and all responsibility
for any loss, damage or destruction of data or any other property
which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for
any monetary damages arising from such loss, damage or destruction.



On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla
mailto:pbadrinath1...@gmail.com>> wrote:

Hi,

Trying to write data in spark to the hive as JDBC mode below 
is the sample code:

spark standalone 2.4.7 version

21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For
SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040

Spark context available as 'sc' (master =
spark://localhost:7077, app id = app-20210715080414-0817).
Spark session available as 'spark'.
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java
1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = Seq(
    ("John", "Smith", "London"),
    ("David", "Jones", "India"),
    ("Michael", "Johnson", "Indonesia"),
    ("Chris", "Lee", "Brazil"),
    ("Mike", "Brown", "Russia")
  ).toDF("first_name", "last_name", "country")


 df.write
  .format("jdbc")
  .option("url",
"jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .mode("overwrite")
  .save


// Exiting paste mode, now interpreting.

java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051)
ERROR processing query/statement. Error Code: 4, SQL
state: TStatus(statusCode:ERROR_STATUS,
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error
while compiling statement: FAILED: ParseException line 1:39
cannot recognize input near '"first_name"' 'TEXT' ',' in
column name or primary key or foreign key:28:27,

org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,

org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,

org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,
org.apache.hive.service.cli.operation.Operation:run:Operation.java:260,

org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:504,

org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:HiveSessionImpl.java:490,
sun.reflect.GeneratedMethodAccessor13:invoke::-1,

sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43,
java.lang.reflect.Method:invoke:Method.java:498,

org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78,

org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36,

org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63,
java.security.AccessController:doPrivileged:AccessController.java:-2,
 

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-19 Thread Badrinath Patchikolla
I have been trying to create a table in Hive from Spark itself.

Using local mode it works; what I am trying here is, from Spark standalone, to
create a managed table in Hive (another Spark cluster, basically CDH) using
JDBC mode.

When I try that, below is the error I am facing.

On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
wrote:

> Have you created that table in Hive or are you trying to create it from
> Spark itself.
>
> You Hive is local. In this case you don't need a JDBC connection. Have you
> tried:
>
> df2.write.mode("overwrite").saveAsTable(mydb.mytable)
>
> HTH
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
> pbadrinath1...@gmail.com> wrote:
>
>> Hi,
>>
>> Trying to write data in spark to the hive as JDBC mode below  is the
>> sample code:
>>
>> spark standalone 2.4.7 version
>>
>> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> Setting default log level to "WARN".
>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>> setLogLevel(newLevel).
>> Spark context Web UI available at http://localhost:4040
>> Spark context available as 'sc' (master = spark://localhost:7077, app id
>> = app-20210715080414-0817).
>> Spark session available as 'spark'.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>>   /_/
>>
>> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> :paste
>> // Entering paste mode (ctrl-D to finish)
>>
>> val df = Seq(
>> ("John", "Smith", "London"),
>> ("David", "Jones", "India"),
>> ("Michael", "Johnson", "Indonesia"),
>> ("Chris", "Lee", "Brazil"),
>> ("Mike", "Brown", "Russia")
>>   ).toDF("first_name", "last_name", "country")
>>
>>
>>  df.write
>>   .format("jdbc")
>>   .option("url",
>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>   .option("dbtable", "test.test")
>>   .option("user", "admin")
>>   .option("password", "admin")
>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>   .mode("overwrite")
>>   .save
>>
>>
>> // Exiting paste mode, now interpreting.
>>
>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>> processing query/statement. Error Code: 4, SQL state:
>> TStatus(statusCode:ERROR_STATUS,
>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>> foreign key:28:27,
>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
>> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
>> org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,
>> org.apache.hive.service.cli.operation.Operation:run:Operation.java:260,
>> org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:504,
>> org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:HiveSessionImpl.java:490,
>> sun.reflect.GeneratedMethodAccessor13:invoke::-1,
>> sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43,
>> java.lang.reflect.Method:invoke:Method.java:498,
>> org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78,
>> org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36,
>> org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63,
>> java.security.AccessController:doPrivileged:AccessController.java:-2,
>> javax.security.auth.Subject:doAs:Subject.java:422,
>> org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1875,
>> org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59,
>> com.sun.proxy.$Proxy35:executeStatementAsync::-1,
>> org.apache.hive.service.cli.CLIService:executeStatementAsync:CLIService.java:295,
>> org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507,
>> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437,
>> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422,
>> org.apache.thr

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-19 Thread Mich Talebzadeh
Your Driver seems to be OK.

 hive_driver: com.cloudera.hive.jdbc41.HS2Driver

However, this is the SQL error you are getting:

Caused by: com.cloudera.hiveserver2.support.exceptions.GeneralException:
[Cloudera][HiveJDBCDriver](500051) ERROR processing query/statement. Error
Code: 4, SQL state: TStatus(statusCode:ERROR_STATUS,
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
compiling statement: FAILED: ParseException line 1:39 cannot recognize
input near '"first_name"' 'TEXT' ',' in column name or primary key or
foreign key:28


Are you using a reserved word for table columns? What is your DDL for this
table?
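
If the generated DDL is the problem (the ParseException shows the writer
emitting '"first_name"' TEXT), one thing worth trying is pinning the column
types through the JDBC data source's createTableColumnTypes option. A hedged
Scala sketch for the spark-shell session quoted below; the connection details
are placeholders, and the driver may still reject the quoted identifiers, so
this needs to be verified against the server:

import spark.implicits._   // spark-shell session

val df = Seq(("John", "Smith", "London"))
  .toDF("first_name", "last_name", "country")

// createTableColumnTypes overrides the types Spark puts into its generated
// CREATE TABLE statement (by default it maps StringType to TEXT).
df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://<host>:<port>/foundation;AuthMech=2;UseNativeQuery=0")  // placeholder
  .option("dbtable", "test.test")
  .option("user", "admin")        // placeholder
  .option("password", "admin")    // placeholder
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .option("createTableColumnTypes",
          "first_name STRING, last_name STRING, country STRING")
  .mode("overwrite")
  .save()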


HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 19 Jul 2021 at 07:42, Badrinath Patchikolla <
pbadrinath1...@gmail.com> wrote:

> I have trying to create table in hive from spark itself,
>
> And using local mode it will work what I am trying here is from spark
> standalone I want to create the manage table in hive (another spark cluster
> basically CDH) using jdbc mode.
>
> When I try that below are the error I am facing.
>
> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
> wrote:
>
>> Have you created that table in Hive or are you trying to create it from
>> Spark itself.
>>
>> You Hive is local. In this case you don't need a JDBC connection. Have
>> you tried:
>>
>> df2.write.mode("overwrite").saveAsTable(mydb.mytable)
>>
>> HTH
>>
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
>> pbadrinath1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Trying to write data in spark to the hive as JDBC mode below  is the
>>> sample code:
>>>
>>> spark standalone 2.4.7 version
>>>
>>> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>> Spark context Web UI available at http://localhost:4040
>>> Spark context available as 'sc' (master = spark://localhost:7077, app id
>>> = app-20210715080414-0817).
>>> Spark session available as 'spark'.
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>>>   /_/
>>>
>>> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>>
>>> scala> :paste
>>> // Entering paste mode (ctrl-D to finish)
>>>
>>> val df = Seq(
>>> ("John", "Smith", "London"),
>>> ("David", "Jones", "India"),
>>> ("Michael", "Johnson", "Indonesia"),
>>> ("Chris", "Lee", "Brazil"),
>>> ("Mike", "Brown", "Russia")
>>>   ).toDF("first_name", "last_name", "country")
>>>
>>>
>>>  df.write
>>>   .format("jdbc")
>>>   .option("url",
>>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>>   .option("dbtable", "test.test")
>>>   .option("user", "admin")
>>>   .option("password", "admin")
>>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>>   .mode("overwrite")
>>>   .save
>>>
>>>
>>> // Exiting paste mode, now interpreting.
>>>
>>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>>> processing query/statement. Error Code: 4, SQL state:
>>> TStatus(statusCode:ERROR_STATUS,
>>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>>> foreign key:28:27,
>>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
>>> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
>>> org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,
>>> org.apache.hive.service.cli.operation.Operation:run:Operation.java:260,
>>> org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:504,
>>> org.apache.hive.service.cli.session.HiveSessionImpl:executeState

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-15 Thread Mich Talebzadeh
Have you created that table in Hive, or are you trying to create it from
Spark itself?

Your Hive is local. In this case you don't need a JDBC connection. Have you
tried:

df2.write.mode("overwrite").saveAsTable("mydb.mytable")

HTH




   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
pbadrinath1...@gmail.com> wrote:

> Hi,
>
> Trying to write data in spark to the hive as JDBC mode below  is the
> sample code:
>
> spark standalone 2.4.7 version
>
> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
> Spark context Web UI available at http://localhost:4040
> Spark context available as 'sc' (master = spark://localhost:7077, app id =
> app-20210715080414-0817).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>   /_/
>
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> val df = Seq(
> ("John", "Smith", "London"),
> ("David", "Jones", "India"),
> ("Michael", "Johnson", "Indonesia"),
> ("Chris", "Lee", "Brazil"),
> ("Mike", "Brown", "Russia")
>   ).toDF("first_name", "last_name", "country")
>
>
>  df.write
>   .format("jdbc")
>   .option("url",
> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>   .option("dbtable", "test.test")
>   .option("user", "admin")
>   .option("password", "admin")
>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>   .mode("overwrite")
>   .save
>
>
> // Exiting paste mode, now interpreting.
>
> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR processing
> query/statement. Error Code: 4, SQL state:
> TStatus(statusCode:ERROR_STATUS,
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
> compiling statement: FAILED: ParseException line 1:39 cannot recognize
> input near '"first_name"' 'TEXT' ',' in column name or primary key or
> foreign key:28:27,
> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
> org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,
> org.apache.hive.service.cli.operation.Operation:run:Operation.java:260,
> org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:504,
> org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:HiveSessionImpl.java:490,
> sun.reflect.GeneratedMethodAccessor13:invoke::-1,
> sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43,
> java.lang.reflect.Method:invoke:Method.java:498,
> org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78,
> org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36,
> org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63,
> java.security.AccessController:doPrivileged:AccessController.java:-2,
> javax.security.auth.Subject:doAs:Subject.java:422,
> org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1875,
> org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59,
> com.sun.proxy.$Proxy35:executeStatementAsync::-1,
> org.apache.hive.service.cli.CLIService:executeStatementAsync:CLIService.java:295,
> org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507,
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437,
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422,
> org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39,
> org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39,
> org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56,
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286,
> java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149,
> java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624,
> ja

Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-15 Thread Badrinath Patchikolla
Hi,

Trying to write data from Spark to Hive in JDBC mode; below is the sample
code:

spark standalone 2.4.7 version

21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = spark://localhost:7077, app id =
app-20210715080414-0817).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = Seq(
("John", "Smith", "London"),
("David", "Jones", "India"),
("Michael", "Johnson", "Indonesia"),
("Chris", "Lee", "Brazil"),
("Mike", "Brown", "Russia")
  ).toDF("first_name", "last_name", "country")


 df.write
  .format("jdbc")
  .option("url",
"jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .mode("overwrite")
  .save


// Exiting paste mode, now interpreting.

java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR processing
query/statement. Error Code: 4, SQL state:
TStatus(statusCode:ERROR_STATUS,
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
compiling statement: FAILED: ParseException line 1:39 cannot recognize
input near '"first_name"' 'TEXT' ',' in column name or primary key or
foreign key:28:27,
org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,
org.apache.hive.service.cli.operation.Operation:run:Operation.java:260,
org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:504,
org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:HiveSessionImpl.java:490,
sun.reflect.GeneratedMethodAccessor13:invoke::-1,
sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43,
java.lang.reflect.Method:invoke:Method.java:498,
org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78,
org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36,
org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63,
java.security.AccessController:doPrivileged:AccessController.java:-2,
javax.security.auth.Subject:doAs:Subject.java:422,
org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1875,
org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59,
com.sun.proxy.$Proxy35:executeStatementAsync::-1,
org.apache.hive.service.cli.CLIService:executeStatementAsync:CLIService.java:295,
org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507,
org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437,
org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422,
org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39,
org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39,
org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56,
org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286,
java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149,
java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624,
java.lang.Thread:run:Thread.java:748,
*org.apache.hadoop.hive.ql.parse.ParseException:line 1:39 cannot recognize
input near '"first_name"' 'TEXT' ',' in column name or primary key or
foreign key:33:6,
org.apache.hadoop.hive.ql.parse.ParseDriver:parse:ParseDriver.java:221,
org.apache.hadoop.hive.ql.parse.ParseUtils:parse:ParseUtils.java:75,
org.apache.hadoop.hive.ql.parse.ParseUtils:parse:ParseUtils.java:68,
org.apache.hadoop.hive.ql.Driver:compile:Driver.java:564,
org.apache.hadoop.hive.ql.Driver:compileInternal:Driver.java:1425,
org.apache.hadoop.hive.ql.Driver:compileAndRespond:Driver.java:1398,
org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:205],
sqlState:42000, errorCode:4, errorMessage:Error while compiling
statement: FAILED: ParseException line 1:39 cannot recognize input near
'"first_name"' 'TEXT' ',' in column name or primary key or foreign key),
Query: CREA

Re: Insert into table with one the value is derived from DB function using spark

2021-06-20 Thread Mich Talebzadeh
any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 22:14, Mich Talebzadeh 
>> wrote:
>>
>>> Well the challenge is that Spark is best suited to insert a dataframe
>>> into the Oracle table, i.e. a bulk insert
>>>
>>> that  insert into table (column list) values (..) is a single record
>>> insert .. Can you try creating a staging table in oracle without
>>> get_function() column and do a bulk insert from Spark dataframe to that
>>> staging table?
>>>
>>> HTH
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 21:53, Anshul Kala  wrote:
>>>
>>>>
>>>> Hi Mich,
>>>>
>>>> Thanks for your reply. Please advise the insert query that I need to
>>>> substitute should be like below:
>>>>
>>>> Insert into table(a,b,c) values(?,get_function_value(?),?)
>>>>
>>>> In the statement above :
>>>>
>>>>  ?  : refers to value from dataframe column values
>>>> get_function_value : refers to be the function where one of the data
>>>> frame column is passed as input
>>>>
>>>>
>>>> Thanks
>>>> Anshul
>>>>
>>>>
>>>> Thanks
>>>> Anshul
>>>>
>>>> On Fri, Jun 18, 2021 at 4:29 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I gather you mean using JDBC to write to the Oracle table?
>>>>>
>>>>> Spark provides a unified framework to write to any JDBC
>>>>> compliant database.
>>>>>
>>>>> def writeTableWithJDBC(dataFrame, url, tableName, user, password,
>>>>> driver, mode):
>>>>> try:
>>>>> dataFrame. \
>>>>> write. \
>>>>> format("jdbc"). \
>>>>> option("url", url). \
>>>>> option("dbtable", tableName). \
>>>>> option("user", user). \
>>>>> option("password", password). \
>>>>> option("driver", driver). \
>>>>> mode(mode). \
>>>>> save()
>>>>> except Exception as e:
>>>>> print(f"""{e}, quitting""")
>>>>> sys.exit(1)
>>>>>
>>>>> and how to write it
>>>>>
>>>>>  def loadIntoOracleTable(self, df2):
>>>>> # write to Oracle table, all uppercase not mixed case and
>>>>> column names <= 30 characters in version 12.1
>>>>> tableName =
>>>>> self.config['OracleVariables']['yearlyAveragePricesAllTable']
>>>>> fullyQualifiedTableName =
>>>>> self.config['OracleVariables']['dbschema']+'.'+tableName
>>>>> user = self.config['OracleVariables']['oracle_user']
>>>>> password = self.config['OracleVariables']['oracle_password']
>>>>> driver = self.config['OracleVariables']['oracle_driver']
>>>>> mode = self.config['OracleVariables']['mode']
>>>>>
>>>>> s.writeTableWithJDBC(df2,oracle_url,fullyQualifiedTableName,user,password,driver,mode)
>>>>> print(f"""created
>>>>> {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
>>>>> # read data to ensure all loaded OK
>>>>> fetchsize = self.config['OracleVariables']['fetchsize']
>>>>> read_df =
>>>>> s.loadTableFromJDBC(self.spark,oracle_url,fullyQualifiedTableName,user,password,driver,fetchsize)
>>>>> # check that all rows are there
>>>>> if df2.subtract(read_df).count() == 0:
>>>>> print("Data has been loaded OK to Oracle table")
>>>>> else:
>>>>> print("Data could not be loaded to Oracle table, quitting")
>>>>> sys.exit(1)
>>>>>
>>>>> in the statement where it says
>>>>>
>>>>>  option("dbtable", tableName). \
>>>>>
>>>>> You can replace *tableName* with the equivalent SQL insert statement
>>>>>
>>>>> You will need JDBC driver for Oracle say ojdbc6.jar in
>>>>> $SPARK_HOME/conf/spark-defaults.conf
>>>>>
>>>>> spark.driver.extraClassPath
>>>>>  /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 18 Jun 2021 at 20:49, Anshul Kala 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am using spark to ingest data from file to database Oracle table .
>>>>>> For one of the fields , the value to be populated is generated from a
>>>>>> function that is written in database .
>>>>>>
>>>>>> The input to the function is one of the fields of data frame
>>>>>>
>>>>>> I wanted to use spark.dbc.write to perform the operation, which
>>>>>> generates the insert query at back end .
>>>>>>
>>>>>> For example : It can generate the insert query as :
>>>>>>
>>>>>> Insert into table values (?,?, getfunctionvalue(?) )
>>>>>>
>>>>>> Please advise if it is possible in spark and if yes , how can it be
>>>>>> done
>>>>>>
>>>>>> This is little urgent for me . So any help is appreciated
>>>>>>
>>>>>> Thanks
>>>>>> Anshul
>>>>>>
>>>>>


Re: Insert into table with one the value is derived from DB function using spark

2021-06-19 Thread Sebastian Piu
tion_value(?),?)
>>>
>>> In the statement above :
>>>
>>>  ?  : refers to value from dataframe column values
>>> get_function_value : refers to be the function where one of the data
>>> frame column is passed as input
>>>
>>>
>>> Thanks
>>> Anshul
>>>
>>>
>>> Thanks
>>> Anshul
>>>
>>> On Fri, Jun 18, 2021 at 4:29 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> I gather you mean using JDBC to write to the Oracle table?
>>>>
>>>> Spark provides a unified framework to write to any JDBC
>>>> compliant database.
>>>>
>>>> def writeTableWithJDBC(dataFrame, url, tableName, user, password,
>>>> driver, mode):
>>>> try:
>>>> dataFrame. \
>>>> write. \
>>>> format("jdbc"). \
>>>> option("url", url). \
>>>> option("dbtable", tableName). \
>>>> option("user", user). \
>>>> option("password", password). \
>>>> option("driver", driver). \
>>>> mode(mode). \
>>>> save()
>>>> except Exception as e:
>>>> print(f"""{e}, quitting""")
>>>> sys.exit(1)
>>>>
>>>> and how to write it
>>>>
>>>>  def loadIntoOracleTable(self, df2):
>>>> # write to Oracle table, all uppercase not mixed case and
>>>> column names <= 30 characters in version 12.1
>>>> tableName =
>>>> self.config['OracleVariables']['yearlyAveragePricesAllTable']
>>>> fullyQualifiedTableName =
>>>> self.config['OracleVariables']['dbschema']+'.'+tableName
>>>> user = self.config['OracleVariables']['oracle_user']
>>>> password = self.config['OracleVariables']['oracle_password']
>>>> driver = self.config['OracleVariables']['oracle_driver']
>>>> mode = self.config['OracleVariables']['mode']
>>>>
>>>> s.writeTableWithJDBC(df2,oracle_url,fullyQualifiedTableName,user,password,driver,mode)
>>>> print(f"""created
>>>> {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
>>>> # read data to ensure all loaded OK
>>>> fetchsize = self.config['OracleVariables']['fetchsize']
>>>> read_df =
>>>> s.loadTableFromJDBC(self.spark,oracle_url,fullyQualifiedTableName,user,password,driver,fetchsize)
>>>> # check that all rows are there
>>>> if df2.subtract(read_df).count() == 0:
>>>> print("Data has been loaded OK to Oracle table")
>>>> else:
>>>> print("Data could not be loaded to Oracle table, quitting")
>>>> sys.exit(1)
>>>>
>>>> in the statement where it says
>>>>
>>>>  option("dbtable", tableName). \
>>>>
>>>> You can replace *tableName* with the equivalent SQL insert statement
>>>>
>>>> You will need JDBC driver for Oracle say ojdbc6.jar in
>>>> $SPARK_HOME/conf/spark-defaults.conf
>>>>
>>>> spark.driver.extraClassPath
>>>>  /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 18 Jun 2021 at 20:49, Anshul Kala 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am using spark to ingest data from file to database Oracle table .
>>>>> For one of the fields , the value to be populated is generated from a
>>>>> function that is written in database .
>>>>>
>>>>> The input to the function is one of the fields of data frame
>>>>>
>>>>> I wanted to use spark.dbc.write to perform the operation, which
>>>>> generates the insert query at back end .
>>>>>
>>>>> For example : It can generate the insert query as :
>>>>>
>>>>> Insert into table values (?,?, getfunctionvalue(?) )
>>>>>
>>>>> Please advise if it is possible in spark and if yes , how can it be
>>>>> done
>>>>>
>>>>> This is little urgent for me . So any help is appreciated
>>>>>
>>>>> Thanks
>>>>> Anshul
>>>>>
>>>>


Re: Insert into table with one the value is derived from DB function using spark

2021-06-19 Thread Mich Talebzadeh
   write. \
>>> format("jdbc"). \
>>> option("url", url). \
>>> option("dbtable", tableName). \
>>> option("user", user). \
>>> option("password", password). \
>>> option("driver", driver). \
>>> mode(mode). \
>>> save()
>>> except Exception as e:
>>> print(f"""{e}, quitting""")
>>> sys.exit(1)
>>>
>>> and how to write it
>>>
>>>  def loadIntoOracleTable(self, df2):
>>> # write to Oracle table, all uppercase not mixed case and column
>>> names <= 30 characters in version 12.1
>>> tableName =
>>> self.config['OracleVariables']['yearlyAveragePricesAllTable']
>>> fullyQualifiedTableName =
>>> self.config['OracleVariables']['dbschema']+'.'+tableName
>>> user = self.config['OracleVariables']['oracle_user']
>>> password = self.config['OracleVariables']['oracle_password']
>>> driver = self.config['OracleVariables']['oracle_driver']
>>> mode = self.config['OracleVariables']['mode']
>>>
>>> s.writeTableWithJDBC(df2,oracle_url,fullyQualifiedTableName,user,password,driver,mode)
>>> print(f"""created
>>> {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
>>> # read data to ensure all loaded OK
>>> fetchsize = self.config['OracleVariables']['fetchsize']
>>> read_df =
>>> s.loadTableFromJDBC(self.spark,oracle_url,fullyQualifiedTableName,user,password,driver,fetchsize)
>>> # check that all rows are there
>>> if df2.subtract(read_df).count() == 0:
>>> print("Data has been loaded OK to Oracle table")
>>>     else:
>>> print("Data could not be loaded to Oracle table, quitting")
>>> sys.exit(1)
>>>
>>> in the statement where it says
>>>
>>>  option("dbtable", tableName). \
>>>
>>> You can replace *tableName* with the equivalent SQL insert statement
>>>
>>> You will need JDBC driver for Oracle say ojdbc6.jar in
>>> $SPARK_HOME/conf/spark-defaults.conf
>>>
>>> spark.driver.extraClassPath
>>>  /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 20:49, Anshul Kala  wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am using spark to ingest data from file to database Oracle table .
>>>> For one of the fields , the value to be populated is generated from a
>>>> function that is written in database .
>>>>
>>>> The input to the function is one of the fields of data frame
>>>>
>>>> I wanted to use spark.dbc.write to perform the operation, which
>>>> generates the insert query at back end .
>>>>
>>>> For example : It can generate the insert query as :
>>>>
>>>> Insert into table values (?,?, getfunctionvalue(?) )
>>>>
>>>> Please advise if it is possible in spark and if yes , how can it be
>>>> done
>>>>
>>>> This is little urgent for me . So any help is appreciated
>>>>
>>>> Thanks
>>>> Anshul
>>>>
>>>


Re: Insert into table with one the value is derived from DB function using spark

2021-06-19 Thread ayan guha
Hi

Why can't this be done by an Oracle insert trigger? Or even a view?
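
A hedged sketch of the trigger idea: create a BEFORE INSERT trigger once on
the Oracle side, so a plain bulk insert from Spark can bind the raw value and
the trigger applies the DB function. All object names, the function name from
this thread, and the connection details are placeholders:

import java.sql.DriverManager

val url  = "jdbc:oracle:thin:@//<host>:1521/<service>"   // placeholder
val conn = DriverManager.getConnection(url, "app_user", "app_password")  // placeholders
try {
  // One-off DDL: the trigger rewrites column b with the DB function on every
  // insert into TARGET_TABLE, not only inserts coming from Spark.
  conn.createStatement().execute(
    """CREATE OR REPLACE TRIGGER trg_apply_fn
      |BEFORE INSERT ON TARGET_TABLE
      |FOR EACH ROW
      |BEGIN
      |  :NEW.b := get_function_value(:NEW.b);
      |END;""".stripMargin)
} finally {
  conn.close()
}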

On Sat, 19 Jun 2021 at 7:17 am, Mich Talebzadeh 
wrote:

> Well the challenge is that Spark is best suited to insert a dataframe into
> the Oracle table, i.e. a bulk insert
>
> that  insert into table (column list) values (..) is a single record
> insert .. Can you try creating a staging table in oracle without
> get_function() column and do a bulk insert from Spark dataframe to that
> staging table?
>
> HTH
>
> Mich
>
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Jun 2021 at 21:53, Anshul Kala  wrote:
>
>>
>> Hi Mich,
>>
>> Thanks for your reply. Please advise the insert query that I need to
>> substitute should be like below:
>>
>> Insert into table(a,b,c) values(?,get_function_value(?),?)
>>
>> In the statement above :
>>
>>  ?  : refers to value from dataframe column values
>> get_function_value : refers to be the function where one of the data
>> frame column is passed as input
>>
>>
>> Thanks
>> Anshul
>>
>>
>> Thanks
>> Anshul
>>
>> On Fri, Jun 18, 2021 at 4:29 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I gather you mean using JDBC to write to the Oracle table?
>>>
>>> Spark provides a unified framework to write to any JDBC
>>> compliant database.
>>>
>>> def writeTableWithJDBC(dataFrame, url, tableName, user, password,
>>> driver, mode):
>>> try:
>>> dataFrame. \
>>> write. \
>>> format("jdbc"). \
>>> option("url", url). \
>>> option("dbtable", tableName). \
>>> option("user", user). \
>>> option("password", password). \
>>> option("driver", driver). \
>>> mode(mode). \
>>> save()
>>> except Exception as e:
>>> print(f"""{e}, quitting""")
>>> sys.exit(1)
>>>
>>> and how to write it
>>>
>>>  def loadIntoOracleTable(self, df2):
>>> # write to Oracle table, all uppercase not mixed case and column
>>> names <= 30 characters in version 12.1
>>> tableName =
>>> self.config['OracleVariables']['yearlyAveragePricesAllTable']
>>> fullyQualifiedTableName =
>>> self.config['OracleVariables']['dbschema']+'.'+tableName
>>> user = self.config['OracleVariables']['oracle_user']
>>> password = self.config['OracleVariables']['oracle_password']
>>> driver = self.config['OracleVariables']['oracle_driver']
>>> mode = self.config['OracleVariables']['mode']
>>>
>>> s.writeTableWithJDBC(df2,oracle_url,fullyQualifiedTableName,user,password,driver,mode)
>>> print(f"""created
>>> {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
>>> # read data to ensure all loaded OK
>>> fetchsize = self.config['OracleVariables']['fetchsize']
>>> read_df =
>>> s.loadTableFromJDBC(self.spark,oracle_url,fullyQualifiedTableName,user,password,driver,fetchsize)
>>> # check that all rows are there
>>> if df2.subtract(read_df).count() == 0:
>>> print("Data has been loaded OK to Oracle table")
>>> else:
>>> print("Data could not be loaded to Oracle table, quitting")
>>> sys.exit(1)
>>>
>>> in the statement where it says
>>>
>>>  option("dbtable", tableName). \
>>>
>>> You can replace *tableName* with the equivalent SQL insert statement
>>>
>>> You will need JDBC driver for Oracle say ojdbc6.jar in
>>> $SPARK_HOME/conf/spark-defaults.conf
>>

Re: Insert into table with one the value is derived from DB function using spark

2021-06-18 Thread Mich Talebzadeh
Well, the challenge is that Spark is best suited to inserting a DataFrame into
the Oracle table, i.e. a bulk insert, whereas insert into table (column list)
values (..) is a single-record insert. Can you try creating a staging table in
Oracle without the get_function() column and doing a bulk insert from the
Spark DataFrame into that staging table?
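
A hedged Scala sketch of that staging-table route: bulk-insert the DataFrame
into a staging table from Spark, then run one set-based INSERT ... SELECT on
the Oracle side that applies the function. The table names, the function name
from this thread, and the connection details are all placeholders:

import java.sql.DriverManager
import spark.implicits._

val url  = "jdbc:oracle:thin:@//<host>:1521/<service>"   // placeholder
val user = "app_user"       // placeholder
val pass = "app_password"   // placeholder

// Stand-in for the DataFrame read from the file.
val df = Seq((1, "raw_value", "other")).toDF("a", "b", "c")

// Step 1: bulk insert into a staging table that has no derived column.
df.write
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "STAGING_TABLE")
  .option("user", user)
  .option("password", pass)
  .option("driver", "oracle.jdbc.OracleDriver")
  .mode("append")
  .save()

// Step 2: one set-based statement applies the DB function on the Oracle side.
val conn = DriverManager.getConnection(url, user, pass)
try {
  conn.createStatement().executeUpdate(
    "INSERT INTO TARGET_TABLE (a, b, c) " +
    "SELECT a, get_function_value(b), c FROM STAGING_TABLE")
} finally {
  conn.close()
}

The same two steps should work from PySpark as well; only the plain JDBC call
in step 2 needs a JVM or an Oracle client library on the driver side.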

HTH

Mich




   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Jun 2021 at 21:53, Anshul Kala  wrote:

>
> Hi Mich,
>
> Thanks for your reply. Please advise the insert query that I need to
> substitute should be like below:
>
> Insert into table(a,b,c) values(?,get_function_value(?),?)
>
> In the statement above :
>
>  ?  : refers to value from dataframe column values
> get_function_value : refers to be the function where one of the data frame
> column is passed as input
>
>
> Thanks
> Anshul
>
>
> Thanks
> Anshul
>
> On Fri, Jun 18, 2021 at 4:29 PM Mich Talebzadeh 
> wrote:
>
>> I gather you mean using JDBC to write to the Oracle table?
>>
>> Spark provides a unified framework to write to any JDBC
>> compliant database.
>>
>> def writeTableWithJDBC(dataFrame, url, tableName, user, password, driver,
>> mode):
>> try:
>> dataFrame. \
>> write. \
>> format("jdbc"). \
>> option("url", url). \
>> option("dbtable", tableName). \
>> option("user", user). \
>> option("password", password). \
>> option("driver", driver). \
>> mode(mode). \
>> save()
>> except Exception as e:
>> print(f"""{e}, quitting""")
>> sys.exit(1)
>>
>> and how to write it
>>
>>  def loadIntoOracleTable(self, df2):
>> # write to Oracle table, all uppercase not mixed case and column
>> names <= 30 characters in version 12.1
>> tableName =
>> self.config['OracleVariables']['yearlyAveragePricesAllTable']
>> fullyQualifiedTableName =
>> self.config['OracleVariables']['dbschema']+'.'+tableName
>> user = self.config['OracleVariables']['oracle_user']
>> password = self.config['OracleVariables']['oracle_password']
>> driver = self.config['OracleVariables']['oracle_driver']
>> mode = self.config['OracleVariables']['mode']
>>
>> s.writeTableWithJDBC(df2,oracle_url,fullyQualifiedTableName,user,password,driver,mode)
>> print(f"""created
>> {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
>> # read data to ensure all loaded OK
>> fetchsize = self.config['OracleVariables']['fetchsize']
>> read_df =
>> s.loadTableFromJDBC(self.spark,oracle_url,fullyQualifiedTableName,user,password,driver,fetchsize)
>> # check that all rows are there
>> if df2.subtract(read_df).count() == 0:
>> print("Data has been loaded OK to Oracle table")
>> else:
>> print("Data could not be loaded to Oracle table, quitting")
>> sys.exit(1)
>>
>> in the statement where it says
>>
>>  option("dbtable", tableName). \
>>
>> You can replace *tableName* with the equivalent SQL insert statement
>>
>> You will need JDBC driver for Oracle say ojdbc6.jar in
>> $SPARK_HOME/conf/spark-defaults.conf
>>
>> spark.driver.extraClassPath
>>  /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 20:49, Anshul Kala  wrote:
>>
>>> Hi All,
>>>
>>> I am using spark to ingest data from file to database Oracle table . For
>>> one of the fields , the value to be populated is generated from a function
>>> that is written in database .
>>>
>>> The input to the function is one of the fields of data frame
>>>
>>> I wanted to use spark.dbc.write to perform the operation, which
>>> generates the insert query at back end .
>>>
>>> For example : It can generate the insert query as :
>>>
>>> Insert into table values (?,?, getfunctionvalue(?) )
>>>
>>> Please advise if it is possible in spark and if yes , how can it be done
>>>
>>> This is little urgent for me . So any help is appreciated
>>>
>>> Thanks
>>> Anshul
>>>
>>


Re: Insert into table with one the value is derived from DB function using spark

2021-06-18 Thread Anshul Kala
Hi Mich,

Thanks for your reply. Please advise whether the insert query that I need to
substitute should be like the one below:

Insert into table(a,b,c) values(?,get_function_value(?),?)

In the statement above:

 ?  : refers to a value from the dataframe columns
get_function_value : refers to the database function to which one of the
dataframe columns is passed as input


Thanks
Anshul


Thanks
Anshul

On Fri, Jun 18, 2021 at 4:29 PM Mich Talebzadeh 
wrote:

> I gather you mean using JDBC to write to the Oracle table?
>
> Spark provides a unified framework to write to any JDBC compliant database.
>
> def writeTableWithJDBC(dataFrame, url, tableName, user, password, driver,
> mode):
> try:
> dataFrame. \
> write. \
> format("jdbc"). \
> option("url", url). \
> option("dbtable", tableName). \
> option("user", user). \
> option("password", password). \
> option("driver", driver). \
> mode(mode). \
> save()
> except Exception as e:
> print(f"""{e}, quitting""")
> sys.exit(1)
>
> and how to write it
>
>  def loadIntoOracleTable(self, df2):
> # write to Oracle table, all uppercase not mixed case and column
> names <= 30 characters in version 12.1
> tableName =
> self.config['OracleVariables']['yearlyAveragePricesAllTable']
> fullyQualifiedTableName =
> self.config['OracleVariables']['dbschema']+'.'+tableName
> user = self.config['OracleVariables']['oracle_user']
> password = self.config['OracleVariables']['oracle_password']
> driver = self.config['OracleVariables']['oracle_driver']
> mode = self.config['OracleVariables']['mode']
>
> s.writeTableWithJDBC(df2,oracle_url,fullyQualifiedTableName,user,password,driver,mode)
> print(f"""created
> {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
> # read data to ensure all loaded OK
> fetchsize = self.config['OracleVariables']['fetchsize']
> read_df =
> s.loadTableFromJDBC(self.spark,oracle_url,fullyQualifiedTableName,user,password,driver,fetchsize)
> # check that all rows are there
> if df2.subtract(read_df).count() == 0:
> print("Data has been loaded OK to Oracle table")
> else:
> print("Data could not be loaded to Oracle table, quitting")
> sys.exit(1)
>
> in the statement where it says
>
>  option("dbtable", tableName). \
>
> You can replace *tableName* with the equivalent SQL insert statement
>
> You will need JDBC driver for Oracle say ojdbc6.jar in
> $SPARK_HOME/conf/spark-defaults.conf
>
> spark.driver.extraClassPath
>  /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Jun 2021 at 20:49, Anshul Kala  wrote:
>
>> Hi All,
>>
>> I am using spark to ingest data from file to database Oracle table . For
>> one of the fields , the value to be populated is generated from a function
>> that is written in database .
>>
>> The input to the function is one of the fields of data frame
>>
>> I wanted to use spark.dbc.write to perform the operation, which generates
>> the insert query at back end .
>>
>> For example : It can generate the insert query as :
>>
>> Insert into table values (?,?, getfunctionvalue(?) )
>>
>> Please advise if it is possible in spark and if yes , how can it be done
>>
>> This is little urgent for me . So any help is appreciated
>>
>> Thanks
>> Anshul
>>
>


Re: Insert into table with one the value is derived from DB function using spark

2021-06-18 Thread Mich Talebzadeh
I gather you mean using JDBC to write to the Oracle table?

Spark provides a unified framework to write to any JDBC compliant database.

def writeTableWithJDBC(dataFrame, url, tableName, user, password, driver, mode):
    try:
        dataFrame. \
            write. \
            format("jdbc"). \
            option("url", url). \
            option("dbtable", tableName). \
            option("user", user). \
            option("password", password). \
            option("driver", driver). \
            mode(mode). \
            save()
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)

and how to call it:

    def loadIntoOracleTable(self, df2):
        # write to Oracle table, all uppercase not mixed case and column
        # names <= 30 characters in version 12.1
        tableName = self.config['OracleVariables']['yearlyAveragePricesAllTable']
        fullyQualifiedTableName = self.config['OracleVariables']['dbschema'] + '.' + tableName
        user = self.config['OracleVariables']['oracle_user']
        password = self.config['OracleVariables']['oracle_password']
        driver = self.config['OracleVariables']['oracle_driver']
        mode = self.config['OracleVariables']['mode']

        s.writeTableWithJDBC(df2, oracle_url, fullyQualifiedTableName, user, password, driver, mode)
        print(f"""created {config['OracleVariables']['yearlyAveragePricesAllTable']}""")
        # read data back to ensure all loaded OK
        fetchsize = self.config['OracleVariables']['fetchsize']
        read_df = s.loadTableFromJDBC(self.spark, oracle_url, fullyQualifiedTableName, user, password, driver, fetchsize)
        # check that all rows are there
        if df2.subtract(read_df).count() == 0:
            print("Data has been loaded OK to Oracle table")
        else:
            print("Data could not be loaded to Oracle table, quitting")
            sys.exit(1)

In the statement where it says

 option("dbtable", tableName). \

you can replace *tableName* with the equivalent SQL insert statement.

You will need the JDBC driver for Oracle, say ojdbc6.jar, referenced in
$SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraClassPath
 /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Jun 2021 at 20:49, Anshul Kala  wrote:

> Hi All,
>
> I am using spark to ingest data from file to database Oracle table . For
> one of the fields , the value to be populated is generated from a function
> that is written in database .
>
> The input to the function is one of the fields of data frame
>
> I wanted to use spark.dbc.write to perform the operation, which generates
> the insert query at back end .
>
> For example : It can generate the insert query as :
>
> Insert into table values (?,?, getfunctionvalue(?) )
>
> Please advise if it is possible in spark and if yes , how can it be done
>
> This is little urgent for me . So any help is appreciated
>
> Thanks
> Anshul
>


Insert into table with one the value is derived from DB function using spark

2021-06-18 Thread Anshul Kala
Hi All,

I am using Spark to ingest data from a file into an Oracle database table. For
one of the fields, the value to be populated is generated by a function that
is written in the database.

The input to the function is one of the fields of the data frame.

I wanted to use Spark's JDBC write to perform the operation, which generates
the insert query at the back end.

For example, it can generate the insert query as:

Insert into table values (?,?, getfunctionvalue(?) )

Please advise if it is possible in Spark and, if yes, how it can be done.

This is a little urgent for me, so any help is appreciated.

Thanks
Anshul


Re: Moving millions of file using spark

2021-06-16 Thread Molotch
Definitely not a spark task.

Moving files within the same filesystem is merely a linking exercise; you
don't have to actually move any data. Write a shell script creating hard
links in the new location; once you're satisfied, remove the old links,
profit.
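
For illustration, the same linking exercise sketched in Python rather than
shell (the paths and the file-name pattern below are made-up examples):

import os
import re

src_dir = "/data/landing"                      # illustrative source folder
pattern = re.compile(r"^(\w+)_\d+\.parquet$")  # illustrative file-name pattern

for name in os.listdir(src_dir):
    m = pattern.match(name)
    if not m:
        continue
    dest_dir = os.path.join("/data/by_key", m.group(1))
    os.makedirs(dest_dir, exist_ok=True)
    # Same filesystem: os.link only adds a directory entry, no data is copied.
    os.link(os.path.join(src_dir, name), os.path.join(dest_dir, name))
    # Once verified, drop the old entry:
    # os.remove(os.path.join(src_dir, name))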




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Moving millions of file using spark

2021-06-16 Thread rajat kumar
Hello ,

I know this might not be a valid use case for Spark, but I have millions of
files in a single folder. The file names follow a pattern, and based on that
pattern I want to move them to different directories.

Can you please suggest what can be done?

Thanks
rajat


Data Lakes using Spark

2021-04-06 Thread Boris Litvak
Hi Friends,

I’d like to publish a document to Medium about data lakes using Spark.
Its latter parts include info that is not widely known, unless you have 
experience with data lakes.

https://github.com/borislitvak/datalake-article/blob/initial_comments/Building%20a%20Real%20Life%20Data%20Lake%20in%C2%A0AWS.md
I hope it’s OK if I ask you to review its draft.

You can respond here or contact me directly.
If there are some topics I should add (like, compaction effect on downstream 
reads using structured streaming), or there are errors, please point them out 
before it gets out.
Also, if some points are unclear or misleading, please state so.

Thanks,

Boris Litvak


Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-23 Thread Mich Talebzadeh
[... truncated physical plan excerpt ...]
... channel_id#41 ASC NULLS FIRST, promo_id#40 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(prod_id#38, time_id#39, channel_id#41,
promo_id#40, 200), ENSURE_REQUIREMENTS, [id=#37]
   : +- *(1) Scan JDBCRelation((SELECT * FROM sh.costs))
[numPartitions=1]
[PROD_ID#38,TIME_ID#39,PROMO_ID#40,CHANNEL_ID#41,UNIT_COST#42,UNIT_PRICE#43]
PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID),
*IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema: struct<...>


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 Mar 2021 at 05:38, Gaurav Singh  wrote:

> Hi Team,
>
> We have lots of complex oracle views ( containing multiple tables, joins,
> analytical and  aggregate functions, sub queries etc) and we are wondering
> if Spark can help us execute those views faster.
>
> Also we want to know if those complex views can be implemented using Spark
> SQL?
>
> Thanks and regards,
> Gaurav Singh
> +91 8600852256
>
>


Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-22 Thread Mich Talebzadeh
Hi Gaurav,

What version of Spark will you be using?

Have you tried a simple example of reading one of the views through a JDBC
connection to Oracle yourself?
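
A minimal first test could look like this (the URL, credentials and view name
below are placeholders, not from this thread):

# Hedged sketch: read one Oracle view over JDBC into a DataFrame.
view_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1") \
    .option("dbtable", "SCHEMA.SOME_COMPLEX_VIEW") \
    .option("user", "app_user") \
    .option("password", "app_password") \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("fetchsize", 10000) \
    .load()

view_df.printSchema()
view_df.show(10, truncate=False)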

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 Mar 2021 at 05:38, Gaurav Singh  wrote:

> Hi Team,
>
> We have lots of complex oracle views ( containing multiple tables, joins,
> analytical and  aggregate functions, sub queries etc) and we are wondering
> if Spark can help us execute those views faster.
>
> Also we want to know if those complex views can be implemented using Spark
> SQL?
>
> Thanks and regards,
> Gaurav Singh
> +91 8600852256
>
>


[Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-21 Thread Gaurav Singh
Hi Team,

We have lots of complex oracle views ( containing multiple tables, joins,
analytical and  aggregate functions, sub queries etc) and we are wondering
if Spark can help us execute those views faster.

Also we want to know if those complex views can be implemented using Spark
SQL?

Thanks and regards,
Gaurav Singh
+91 8600852256


Re: Using Spark as a fail-over platform for Java app

2021-03-12 Thread Jungtaek Lim
That's what resource managers provide to you. So you can code and deal with
resource managers, but I assume you're finding ways to not deal with
resource managers directly and let Spark do it instead.

I admit I have no experience with this (I did something similar with Apache
Storm on a standalone setup 5+ years ago), but the question can simply be
restated as "making the driver fault-tolerant", since your app logic can run
inside the driver even if you don't do any calculation with Spark. And there
seem to be lots of answers on Google for that question, including this old one:
https://stackoverflow.com/questions/26618464/what-happens-if-the-driver-program-crashes


On Sat, Mar 13, 2021 at 5:21 AM Lalwani, Jayesh 
wrote:

> Can I cut a steak with a hammer? Sure you can, but the steak would taste
> awful
>
>
>
> Do you have organizational/bureaucratic issues with using a Load Balancer?
> Because that’s what you really need. Run your application on multiple nodes
> with a load balancer in front. When a node crashes, the load balancer will
> shift the traffic to the healthy node until the crashed node recovers.
>
>
>
> *From: *Sergey Oboguev 
> *Date: *Friday, March 12, 2021 at 2:53 PM
> *To: *User 
> *Subject: *[EXTERNAL] Using Spark as a fail-over platform for Java app
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> I have an existing plain-Java (non-Spark) application that needs to run in
> a fault-tolerant way, i.e. if the node crashes then the application is
> restarted on another node, and if the application crashes because of
> internal fault, the application is restarted too.
>
> Normally I would run it in a Kubernetes, but in this specific case
> Kubernetes is unavailable because of organizational/bureaucratic issues,
> and the only execution platform available in the domain is Spark.
>
> Is it possible to wrap the application into a Spark-based launcher that
> will take care of executing the application and restarts?
>
> Execution must be in a separate JVM, apart from other apps.
>
> And for optimum performance, the application also needs to be assigned
> guaranteed resources, i.e. the number of cores and amount of RAM required
> for it, so it would be great if the launcher could take care of this too.
>
> Thanks for advice.
>


Re: Using Spark as a fail-over platform for Java app

2021-03-12 Thread Lalwani, Jayesh
Can I cut a steak with a hammer? Sure you can, but the steak would taste awful

Do you have organizational/bureaucratic issues with using a Load Balancer? 
Because that’s what you really need. Run your application on multiple nodes 
with a load balancer in front. When a node crashes, the load balancer will 
shift the traffic to the healthy node until the crashed node recovers.

From: Sergey Oboguev 
Date: Friday, March 12, 2021 at 2:53 PM
To: User 
Subject: [EXTERNAL] Using Spark as a fail-over platform for Java app


CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.


I have an existing plain-Java (non-Spark) application that needs to run in a 
fault-tolerant way, i.e. if the node crashes then the application is restarted 
on another node, and if the application crashes because of internal fault, the 
application is restarted too.

Normally I would run it in a Kubernetes, but in this specific case Kubernetes 
is unavailable because of organizational/bureaucratic issues, and the only 
execution platform available in the domain is Spark.

Is it possible to wrap the application into a Spark-based launcher that will 
take care of executing the application and restarts?

Execution must be in a separate JVM, apart from other apps.

And for optimum performance, the application also needs to be assigned 
guaranteed resources, i.e. the number of cores and amount of RAM required for 
it, so it would be great if the launcher could take care of this too.

Thanks for advice.


Using Spark as a fail-over platform for Java app

2021-03-12 Thread Sergey Oboguev
I have an existing plain-Java (non-Spark) application that needs to run in
a fault-tolerant way, i.e. if the node crashes then the application is
restarted on another node, and if the application crashes because of an
internal fault, the application is restarted too.

Normally I would run it in Kubernetes, but in this specific case
Kubernetes is unavailable because of organizational/bureaucratic issues,
and the only execution platform available in the domain is Spark.

Is it possible to wrap the application into a Spark-based launcher that
will take care of executing the application and restarts?

Execution must be in a separate JVM, apart from other apps.

And for optimum performance, the application also needs to be assigned
guaranteed resources, i.e. the number of cores and amount of RAM required
for it, so it would be great if the launcher could take care of this too.

Thanks for advice.


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
I think a lot will depend on what the scripts do. I've seen some legacy
hive scripts which were written in an awkward way (e.g. lots of subqueries,
nested explodes) because pre-spark it was the only way to express certain
logic. For fairly straightforward operations I expect Catalyst would reduce
both versions of the code to similar plans.

On Tue, Oct 6, 2020 at 12:07 PM Manu Jacob 
wrote:

> Hi All,
>
>
>
> Not sure if I need to ask this question on spark community or hive
> community.
>
>
>
> We have a set of hive scripts that runs on EMR (Tez engine). We would like
> to experiment by moving some of it onto Spark. We are planning to
> experiment with two options.
>
>
>
>1. Use the current code based on HQL, with engine set as spark.
>2. Write pure spark code in scala/python using SparkQL and hive
>integration.
>
>
>
> The first approach helps us to transition to Spark quickly but not sure if
> this is the best approach in terms of performance.  Could not find any
> reasonable comparisons of this two approaches.  It looks like writing pure
> Spark code, gives us more control to add logic and also control some of the
> performance features, for example things like caching/evicting etc.
>
>
>
>
>
> Any advice on this is much appreciated.
>
>
>
>
>
> Thanks,
>
> -Manu
>
>
>


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Ricardo Martinelli de Oliveira
My 2 cents is that this is a complicated question since I'm not confident
that Spark is 100% compatible with Hive in terms of query language. I have
an unanswered question in this list about this:

http://apache-spark-user-list.1001560.n3.nabble.com/Should-SHOW-TABLES-statement-return-a-hive-compatible-output-td38577.html

One thing that is important to check is if you are using the supported
objects in both Hive and Spark. One example is the lack of support for
materialized views in Spark:
https://issues.apache.org/jira/browse/SPARK-29038

With that being said, I'd recommend going with option 2, as this will force
your code to use what Spark offers.

Hope that helps.

On Tue, Oct 6, 2020 at 1:14 PM Manu Jacob 
wrote:

> Hi All,
>
>
>
> Not sure if I need to ask this question on spark community or hive
> community.
>
>
>
> We have a set of hive scripts that runs on EMR (Tez engine). We would like
> to experiment by moving some of it onto Spark. We are planning to
> experiment with two options.
>
>
>
>1. Use the current code based on HQL, with engine set as spark.
>2. Write pure spark code in scala/python using SparkQL and hive
>integration.
>
>
>
> The first approach helps us to transition to Spark quickly but not sure if
> this is the best approach in terms of performance.  Could not find any
> reasonable comparisons of this two approaches.  It looks like writing pure
> Spark code, gives us more control to add logic and also control some of the
> performance features, for example things like caching/evicting etc.
>
>
>
>
>
> Any advice on this is much appreciated.
>
>
>
>
>
> Thanks,
>
> -Manu
>
>
>


-- 

Ricardo Martinelli De Oliveira

Data Engineer, AI CoE

Red Hat Brazil 

Av. Brigadeiro Faria Lima, 3900

8th floor

rmart...@redhat.comT: +551135426125
M: +5511970696531
@redhatjobs    redhatjobs
 @redhatjobs




Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Hi All,

Not sure if I need to ask this question on spark community or hive community.

We have a set of hive scripts that runs on EMR (Tez engine). We would like to 
experiment by moving some of it onto Spark. We are planning to experiment with 
two options.


  1.  Use the current code based on HQL, with engine set as spark.
  2.  Write pure spark code in scala/python using SparkQL and hive integration.

The first approach helps us to transition to Spark quickly, but we are not sure
if it is the best approach in terms of performance; we could not find any
reasonable comparisons of these two approaches. It looks like writing pure
Spark code gives us more control to add logic and also to tune some of the
performance features, for example things like caching/evicting etc.


Any advice on this is much appreciated.


Thanks,
-Manu



Unable to run bash script when using spark-submit in cluster mode.

2020-07-23 Thread Nasrulla Khan Haris
Hi Spark Users,


I am trying to execute a bash script from my Spark app. I can run the below
command without issues from spark-shell; however, when I use it in the Spark
app and submit with spark-submit, the container is not able to find the
directories.

val result = "export LD_LIBRARY_PATH=/ binaries/ && /binaries/generatedata 
simulate -rows 1000 -payload 32 -file MyFile1" !!


Any inputs on how to make the script visible in spark executor ?


Thanks,
Nasrulla



RE: Unable to run bash script when using spark-submit in cluster mode.

2020-07-23 Thread Nasrulla Khan Haris
Are local paths not exposed in containers ?

Thanks,
Nasrulla

From: Nasrulla Khan Haris
Sent: Thursday, July 23, 2020 6:13 PM
To: user@spark.apache.org
Subject: Unable to run bash script when using spark-submit in cluster mode.
Importance: High

Hi Spark Users,


I am trying to execute bash script from my spark app. I can run the below 
command without issues from spark-shell however when I use it in the spark-app 
and submit with spark-submit, container is not able to find the directories.

val result = "export LD_LIBRARY_PATH=/ binaries/ && /binaries/generatedata 
simulate -rows 1000 -payload 32 -file MyFile1" !!


Any inputs on how to make the script visible in spark executor ?


Thanks,
Nasrulla



Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Piyush Acharya
Please try the maxBytesPerTrigger option; probably the files are big enough to
crash the JVM.
Please give some info on the executors and on the files (size etc.).

Regards,
..Piyush

On Sun, Jul 19, 2020 at 3:29 PM Rachana Srivastava
 wrote:

> *Issue:* I am trying to process 5000+ files of gzipped json file
> periodically from S3 using Structured Streaming code.
>
> *Here are the key steps:*
>
>1.
>
>Read json schema and broadccast to executors
>2.
>
>Read Stream
>
>Dataset inputDS = sparkSession.readStream() .format("text")
>.option("inferSchema", "true") .option("header", "true")
>.option("multiLine", true).schema(jsonSchema) .option("mode", "PERMISSIVE")
>.json(inputPath + "/*");
>3.
>
>Process each file in a map Dataset ds = inputDS.map(x -> { ... },
>Encoders.STRING());
>4.
>
>Write output to S3
>
>StreamingQuery query = ds .coalesce(1) .writeStream()
>.outputMode("append") .format("csv") ... .start();
>
> *maxFilesPerTrigger* is set to 500 so I was hoping the streaming will
> pick only that many file to process. Why are we getting OOM? If in a we
> have more than 3500 files then system crashes with OOM.
>
>


Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Sanjeev Mishra
Can you reduce maxFilesPerTrigger further and see if the OOM still persists?
If it does, then the problem may be somewhere else.

> On Jul 19, 2020, at 5:37 AM, Jungtaek Lim  
> wrote:
> 
> Please provide logs and dump file for the OOM case - otherwise no one could 
> say what's the cause.
> 
> Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath="...dir..."
> 
> On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava 
>  wrote:
> Issue: I am trying to process 5000+ files of gzipped json file periodically 
> from S3 using Structured Streaming code. 
> 
> Here are the key steps:
> Read json schema and broadccast to executors
> Read Stream
> 
> Dataset inputDS = sparkSession.readStream() .format("text") 
> .option("inferSchema", "true") .option("header", "true") .option("multiLine", 
> true).schema(jsonSchema) .option("mode", "PERMISSIVE") .json(inputPath + 
> "/*");
> Process each file in a map Dataset ds = inputDS.map(x -> { ... }, 
> Encoders.STRING());
> Write output to S3
> 
> StreamingQuery query = ds .coalesce(1) .writeStream() .outputMode("append") 
> .format("csv") ... .start();
> maxFilesPerTrigger is set to 500 so I was hoping the streaming will pick only 
> that many file to process. Why are we getting OOM? If in a we have more than 
> 3500 files then system crashes with OOM.
> 
> 



Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Jungtaek Lim
Please provide logs and dump file for the OOM case - otherwise no one could
say what's the cause.

Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath="...dir..."
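
As a sketch (paths below are placeholders, not from this thread), the executor
side can be wired through Spark conf; the driver option normally has to be
passed at launch time via spark-submit --conf or spark-defaults.conf:

# Hedged sketch: attach heap-dump flags to the executors via Spark config.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oom-diagnosis")
         .config("spark.executor.extraJavaOptions",
                 "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdumps")
         # spark.driver.extraJavaOptions only takes effect for the driver JVM when
         # set at submit time (spark-submit --conf) or in spark-defaults.conf.
         .getOrCreate())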

On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava
 wrote:

> *Issue:* I am trying to process 5000+ files of gzipped json file
> periodically from S3 using Structured Streaming code.
>
> *Here are the key steps:*
>
>1.
>
>Read json schema and broadccast to executors
>2.
>
>Read Stream
>
>Dataset inputDS = sparkSession.readStream() .format("text")
>.option("inferSchema", "true") .option("header", "true")
>.option("multiLine", true).schema(jsonSchema) .option("mode", "PERMISSIVE")
>.json(inputPath + "/*");
>3.
>
>Process each file in a map Dataset ds = inputDS.map(x -> { ... },
>Encoders.STRING());
>4.
>
>Write output to S3
>
>StreamingQuery query = ds .coalesce(1) .writeStream()
>.outputMode("append") .format("csv") ... .start();
>
> *maxFilesPerTrigger* is set to 500 so I was hoping the streaming will
> pick only that many file to process. Why are we getting OOM? If in a we
> have more than 3500 files then system crashes with OOM.
>
>


OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Rachana Srivastava
Issue: I am trying to process 5000+ gzipped json files periodically from S3
using Structured Streaming code.

Here are the key steps:

1. Read the json schema and broadcast it to executors.

2. Read the stream:

   Dataset inputDS = sparkSession.readStream() .format("text")
   .option("inferSchema", "true") .option("header", "true") .option("multiLine",
   true).schema(jsonSchema) .option("mode", "PERMISSIVE") .json(inputPath + "/*");

3. Process each file in a map:

   Dataset ds = inputDS.map(x -> { ... }, Encoders.STRING());

4. Write the output to S3:

   StreamingQuery query = ds .coalesce(1) .writeStream() .outputMode("append")
   .format("csv") ... .start();

maxFilesPerTrigger is set to 500, so I was hoping the stream would pick only
that many files to process. Why are we getting OOM? If we have more than
3500 files, the system crashes with OOM.



Re: Issue in parallelization of CNN model using spark

2020-07-17 Thread Mukhtaj Khan
Dear All
Thanks all of you for your replies.
I am trying to parallelize the CNN model using the Keras2DML library; however,
I am getting the error message "No module named systemml.mllearn". Can anybody
guide me on how to install SystemML on Ubuntu?

best regards


On Tue, Jul 14, 2020 at 4:34 AM Anwar AliKhan 
wrote:

> This is very useful for me leading on from week4 of the Andrew Ng course.
>
>
> On Mon, 13 Jul 2020, 15:18 Sean Owen,  wrote:
>
>> There is a multilayer perceptron implementation in Spark ML, but
>> that's not what you're looking for.
>> To parallelize model training developed using standard libraries like
>> Keras, use Horovod from Uber.
>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>
>> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
>> >
>> > Dear Spark User
>> >
>> > I am trying to parallelize the CNN (convolutional neural network) model
>> using spark. I have developed the model using python and Keras library. The
>> model works fine on a single machine but when we try on multiple machines,
>> the execution time remains the same as sequential.
>> > Could you please tell me that there is any built-in library for CNN to
>> parallelize in spark framework. Moreover, MLLIB does not have any support
>> for CNN.
>> > Best regards
>> > Mukhtaj
>> >
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: “Pyspark.zip does not exist” using Spark in cluster mode with Yarn

2020-07-16 Thread Hulio andres
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10795


https://stackoverflow.com/questions/34632617/spark-python-submission-error-file-does-not-exist-pyspark-zip



 Sent: Thursday, July 16, 2020 at 6:54 PM
> From: "Davide Curcio" 
> To: "user@spark.apache.org" 
> Subject: “Pyspark.zip does not exist” using Spark in cluster mode with Yarn
>
> I'm trying to run some Spark script in cluster mode using Yarn but I've 
> always obtained this error. I read in other similar question that the cause 
> can be:
> 
> "Local" set up hard-coded as a master but I don't have it
> HADOOP_CONF_DIR environment variable that's wrong inside spark-env.sh but it 
> seems right
> I've tried with every code, even simple code but it still doesn't work, even 
> though in local mode they work.
> 
> Here is my log when I try to execute the code:
> 
> spark/bin/spark-submit --deploy-mode cluster --master yarn ~/prova7.py
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 20/07/16 16:10:27 INFO Client: Requesting a new application from cluster with 
> 2 NodeManagers
> 20/07/16 16:10:27 INFO Client: Verifying our application has not requested 
> more than the maximum memory capability of the cluster (1536 MB per container)
> 20/07/16 16:10:27 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 20/07/16 16:10:27 INFO Client: Setting up container launch context for our AM
> 20/07/16 16:10:27 INFO Client: Setting up the launch environment for our AM 
> container
> 20/07/16 16:10:27 INFO Client: Preparing resources for our AM container
> 20/07/16 16:10:27 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 20/07/16 16:10:31 INFO Client: Uploading resource 
> file:/tmp/spark-750fb229-4166--9c69-eb90e9a2318d/__spark_libs__4588035472069967339.zip
>  -> 
> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/__spark_libs__4588035472069967339.zip
> 20/07/16 16:10:31 INFO Client: Uploading resource file:/home/ubuntu/prova7.py 
> -> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/prova7.py
> 20/07/16 16:10:31 INFO Client: Uploading resource 
> file:/home/ubuntu/spark/python/lib/pyspark.zip -> 
> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip
> 20/07/16 16:10:31 INFO Client: Uploading resource 
> file:/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip -> 
> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/py4j-0.10.7-src.zip
> 20/07/16 16:10:32 INFO Client: Uploading resource 
> file:/tmp/spark-750fb229-4166--9c69-eb90e9a2318d/__spark_conf__1291791519024875749.zip
>  -> 
> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/__spark_conf__.zip
> 20/07/16 16:10:32 INFO SecurityManager: Changing view acls to: ubuntu
> 20/07/16 16:10:32 INFO SecurityManager: Changing modify acls to: ubuntu
> 20/07/16 16:10:32 INFO SecurityManager: Changing view acls groups to:
> 20/07/16 16:10:32 INFO SecurityManager: Changing modify acls groups to:
> 20/07/16 16:10:32 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups 
> with view permissions: Set(); users  with modify permissions: Set(ubuntu); 
> groups with modify permissions: Set()
> 20/07/16 16:10:33 INFO Client: Submitting application 
> application_1594914119543_0010 to ResourceManager
> 20/07/16 16:10:33 INFO YarnClientImpl: Submitted application 
> application_1594914119543_0010
> 20/07/16 16:10:34 INFO Client: Application report for 
> application_1594914119543_0010 (state: FAILED)
> 20/07/16 16:10:34 INFO Client:
>  client token: N/A
>  diagnostics: Application application_1594914119543_0010 failed 2 times 
> due to AM Container for appattempt_1594914119543_0010_02 exited with  
> exitCode: -1000
> Failing this attempt.Diagnostics: [2020-07-16 16:10:34.391]File 
> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip 
> does not exist
> java.io.FileNotFoundException: File 
> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip 
> does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFi

“Pyspark.zip does not exist” using Spark in cluster mode with Yarn

2020-07-16 Thread Davide Curcio
I'm trying to run a Spark script in cluster mode using Yarn but I always
obtain this error. I read in other similar questions that the cause can be:

- "local" hard-coded as the master, but I don't have that
- a wrong HADOOP_CONF_DIR environment variable inside spark-env.sh, but it
  seems right

I've tried with every piece of code, even simple code, but it still doesn't
work, even though in local mode it works.

Here is my log when I try to execute the code:

spark/bin/spark-submit --deploy-mode cluster --master yarn ~/prova7.py
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/16 16:10:27 INFO Client: Requesting a new application from cluster with 2 
NodeManagers
20/07/16 16:10:27 INFO Client: Verifying our application has not requested more 
than the maximum memory capability of the cluster (1536 MB per container)
20/07/16 16:10:27 INFO Client: Will allocate AM container, with 896 MB memory 
including 384 MB overhead
20/07/16 16:10:27 INFO Client: Setting up container launch context for our AM
20/07/16 16:10:27 INFO Client: Setting up the launch environment for our AM 
container
20/07/16 16:10:27 INFO Client: Preparing resources for our AM container
20/07/16 16:10:27 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
20/07/16 16:10:31 INFO Client: Uploading resource 
file:/tmp/spark-750fb229-4166--9c69-eb90e9a2318d/__spark_libs__4588035472069967339.zip
 -> 
file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/__spark_libs__4588035472069967339.zip
20/07/16 16:10:31 INFO Client: Uploading resource file:/home/ubuntu/prova7.py 
-> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/prova7.py
20/07/16 16:10:31 INFO Client: Uploading resource 
file:/home/ubuntu/spark/python/lib/pyspark.zip -> 
file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip
20/07/16 16:10:31 INFO Client: Uploading resource 
file:/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip -> 
file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/py4j-0.10.7-src.zip
20/07/16 16:10:32 INFO Client: Uploading resource 
file:/tmp/spark-750fb229-4166--9c69-eb90e9a2318d/__spark_conf__1291791519024875749.zip
 -> 
file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/__spark_conf__.zip
20/07/16 16:10:32 INFO SecurityManager: Changing view acls to: ubuntu
20/07/16 16:10:32 INFO SecurityManager: Changing modify acls to: ubuntu
20/07/16 16:10:32 INFO SecurityManager: Changing view acls groups to:
20/07/16 16:10:32 INFO SecurityManager: Changing modify acls groups to:
20/07/16 16:10:32 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups 
with view permissions: Set(); users  with modify permissions: Set(ubuntu); 
groups with modify permissions: Set()
20/07/16 16:10:33 INFO Client: Submitting application 
application_1594914119543_0010 to ResourceManager
20/07/16 16:10:33 INFO YarnClientImpl: Submitted application 
application_1594914119543_0010
20/07/16 16:10:34 INFO Client: Application report for 
application_1594914119543_0010 (state: FAILED)
20/07/16 16:10:34 INFO Client:
 client token: N/A
 diagnostics: Application application_1594914119543_0010 failed 2 times due 
to AM Container for appattempt_1594914119543_0010_02 exited with  exitCode: 
-1000
Failing this attempt.Diagnostics: [2020-07-16 16:10:34.391]File 
file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip does 
not exist
java.io.FileNotFoundException: File 
file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:641)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalize

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Anwar AliKhan
Ok, thanks.
You can buy it here

https://www.amazon.com/s?k=hands+on+machine+learning+with+scikit-learn+and+tensorflow+2&crid=2U0P9XVIJ790T&sprefix=Hands+on+machine+%2Caps%2C246&ref=nb_sb_ss_i_1_17

This book is like an accompaniment to the Andrew Ng course on Coursera.
It uses the exact same mathematical notation, examples, etc., so it is a smooth
transition from that course.




On Tue, 14 Jul 2020, 15:52 Sean Owen,  wrote:

> It is still copyrighted material, no matter its state of editing. Yes,
> you should not be sharing this on the internet.
>
> On Tue, Jul 14, 2020 at 9:46 AM Anwar AliKhan 
> wrote:
> >
> > Please note It is freely available because it is an early unedited raw
> edition.
> > It is not 100% complete , it is not entirely same as yours.
> > So it is not piracy.
> > I agree it is a good book.
> >
>


Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Sean Owen
It is still copyrighted material, no matter its state of editing. Yes,
you should not be sharing this on the internet.

On Tue, Jul 14, 2020 at 9:46 AM Anwar AliKhan  wrote:
>
> Please note It is freely available because it is an early unedited raw 
> edition.
> It is not 100% complete , it is not entirely same as yours.
> So it is not piracy.
> I agree it is a good book.
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Anwar AliKhan
Please note it is freely available because it is an early, unedited raw
edition.
It is not 100% complete, and it is not entirely the same as yours.
So it is not piracy.
I agree it is a good book.







On Tue, 14 Jul 2020, 14:30 Patrick McCarthy, 
wrote:

> Please don't advocate for piracy, this book is not freely available.
>
> I own it and it's wonderful, Mr. Géron deserves to benefit from it.
>
> On Mon, Jul 13, 2020 at 9:59 PM Anwar AliKhan 
> wrote:
>
>>  link to a free book  which may be useful.
>>
>> Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow
>> Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien
>> Géron
>>
>> https://bit.ly/2zxueGt
>>
>>
>>
>>
>>
>>  13 Jul 2020, 15:18 Sean Owen,  wrote:
>>
>>> There is a multilayer perceptron implementation in Spark ML, but
>>> that's not what you're looking for.
>>> To parallelize model training developed using standard libraries like
>>> Keras, use Horovod from Uber.
>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>
>>> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan 
>>> wrote:
>>> >
>>> > Dear Spark User
>>> >
>>> > I am trying to parallelize the CNN (convolutional neural network)
>>> model using spark. I have developed the model using python and Keras
>>> library. The model works fine on a single machine but when we try on
>>> multiple machines, the execution time remains the same as sequential.
>>> > Could you please tell me that there is any built-in library for CNN to
>>> parallelize in spark framework. Moreover, MLLIB does not have any support
>>> for CNN.
>>> > Best regards
>>> > Mukhtaj
>>> >
>>> >
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Patrick McCarthy
Please don't advocate for piracy, this book is not freely available.

I own it and it's wonderful, Mr. Géron deserves to benefit from it.

On Mon, Jul 13, 2020 at 9:59 PM Anwar AliKhan 
wrote:

>  link to a free book  which may be useful.
>
> Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow
> Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien
> Géron
>
> https://bit.ly/2zxueGt
>
>
>
>
>
>  13 Jul 2020, 15:18 Sean Owen,  wrote:
>
>> There is a multilayer perceptron implementation in Spark ML, but
>> that's not what you're looking for.
>> To parallelize model training developed using standard libraries like
>> Keras, use Horovod from Uber.
>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>
>> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
>> >
>> > Dear Spark User
>> >
>> > I am trying to parallelize the CNN (convolutional neural network) model
>> using spark. I have developed the model using python and Keras library. The
>> model works fine on a single machine but when we try on multiple machines,
>> the execution time remains the same as sequential.
>> > Could you please tell me that there is any built-in library for CNN to
>> parallelize in spark framework. Moreover, MLLIB does not have any support
>> for CNN.
>> > Best regards
>> > Mukhtaj
>> >
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Anwar AliKhan
 link to a free book  which may be useful.

Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow
Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien
Géron

https://bit.ly/2zxueGt





 13 Jul 2020, 15:18 Sean Owen,  wrote:

> There is a multilayer perceptron implementation in Spark ML, but
> that's not what you're looking for.
> To parallelize model training developed using standard libraries like
> Keras, use Horovod from Uber.
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
> >
> > Dear Spark User
> >
> > I am trying to parallelize the CNN (convolutional neural network) model
> using spark. I have developed the model using python and Keras library. The
> model works fine on a single machine but when we try on multiple machines,
> the execution time remains the same as sequential.
> > Could you please tell me that there is any built-in library for CNN to
> parallelize in spark framework. Moreover, MLLIB does not have any support
> for CNN.
> > Best regards
> > Mukhtaj
> >
> >
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Anwar AliKhan
This is very useful for me leading on from week4 of the Andrew Ng course.


On Mon, 13 Jul 2020, 15:18 Sean Owen,  wrote:

> There is a multilayer perceptron implementation in Spark ML, but
> that's not what you're looking for.
> To parallelize model training developed using standard libraries like
> Keras, use Horovod from Uber.
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
> >
> > Dear Spark User
> >
> > I am trying to parallelize the CNN (convolutional neural network) model
> using spark. I have developed the model using python and Keras library. The
> model works fine on a single machine but when we try on multiple machines,
> the execution time remains the same as sequential.
> > Could you please tell me that there is any built-in library for CNN to
> parallelize in spark framework. Moreover, MLLIB does not have any support
> for CNN.
> > Best regards
> > Mukhtaj
> >
> >
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Using Spark UI with Running Spark on Hadoop Yarn

2020-07-13 Thread ArtemisDev
Is there any way to make the Spark process visible via the Spark UI when 
running Spark 3.0 on a Hadoop YARN cluster?  The Spark documentation 
talked about replacing the Spark UI with the Spark history server, but 
didn't give many details.  Therefore I would assume it is still possible 
to use the Spark UI when running Spark on a Hadoop YARN cluster.  Is this 
correct?  Does the Spark history server have the same user functions as 
the Spark UI?


But how could this be possible (the possibility of using Spark UI) if 
the Spark master server isn't active when all the job scheduling and 
resource allocation tasks are replaced by yarn servers?


Thanks!

-- ND


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Sean Owen
There is a multilayer perceptron implementation in Spark ML, but
that's not what you're looking for.
To parallelize model training developed using standard libraries like
Keras, use Horovod from Uber.
https://horovod.readthedocs.io/en/stable/spark_include.html
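
A rough sketch of the Horovod-on-Spark wiring (the toy model and data below
are placeholders, not the poster's CNN; see the linked docs for the real
setup). It is meant to run inside an active Spark session, e.g. via
spark-submit:

import horovod.spark

def train_fn():
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()
    # Toy stand-in for the real CNN and data, just to show the wiring.
    x = np.random.rand(256, 28, 28, 1).astype("float32")
    y = np.random.randint(0, 10, 256)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    # Horovod wraps the optimizer so gradients are averaged across the workers.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)
    model.fit(x, y, batch_size=32, epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)
    return model.count_params()

# Launches train_fn as 2 parallel Spark tasks that Horovod coordinates.
print(horovod.spark.run(train_fn, num_proc=2))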

On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
>
> Dear Spark User
>
> I am trying to parallelize the CNN (convolutional neural network) model using 
> spark. I have developed the model using python and Keras library. The model 
> works fine on a single machine but when we try on multiple machines, the 
> execution time remains the same as sequential.
> Could you please tell me that there is any built-in library for CNN to 
> parallelize in spark framework. Moreover, MLLIB does not have any support for 
> CNN.
> Best regards
> Mukhtaj
>
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Juan Martín Guillén
Hi Mukhtaj,

Parallelization in Spark is abstracted over the DataFrame.
You can run anything locally on the driver, but to make it run in parallel on
the cluster you'll need to use the DataFrame abstraction.
You may want to check maxpumperla/elephas:

maxpumperla/elephas - Distributed Deep learning with Keras & Spark.

Regards,
Juan Martín.


On Monday, July 13, 2020, 08:59:35 ART, Mukhtaj Khan  wrote:

Dear Spark User

I am trying to parallelize the CNN (convolutional neural network) model using
spark. I have developed the model using python and the Keras library. The
model works fine on a single machine, but when we try on multiple machines
the execution time remains the same as sequential. Could you please tell me
whether there is any built-in library for CNN to parallelize in the spark
framework. Moreover, MLlib does not have any support for CNN.

Best regards
Mukhtaj

Issue in parallelization of CNN model using spark

2020-07-13 Thread Mukhtaj Khan
Dear Spark User

I am trying to parallelize the CNN (convolutional neural network) model
using spark. I have developed the model using python and Keras library. The
model works fine on a single machine but when we try on multiple machines,
the execution time remains the same as sequential.
Could you please tell me whether there is any built-in library for CNN to
parallelize in the spark framework. Moreover, MLlib does not have any support
for CNN.
Best regards
Mukhtaj


Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Daniel de Oliveira Mantovani
Hi Teja,

To access Hive 3 using Apache Spark 2.x.x you need to use this connector
from Cloudera
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
.
It has many limitations: you can only write to Hive managed tables in
ORC format. But you can mitigate this problem by writing to Hive unmanaged
(external) tables, so Parquet will work.
The performance is also not the same.
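
A rough sketch of the HWC usage pattern from that documentation (table names
are placeholders; the exact options and classpath setup depend on the HDP/CDP
version, so treat this as an outline to verify against the linked docs):

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-example").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

# Read a Hive 3 managed table through the connector.
df = hive.executeQuery("SELECT * FROM db.some_managed_table")
df.show(5)

# Write back; managed tables are ORC-only, unmanaged/external tables can be Parquet.
df.write \
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
  .option("table", "db.target_table") \
  .save()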

Good luck


On Mon, Jul 6, 2020 at 3:16 PM Sean Owen  wrote:

> 2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work
> connecting to Hadoop 3 / Hive 3; it's possible in a few cases.
> It's also possible some vendor distributions support this combination.
>
> On Mon, Jul 6, 2020 at 7:51 AM Teja  wrote:
> >
> > We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
> > Metastore version 2.3. But the Cluster managing team has decided to
> upgrade
> > to Hadoop 3.x and Hive 3.x. We could not migrate to spark 3 yet, which is
> > compatible with Hadoop 3 and Hive 3, as we could not test if anything
> > breaks.
> >
> > *Is there any possible way to stick to spark 2.4.x version and still be
> able
> > to use Hadoop 3 and Hive 3?
> > *
> >
> > I got to know backporting is one option but I am not sure how. It would
> be
> > great if you could point me in that direction.
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 

--
Daniel Mantovani


Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Sean Owen
2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work
connecting to Hadoop 3 / Hive 3; it's possible in a few cases.
It's also possible some vendor distributions support this combination.

On Mon, Jul 6, 2020 at 7:51 AM Teja  wrote:
>
> We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
> Metastore version 2.3. But the Cluster managing team has decided to upgrade
> to Hadoop 3.x and Hive 3.x. We could not migrate to spark 3 yet, which is
> compatible with Hadoop 3 and Hive 3, as we could not test if anything
> breaks.
>
> *Is there any possible way to stick to spark 2.4.x version and still be able
> to use Hadoop 3 and Hive 3?
> *
>
> I got to know backporting is one option but I am not sure how. It would be
> great if you could point me in that direction.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Teja
We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
Metastore version 2.3. But the Cluster managing team has decided to upgrade
to Hadoop 3.x and Hive 3.x. We could not migrate to spark 3 yet, which is
compatible with Hadoop 3 and Hive 3, as we could not test if anything
breaks.

*Is there any possible way to stick to spark 2.4.x version and still be able
to use Hadoop 3 and Hive 3?
*

I got to know backporting is one option but I am not sure how. It would be
great if you could point me in that direction.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Something Something
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > On Thu, May 28, 2020 at 7:04 PM ZHANG Wei <
>>>>>> wezh...@outlook.com> wrote:
>>>>>> > > > >
>>>>>> > > > > > May I get how the accumulator is accessed in the method
>>>>>> > > > > > `onQueryProgress()`?
>>>>>> > > > > >
>>>>>> > > > > > AFAICT, the accumulator is incremented well. There is a way
>>>>>> to verify
>>>>>> > > > that
>>>>>> > > > > > in cluster like this:
>>>>>> > > > > > ```
>>>>>> > > > > > // Add the following while loop before invoking
>>>>>> awaitTermination
>>>>>> > > > > > while (true) {
>>>>>> > > > > >   println("My acc: " + myAcc.value)
>>>>>> > > > > >   Thread.sleep(5 * 1000)
>>>>>> > > > > > }
>>>>>> > > > > >
>>>>>> > > > > > //query.awaitTermination()
>>>>>> > > > > > ```
>>>>>> > > > > >
>>>>>> > > > > > And the accumulator value updated can be found from driver
>>>>>> stdout.
>>>>>> > > > > >
>>>>>> > > > > > --
>>>>>> > > > > > Cheers,
>>>>>> > > > > > -z
>>>>>> > > > > >
>>>>>> > > > > > On Thu, 28 May 2020 17:12:48 +0530
>>>>>> > > > > > Srinivas V  wrote:
>>>>>> > > > > >
>>>>>> > > > > > > yes, I am using stateful structured streaming. Yes
>>>>>> similar to what
>>>>>> > > > you
>>>>>> > > > > > do.
>>>>>> > > > > > > This is in Java
>>>>>> > > > > > > I do it this way:
>>>>>> > > > > > > Dataset productUpdates = watermarkedDS
>>>>>> > > > > > > .groupByKey(
>>>>>> > > > > > >     (MapFunction>>>>> String>) event
>>>>>> > > > ->
>>>>>> > > > > > > event.getId(), Encoders.STRING())
>>>>>> > > > > > > .mapGroupsWithState(
>>>>>> > > > > > > new
>>>>>> > > > > > >
>>>>>> > > > > >
>>>>>> > > >
>>>>>> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
>>>>>> > > > > > > appConfig, accumulators),
>>>>>> > > > > > >
>>>>>>  Encoders.bean(ModelStateInfo.class),
>>>>>> > > > > > > Encoders.bean(ModelUpdate.class),
>>>>>> > > > > > >
>>>>>>  GroupStateTimeout.ProcessingTimeTimeout());
>>>>>> > > > > > >
>>>>>> > > > > > > StateUpdateTask contains the update method.
>>>>>> > > > > > >
>>>>>> > > > > > > On Thu, May 28, 2020 at 4:41 AM Something Something <
>>>>>> > > > > > > mailinglist...@gmail.com> wrote:
>>>>>> > > > > > >
>>>>>> > > > > > > > Yes, that's exactly how I am creating them.
>>>>>> > > > > > > >
>>>>>> > > > > > > > Question... Are you using 'Stateful Structured
>>>>>> Streaming' in which
>>>>>> > > > > > you've
>>>>>> > > > > > > > something like this?
>>>>>> > > > > > > >
>>>>>> > > > > > > >
>>>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>>>> > > > > > > > updateAcro

Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Srinivas V
> > > > > --
>>>>> > > > > > Cheers,
>>>>> > > > > > -z
>>>>> > > > > >
>>>>> > > > > > On Thu, 28 May 2020 17:12:48 +0530
>>>>> > > > > > Srinivas V  wrote:
>>>>> > > > > >
>>>>> > > > > > > yes, I am using stateful structured streaming. Yes similar
>>>>> to what
>>>>> > > > you
>>>>> > > > > > do.
>>>>> > > > > > > This is in Java
>>>>> > > > > > > I do it this way:
>>>>> > > > > > > Dataset productUpdates = watermarkedDS
>>>>> > > > > > > .groupByKey(
>>>>> > > > > > > (MapFunction>>>> String>) event
>>>>> > > > ->
>>>>> > > > > > > event.getId(), Encoders.STRING())
>>>>> > > > > > > .mapGroupsWithState(
>>>>> > > > > > > new
>>>>> > > > > > >
>>>>> > > > > >
>>>>> > > >
>>>>> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
>>>>> > > > > > > appConfig, accumulators),
>>>>> > > > > > >
>>>>>  Encoders.bean(ModelStateInfo.class),
>>>>> > > > > > > Encoders.bean(ModelUpdate.class),
>>>>> > > > > > >
>>>>>  GroupStateTimeout.ProcessingTimeTimeout());
>>>>> > > > > > >
>>>>> > > > > > > StateUpdateTask contains the update method.
>>>>> > > > > > >
>>>>> > > > > > > On Thu, May 28, 2020 at 4:41 AM Something Something <
>>>>> > > > > > > mailinglist...@gmail.com> wrote:
>>>>> > > > > > >
>>>>> > > > > > > > Yes, that's exactly how I am creating them.
>>>>> > > > > > > >
>>>>> > > > > > > > Question... Are you using 'Stateful Structured
>>>>> Streaming' in which
>>>>> > > > > > you've
>>>>> > > > > > > > something like this?
>>>>> > > > > > > >
>>>>> > > > > > > >
>>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>>> > > > > > > > updateAcrossEvents
>>>>> > > > > > > >   )
>>>>> > > > > > > >
>>>>> > > > > > > > And updating the Accumulator inside
>>>>> 'updateAcrossEvents'? We're
>>>>> > > > > > experiencing this only under 'Stateful Structured
>>>>> Streaming'. In other
>>>>> > > > > > streaming applications it works as expected.
>>>>> > > > > > > >
>>>>> > > > > > > >
>>>>> > > > > > > >
>>>>> > > > > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V <
>>>>> srini@gmail.com>
>>>>> > > > > > wrote:
>>>>> > > > > > > >
>>>>> > > > > > > >> Yes, I am talking about Application specific
>>>>> Accumulators.
>>>>> > > > Actually I
>>>>> > > > > > am
>>>>> > > > > > > >> getting the values printed in my driver log as well as
>>>>> sent to
>>>>> > > > > > Grafana. Not
>>>>> > > > > > > >> sure where and when I saw 0 before. My deploy mode is
>>>>> “client” on
>>>>> > > > a
>>>>> > > > > > yarn
>>>>> > > > > > > >> cluster(not local Mac) where I submit from master node.
>>>>> It should
>>>>> > > > > > work the
>>>>> > > > > > > >> same for cluster mode as well.
>>>>

Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Something Something
> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
>>>> > > > > > > appConfig, accumulators),
>>>> > > > > > > Encoders.bean(ModelStateInfo.class),
>>>> > > > > > > Encoders.bean(ModelUpdate.class),
>>>> > > > > > >
>>>>  GroupStateTimeout.ProcessingTimeTimeout());
>>>> > > > > > >
>>>> > > > > > > StateUpdateTask contains the update method.
>>>> > > > > > >
>>>> > > > > > > On Thu, May 28, 2020 at 4:41 AM Something Something <
>>>> > > > > > > mailinglist...@gmail.com> wrote:
>>>> > > > > > >
>>>> > > > > > > > Yes, that's exactly how I am creating them.
>>>> > > > > > > >
>>>> > > > > > > > Question... Are you using 'Stateful Structured Streaming'
>>>> in which
>>>> > > > > > you've
>>>> > > > > > > > something like this?
>>>> > > > > > > >
>>>> > > > > > > >
>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>> > > > > > > > updateAcrossEvents
>>>> > > > > > > >   )
>>>> > > > > > > >
>>>> > > > > > > > And updating the Accumulator inside 'updateAcrossEvents'?
>>>> We're
>>>> > > > > > experiencing this only under 'Stateful Structured Streaming'.
>>>> In other
>>>> > > > > > streaming applications it works as expected.
>>>> > > > > > > >
>>>> > > > > > > >
>>>> > > > > > > >
>>>> > > > > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V <
>>>> srini@gmail.com>
>>>> > > > > > wrote:
>>>> > > > > > > >
>>>> > > > > > > >> Yes, I am talking about Application specific
>>>> Accumulators.
>>>> > > > Actually I
>>>> > > > > > am
>>>> > > > > > > >> getting the values printed in my driver log as well as
>>>> sent to
>>>> > > > > > Grafana. Not
>>>> > > > > > > >> sure where and when I saw 0 before. My deploy mode is
>>>> “client” on
>>>> > > > a
>>>> > > > > > yarn
>>>> > > > > > > >> cluster(not local Mac) where I submit from master node.
>>>> It should
>>>> > > > > > work the
>>>> > > > > > > >> same for cluster mode as well.
>>>> > > > > > > >> Create accumulators like this:
>>>> > > > > > > >> AccumulatorV2 accumulator =
>>>> sparkContext.longAccumulator(name);
>>>> > > > > > > >>
>>>> > > > > > > >>
>>>> > > > > > > >> On Tue, May 26, 2020 at 8:42 PM Something Something <
>>>> > > > > > > >> mailinglist...@gmail.com> wrote:
>>>> > > > > > > >>
>>>> > > > > > > >>> Hmm... how would they go to Graphana if they are not
>>>> getting
>>>> > > > > > computed in
>>>> > > > > > > >>> your code? I am talking about the Application Specific
>>>> > > > Accumulators.
>>>> > > > > > The
>>>> > > > > > > >>> other standard counters such as
>>>> > > > 'event.progress.inputRowsPerSecond'
>>>> > > > > > are
>>>> > > > > > > >>> getting populated correctly!
>>>> > > > > > > >>>
>>>> > > > > > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V <
>>>> srini@gmail.com>
>>>> > > > > > wrote:
>>>> > > > > > > >>>
>>>> > > > > > > >>>> Hello,
>>>> > > > > > > >>>> Even for me it comes as 0 when I print in
>>>> OnQueryProgress. I use
>>>> > > > > > > >>>> LongAccumulator as well. Yes, it prints on my local
>>>> but not on
>>>> > > > > > cluster.
>>>> > > > > > > >>>> But one consolation is that when I send metrics to
>>>> Graphana, the
>>>> > > > > > values
>>>> > > > > > > >>>> are coming there.
>>>> > > > > > > >>>>
>>>> > > > > > > >>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>> > > > > > > >>>> mailinglist...@gmail.com> wrote:
>>>> > > > > > > >>>>
>>>> > > > > > > >>>>> No this is not working even if I use LongAccumulator.
>>>> > > > > > > >>>>>
>>>> > > > > > > >>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei <
>>>> wezh...@outlook.com
>>>> > > > >
>>>> > > > > > wrote:
>>>> > > > > > > >>>>>
>>>> > > > > > > >>>>>> There is a restriction in AccumulatorV2 API [1], the
>>>> OUT type
>>>> > > > > > should
>>>> > > > > > > >>>>>> be atomic or thread safe. I'm wondering if the
>>>> implementation
>>>> > > > for
>>>> > > > > > > >>>>>> `java.util.Map[T, Long]` can meet it or not. Is
>>>> there any
>>>> > > > chance
>>>> > > > > > to replace
>>>> > > > > > > >>>>>> CollectionLongAccumulator by
>>>> CollectionAccumulator[2] or
>>>> > > > > > LongAccumulator[3]
>>>> > > > > > > >>>>>> and test if the StreamingListener and other codes
>>>> are able to
>>>> > > > > > work?
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> ---
>>>> > > > > > > >>>>>> Cheers,
>>>> > > > > > > >>>>>> -z
>>>> > > > > > > >>>>>> [1]
>>>> > > > > > > >>>>>>
>>>> > > > > >
>>>> > > >
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>>> > > > > > > >>>>>> [2]
>>>> > > > > > > >>>>>>
>>>> > > > > >
>>>> > > >
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>>> > > > > > > >>>>>> [3]
>>>> > > > > > > >>>>>>
>>>> > > > > >
>>>> > > >
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> 
>>>> > > > > > > >>>>>> From: Something Something 
>>>> > > > > > > >>>>>> Sent: Saturday, May 16, 2020 0:38
>>>> > > > > > > >>>>>> To: spark-user
>>>> > > > > > > >>>>>> Subject: Re: Using Spark Accumulators with Structured
>>>> > > > Streaming
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> Can someone from Spark Development team tell me if
>>>> this
>>>> > > > > > functionality
>>>> > > > > > > >>>>>> is supported and tested? I've spent a lot of time on
>>>> this but
>>>> > > > > > can't get it
>>>> > > > > > > >>>>>> to work. Just to add more context, we've our own
>>>> Accumulator
>>>> > > > > > class that
>>>> > > > > > > >>>>>> extends from AccumulatorV2. In this class we keep
>>>> track of
>>>> > > > one or
>>>> > > > > > more
>>>> > > > > > > >>>>>> accumulators. Here's the definition:
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> class CollectionLongAccumulator[T]
>>>> > > > > > > >>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> When the job begins we register an instance of this
>>>> class:
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> spark.sparkContext.register(myAccumulator,
>>>> "MyAccumulator")
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> Is this working under Structured Streaming?
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> I will keep looking for alternate approaches but any
>>>> help
>>>> > > > would be
>>>> > > > > > > >>>>>> greatly appreciated. Thanks.
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
>>>> > > > > > > >>>>>> mailinglist...@gmail.com>>> mailinglist...@gmail.com>>
>>>> > > > wrote:
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> In my structured streaming job I am updating Spark
>>>> > > > Accumulators in
>>>> > > > > > > >>>>>> the updateAcrossEvents method but they are always 0
>>>> when I
>>>> > > > try to
>>>> > > > > > print
>>>> > > > > > > >>>>>> them in my StreamingListener. Here's the code:
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>> > > > > > > >>>>>> updateAcrossEvents
>>>> > > > > > > >>>>>>   )
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> The accumulators get incremented in
>>>> 'updateAcrossEvents'.
>>>> > > > I've a
>>>> > > > > > > >>>>>> StreamingListener which writes values of the
>>>> accumulators in
>>>> > > > > > > >>>>>> 'onQueryProgress' method but in this method the
>>>> Accumulators
>>>> > > > are
>>>> > > > > > ALWAYS
>>>> > > > > > > >>>>>> ZERO!
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> When I added log statements in the
>>>> updateAcrossEvents, I
>>>> > > > could see
>>>> > > > > > > >>>>>> that these accumulators are getting incremented as
>>>> expected.
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> This only happens when I run in the 'Cluster' mode.
>>>> In Local
>>>> > > > mode
>>>> > > > > > it
>>>> > > > > > > >>>>>> works fine which implies that the Accumulators are
>>>> not getting
>>>> > > > > > distributed
>>>> > > > > > > >>>>>> correctly - or something like that!
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>> Note: I've seen quite a few answers on the Web that
>>>> tell me to
>>>> > > > > > > >>>>>> perform an "Action". That's not a solution here.
>>>> This is a
>>>> > > > > > 'Stateful
>>>> > > > > > > >>>>>> Structured Streaming' job. Yes, I am also
>>>> 'registering' them
>>>> > > > in
>>>> > > > > > > >>>>>> SparkContext.
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > > > >>>>>>
>>>> > > > > >
>>>> > > >
>>>> >
>>>> > -
>>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>> >
>>>>
>>>


Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Something Something
gTimeTimeout())(
>>> > > > > > > > updateAcrossEvents
>>> > > > > > > >   )
>>> > > > > > > >
>>> > > > > > > > And updating the Accumulator inside 'updateAcrossEvents'?
>>> We're
>>> > > > > > experiencing this only under 'Stateful Structured Streaming'.
>>> In other
>>> > > > > > streaming applications it works as expected.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V <
>>> srini@gmail.com>
>>> > > > > > wrote:
>>> > > > > > > >
>>> > > > > > > >> Yes, I am talking about Application specific Accumulators.
>>> > > > Actually I
>>> > > > > > am
>>> > > > > > > >> getting the values printed in my driver log as well as
>>> sent to
>>> > > > > > Grafana. Not
>>> > > > > > > >> sure where and when I saw 0 before. My deploy mode is
>>> “client” on
>>> > > > a
>>> > > > > > yarn
>>> > > > > > > >> cluster(not local Mac) where I submit from master node.
>>> It should
>>> > > > > > work the
>>> > > > > > > >> same for cluster mode as well.
>>> > > > > > > >> Create accumulators like this:
>>> > > > > > > >> AccumulatorV2 accumulator =
>>> sparkContext.longAccumulator(name);
>>> > > > > > > >>
>>> > > > > > > >>
>>> > > > > > > >> On Tue, May 26, 2020 at 8:42 PM Something Something <
>>> > > > > > > >> mailinglist...@gmail.com> wrote:
>>> > > > > > > >>
>>> > > > > > > >>> Hmm... how would they go to Graphana if they are not
>>> getting
>>> > > > > > computed in
>>> > > > > > > >>> your code? I am talking about the Application Specific
>>> > > > Accumulators.
>>> > > > > > The
>>> > > > > > > >>> other standard counters such as
>>> > > > 'event.progress.inputRowsPerSecond'
>>> > > > > > are
>>> > > > > > > >>> getting populated correctly!
>>> > > > > > > >>>
>>> > > > > > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V <
>>> srini@gmail.com>
>>> > > > > > wrote:
>>> > > > > > > >>>
>>> > > > > > > >>>> Hello,
>>> > > > > > > >>>> Even for me it comes as 0 when I print in
>>> OnQueryProgress. I use
>>> > > > > > > >>>> LongAccumulator as well. Yes, it prints on my local but
>>> not on
>>> > > > > > cluster.
>>> > > > > > > >>>> But one consolation is that when I send metrics to
>>> Graphana, the
>>> > > > > > values
>>> > > > > > > >>>> are coming there.
>>> > > > > > > >>>>
>>> > > > > > > >>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>> > > > > > > >>>> mailinglist...@gmail.com> wrote:
>>> > > > > > > >>>>
>>> > > > > > > >>>>> No this is not working even if I use LongAccumulator.
>>> > > > > > > >>>>>
>>> > > > > > > >>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei <
>>> wezh...@outlook.com
>>> > > > >
>>> > > > > > wrote:
>>> > > > > > > >>>>>
>>> > > > > > > >>>>>> There is a restriction in AccumulatorV2 API [1], the
>>> OUT type
>>> > > > > > should
>>> > > > > > > >>>>>> be atomic or thread safe. I'm wondering if the
>>> implementation
>>> > > > for
>>> > > > > > > >>>>>> `java.util.Map[T, Long]` can meet it or not. Is there
>>> any
>>> > > > chance
>>> > > > > > to replace
>>> > > > > > > >>>>>> CollectionLongAccumulator by CollectionAccumulator[2]
>>> or
>>> > > > > > LongAccumulator[3]
>>> > > > > > > >>>>>> and test if the StreamingListener and other codes are
>>> able to
>>> > > > > > work?
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> ---
>>> > > > > > > >>>>>> Cheers,
>>> > > > > > > >>>>>> -z
>>> > > > > > > >>>>>> [1]
>>> > > > > > > >>>>>>
>>> > > > > >
>>> > > >
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>> > > > > > > >>>>>> [2]
>>> > > > > > > >>>>>>
>>> > > > > >
>>> > > >
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>> > > > > > > >>>>>> [3]
>>> > > > > > > >>>>>>
>>> > > > > >
>>> > > >
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> 
>>> > > > > > > >>>>>> From: Something Something 
>>> > > > > > > >>>>>> Sent: Saturday, May 16, 2020 0:38
>>> > > > > > > >>>>>> To: spark-user
>>> > > > > > > >>>>>> Subject: Re: Using Spark Accumulators with Structured
>>> > > > Streaming
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> Can someone from Spark Development team tell me if
>>> this
>>> > > > > > functionality
>>> > > > > > > >>>>>> is supported and tested? I've spent a lot of time on
>>> this but
>>> > > > > > can't get it
>>> > > > > > > >>>>>> to work. Just to add more context, we've our own
>>> Accumulator
>>> > > > > > class that
>>> > > > > > > >>>>>> extends from AccumulatorV2. In this class we keep
>>> track of
>>> > > > one or
>>> > > > > > more
>>> > > > > > > >>>>>> accumulators. Here's the definition:
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> class CollectionLongAccumulator[T]
>>> > > > > > > >>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> When the job begins we register an instance of this
>>> class:
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> spark.sparkContext.register(myAccumulator,
>>> "MyAccumulator")
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> Is this working under Structured Streaming?
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> I will keep looking for alternate approaches but any
>>> help
>>> > > > would be
>>> > > > > > > >>>>>> greatly appreciated. Thanks.
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
>>> > > > > > > >>>>>> mailinglist...@gmail.com>> mailinglist...@gmail.com>>
>>> > > > wrote:
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> In my structured streaming job I am updating Spark
>>> > > > Accumulators in
>>> > > > > > > >>>>>> the updateAcrossEvents method but they are always 0
>>> when I
>>> > > > try to
>>> > > > > > print
>>> > > > > > > >>>>>> them in my StreamingListener. Here's the code:
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>> > > > > > > >>>>>> updateAcrossEvents
>>> > > > > > > >>>>>>   )
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> The accumulators get incremented in
>>> 'updateAcrossEvents'.
>>> > > > I've a
>>> > > > > > > >>>>>> StreamingListener which writes values of the
>>> accumulators in
>>> > > > > > > >>>>>> 'onQueryProgress' method but in this method the
>>> Accumulators
>>> > > > are
>>> > > > > > ALWAYS
>>> > > > > > > >>>>>> ZERO!
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> When I added log statements in the
>>> updateAcrossEvents, I
>>> > > > could see
>>> > > > > > > >>>>>> that these accumulators are getting incremented as
>>> expected.
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> This only happens when I run in the 'Cluster' mode.
>>> In Local
>>> > > > mode
>>> > > > > > it
>>> > > > > > > >>>>>> works fine which implies that the Accumulators are
>>> not getting
>>> > > > > > distributed
>>> > > > > > > >>>>>> correctly - or something like that!
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>> Note: I've seen quite a few answers on the Web that
>>> tell me to
>>> > > > > > > >>>>>> perform an "Action". That's not a solution here. This
>>> is a
>>> > > > > > 'Stateful
>>> > > > > > > >>>>>> Structured Streaming' job. Yes, I am also
>>> 'registering' them
>>> > > > in
>>> > > > > > > >>>>>> SparkContext.
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > > > >>>>>>
>>> > > > > >
>>> > > >
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>>
>>


Re: Using Spark Accumulators with Structured Streaming

2020-06-08 Thread Srinivas V
> > > > >
>> > > > > > AFAICT, the accumulator is incremented well. There is a way to
>> verify
>> > > > that
>> > > > > > in cluster like this:
>> > > > > > ```
>> > > > > > // Add the following while loop before invoking
>> awaitTermination
>> > > > > > while (true) {
>> > > > > >   println("My acc: " + myAcc.value)
>> > > > > >   Thread.sleep(5 * 1000)
>> > > > > > }
>> > > > > >
>> > > > > > //query.awaitTermination()
>> > > > > > ```
>> > > > > >
>> > > > > > And the accumulator value updated can be found from driver
>> stdout.
>> > > > > >
>> > > > > > --
>> > > > > > Cheers,
>> > > > > > -z
>> > > > > >
>> > > > > > On Thu, 28 May 2020 17:12:48 +0530
>> > > > > > Srinivas V  wrote:
>> > > > > >
>> > > > > > > yes, I am using stateful structured streaming. Yes similar to
>> what
>> > > > you
>> > > > > > do.
>> > > > > > > This is in Java
>> > > > > > > I do it this way:
>> > > > > > > Dataset productUpdates = watermarkedDS
>> > > > > > > .groupByKey(
>> > > > > > > (MapFunction> String>) event
>> > > > ->
>> > > > > > > event.getId(), Encoders.STRING())
>> > > > > > > .mapGroupsWithState(
>> > > > > > > new
>> > > > > > >
>> > > > > >
>> > > >
>> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
>> > > > > > > appConfig, accumulators),
>> > > > > > > Encoders.bean(ModelStateInfo.class),
>> > > > > > > Encoders.bean(ModelUpdate.class),
>> > > > > > >
>>  GroupStateTimeout.ProcessingTimeTimeout());
>> > > > > > >
>> > > > > > > StateUpdateTask contains the update method.
>> > > > > > >
>> > > > > > > On Thu, May 28, 2020 at 4:41 AM Something Something <
>> > > > > > > mailinglist...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Yes, that's exactly how I am creating them.
>> > > > > > > >
>> > > > > > > > Question... Are you using 'Stateful Structured Streaming'
>> in which
>> > > > > > you've
>> > > > > > > > something like this?
>> > > > > > > >
>> > > > > > > >
>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>> > > > > > > > updateAcrossEvents
>> > > > > > > >   )
>> > > > > > > >
>> > > > > > > > And updating the Accumulator inside 'updateAcrossEvents'?
>> We're
>> > > > > > experiencing this only under 'Stateful Structured Streaming'.
>> In other
>> > > > > > streaming applications it works as expected.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V <
>> srini@gmail.com>
>> > > > > > wrote:
>> > > > > > > >
>> > > > > > > >> Yes, I am talking about Application specific Accumulators.
>> > > > Actually I
>> > > > > > am
>> > > > > > > >> getting the values printed in my driver log as well as
>> sent to
>> > > > > > Grafana. Not
>> > > > > > > >> sure where and when I saw 0 before. My deploy mode is
>> “client” on
>> > > > a
>> > > > > > yarn
>> > > > > > > >> cluster(not local Mac) where I submit from master node. It
>> should
>> > > >

Re: Using Spark Accumulators with Structured Streaming

2020-06-07 Thread Something Something
> > > > new
> > > > > > >
> > > > > >
> > > >
> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
> > > > > > > appConfig, accumulators),
> > > > > > > Encoders.bean(ModelStateInfo.class),
> > > > > > > Encoders.bean(ModelUpdate.class),
> > > > > > >
>  GroupStateTimeout.ProcessingTimeTimeout());
> > > > > > >
> > > > > > > StateUpdateTask contains the update method.
> > > > > > >
> > > > > > > On Thu, May 28, 2020 at 4:41 AM Something Something <
> > > > > > > mailinglist...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Yes, that's exactly how I am creating them.
> > > > > > > >
> > > > > > > > Question... Are you using 'Stateful Structured Streaming' in
> which
> > > > > > you've
> > > > > > > > something like this?
> > > > > > > >
> > > > > > > >
> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
> > > > > > > > updateAcrossEvents
> > > > > > > >   )
> > > > > > > >
> > > > > > > > And updating the Accumulator inside 'updateAcrossEvents'?
> We're
> > > > > > experiencing this only under 'Stateful Structured Streaming'. In
> other
> > > > > > streaming applications it works as expected.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V <
> srini@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Yes, I am talking about Application specific Accumulators.
> > > > Actually I
> > > > > > am
> > > > > > > >> getting the values printed in my driver log as well as sent
> to
> > > > > > Grafana. Not
> > > > > > > >> sure where and when I saw 0 before. My deploy mode is
> “client” on
> > > > a
> > > > > > yarn
> > > > > > > >> cluster(not local Mac) where I submit from master node. It
> should
> > > > > > work the
> > > > > > > >> same for cluster mode as well.
> > > > > > > >> Create accumulators like this:
> > > > > > > >> AccumulatorV2 accumulator =
> sparkContext.longAccumulator(name);
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Tue, May 26, 2020 at 8:42 PM Something Something <
> > > > > > > >> mailinglist...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >>> Hmm... how would they go to Graphana if they are not
> getting
> > > > > > computed in
> > > > > > > >>> your code? I am talking about the Application Specific
> > > > Accumulators.
> > > > > > The
> > > > > > > >>> other standard counters such as
> > > > 'event.progress.inputRowsPerSecond'
> > > > > > are
> > > > > > > >>> getting populated correctly!
> > > > > > > >>>
> > > > > > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V <
> srini@gmail.com>
> > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> Hello,
> > > > > > > >>>> Even for me it comes as 0 when I print in
> OnQueryProgress. I use
> > > > > > > >>>> LongAccumulator as well. Yes, it prints on my local but
> not on
> > > > > > cluster.
> > > > > > > >>>> But one consolation is that when I send metrics to
> Graphana, the
> > > > > > values
> > > > > > > >>>> are coming there.
> > > > > > > >>>>
> > > > > > > >>>> On Tue, May 26, 2020 at 3:10 AM Somet

Re: Using Spark Accumulators with Structured Streaming

2020-06-03 Thread ZHANG Wei
> > > > > updateAcrossEvents
> > > > > > >   )
> > > > > > >
> > > > > > > And updating the Accumulator inside 'updateAcrossEvents'? We're
> > > > > experiencing this only under 'Stateful Structured Streaming'. In other
> > > > > streaming applications it works as expected.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V 
> > > > > wrote:
> > > > > > >
> > > > > > >> Yes, I am talking about Application specific Accumulators.
> > > Actually I
> > > > > am
> > > > > > >> getting the values printed in my driver log as well as sent to
> > > > > Grafana. Not
> > > > > > >> sure where and when I saw 0 before. My deploy mode is “client” on
> > > a
> > > > > yarn
> > > > > > >> cluster(not local Mac) where I submit from master node. It should
> > > > > work the
> > > > > > >> same for cluster mode as well.
> > > > > > >> Create accumulators like this:
> > > > > > >> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, May 26, 2020 at 8:42 PM Something Something <
> > > > > > >> mailinglist...@gmail.com> wrote:
> > > > > > >>
> > > > > > >>> Hmm... how would they go to Graphana if they are not getting
> > > > > computed in
> > > > > > >>> your code? I am talking about the Application Specific
> > > Accumulators.
> > > > > The
> > > > > > >>> other standard counters such as
> > > 'event.progress.inputRowsPerSecond'
> > > > > are
> > > > > > >>> getting populated correctly!
> > > > > > >>>
> > > > > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
> > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Hello,
> > > > > > >>>> Even for me it comes as 0 when I print in OnQueryProgress. I 
> > > > > > >>>> use
> > > > > > >>>> LongAccumulator as well. Yes, it prints on my local but not on
> > > > > cluster.
> > > > > > >>>> But one consolation is that when I send metrics to Graphana, 
> > > > > > >>>> the
> > > > > values
> > > > > > >>>> are coming there.
> > > > > > >>>>
> > > > > > >>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
> > > > > > >>>> mailinglist...@gmail.com> wrote:
> > > > > > >>>>
> > > > > > >>>>> No this is not working even if I use LongAccumulator.
> > > > > > >>>>>
> > > > > > >>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  > > >
> > > > > wrote:
> > > > > > >>>>>
> > > > > > >>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
> > > > > should
> > > > > > >>>>>> be atomic or thread safe. I'm wondering if the implementation
> > > for
> > > > > > >>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any
> > > chance
> > > > > to replace
> > > > > > >>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or
> > > > > LongAccumulator[3]
> > > > > > >>>>>> and test if the StreamingListener and other codes are able to
> > > > > work?
> > > > > > >>>>>>
> > > > > > >>>>>> ---
> > > > > > >>>>>> Cheers,
> > > > > > >>>>>> -z
> > > > > > >>>>>> [1]
> > > > > > >>>>>>
> > > > >
> > > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2

Re: Using Spark Accumulators with Structured Streaming

2020-06-01 Thread ZHANG Wei
>
> > > > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
> > > > wrote:
> > > > > >>>
> > > > > >>>> Hello,
> > > > > >>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
> > > > > >>>> LongAccumulator as well. Yes, it prints on my local but not on
> > > > cluster.
> > > > > >>>> But one consolation is that when I send metrics to Graphana, the
> > > > values
> > > > > >>>> are coming there.
> > > > > >>>>
> > > > > >>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
> > > > > >>>> mailinglist...@gmail.com> wrote:
> > > > > >>>>
> > > > > >>>>> No this is not working even if I use LongAccumulator.
> > > > > >>>>>
> > > > > >>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  > >
> > > > wrote:
> > > > > >>>>>
> > > > > >>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
> > > > should
> > > > > >>>>>> be atomic or thread safe. I'm wondering if the implementation
> > for
> > > > > >>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any
> > chance
> > > > to replace
> > > > > >>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or
> > > > LongAccumulator[3]
> > > > > >>>>>> and test if the StreamingListener and other codes are able to
> > > > work?
> > > > > >>>>>>
> > > > > >>>>>> ---
> > > > > >>>>>> Cheers,
> > > > > >>>>>> -z
> > > > > >>>>>> [1]
> > > > > >>>>>>
> > > >
> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
> > > > > >>>>>> [2]
> > > > > >>>>>>
> > > >
> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
> > > > > >>>>>> [3]
> > > > > >>>>>>
> > > >
> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
> > > > > >>>>>>
> > > > > >>>>>> 
> > > > > >>>>>> From: Something Something 
> > > > > >>>>>> Sent: Saturday, May 16, 2020 0:38
> > > > > >>>>>> To: spark-user
> > > > > >>>>>> Subject: Re: Using Spark Accumulators with Structured
> > Streaming
> > > > > >>>>>>
> > > > > >>>>>> Can someone from Spark Development team tell me if this
> > > > functionality
> > > > > >>>>>> is supported and tested? I've spent a lot of time on this but
> > > > can't get it
> > > > > >>>>>> to work. Just to add more context, we've our own Accumulator
> > > > class that
> > > > > >>>>>> extends from AccumulatorV2. In this class we keep track of
> > one or
> > > > more
> > > > > >>>>>> accumulators. Here's the definition:
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> class CollectionLongAccumulator[T]
> > > > > >>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
> > > > > >>>>>>
> > > > > >>>>>> When the job begins we register an instance of this class:
> > > > > >>>>>>
> > > > > >>>>>> spark.sparkContext.register(myAccumulator, "MyAccumulator")
> > > > > >>>>>>
> > > > > >>>>>> Is this working under Structured Streaming?
> > > > > >>>>>>
> > > > > >>>>>> I will keep looking for alternate approaches but any help
> > would be
> > > > > >>>>>> greatly appreciated. Thanks.
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
> > > > > >>>>>> mailinglist...@gmail.com<mailto:mailinglist...@gmail.com>>
> > wrote:
> > > > > >>>>>>
> > > > > >>>>>> In my structured streaming job I am updating Spark
> > Accumulators in
> > > > > >>>>>> the updateAcrossEvents method but they are always 0 when I
> > try to
> > > > print
> > > > > >>>>>> them in my StreamingListener. Here's the code:
> > > > > >>>>>>
> > > > > >>>>>>
> > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
> > > > > >>>>>> updateAcrossEvents
> > > > > >>>>>>   )
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> The accumulators get incremented in 'updateAcrossEvents'.
> > I've a
> > > > > >>>>>> StreamingListener which writes values of the accumulators in
> > > > > >>>>>> 'onQueryProgress' method but in this method the Accumulators
> > are
> > > > ALWAYS
> > > > > >>>>>> ZERO!
> > > > > >>>>>>
> > > > > >>>>>> When I added log statements in the updateAcrossEvents, I
> > could see
> > > > > >>>>>> that these accumulators are getting incremented as expected.
> > > > > >>>>>>
> > > > > >>>>>> This only happens when I run in the 'Cluster' mode. In Local
> > mode
> > > > it
> > > > > >>>>>> works fine which implies that the Accumulators are not getting
> > > > distributed
> > > > > >>>>>> correctly - or something like that!
> > > > > >>>>>>
> > > > > >>>>>> Note: I've seen quite a few answers on the Web that tell me to
> > > > > >>>>>> perform an "Action". That's not a solution here. This is a
> > > > 'Stateful
> > > > > >>>>>> Structured Streaming' job. Yes, I am also 'registering' them
> > in
> > > > > >>>>>> SparkContext.
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > >
> >




Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Srinivas V
>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 27, 2020 at 9:01 AM Srinivas V 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, I am talking about Application specific Accumulators. Actually
>>>>>>>> I am getting the values printed in my driver log as well as sent to
>>>>>>>> Grafana. Not sure where and when I saw 0 before. My deploy mode is 
>>>>>>>> “client”
>>>>>>>> on a yarn cluster(not local Mac) where I submit from master node. It 
>>>>>>>> should
>>>>>>>> work the same for cluster mode as well.
>>>>>>>> Create accumulators like this:
>>>>>>>> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 26, 2020 at 8:42 PM Something Something <
>>>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hmm... how would they go to Graphana if they are not getting
>>>>>>>>> computed in your code? I am talking about the Application Specific
>>>>>>>>> Accumulators. The other standard counters such as
>>>>>>>>> 'event.progress.inputRowsPerSecond' are getting populated correctly!
>>>>>>>>>
>>>>>>>>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
>>>>>>>>>> LongAccumulator as well. Yes, it prints on my local but not on 
>>>>>>>>>> cluster.
>>>>>>>>>> But one consolation is that when I send metrics to Graphana, the
>>>>>>>>>> values are coming there.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>>>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> No this is not working even if I use LongAccumulator.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
>>>>>>>>>>>> should be atomic or thread safe. I'm wondering if the 
>>>>>>>>>>>> implementation for
>>>>>>>>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance 
>>>>>>>>>>>> to replace
>>>>>>>>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or 
>>>>>>>>>>>> LongAccumulator[3]
>>>>>>>>>>>> and test if the StreamingListener and other codes are able to work?
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> -z
>>>>>>>>>>>> [1]
>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>>>>>>>>>>> [2]
>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>>>>>>>>>>> [3]
>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>>>>>>>>>>>
>>>>>>>>>>>> 
>>>>>>>>>>>> From: Something Something 
>>>>>>>>>>>> Sent: Saturday, May 16, 2020 0:38
>>>>>>>>>>>> To: spark-user
>>>>>>>>>>>> Subject: Re: Using Spark Accumulators with Structured Streaming
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone from Spark Development team tell me if this
>>>>>>>>>>>> functionality is supported and tested? I've spent a lot of time on 
>>>>>>>>>>>> this but
>>>>>>>>>>>> can't get it to work. Just to add more context, we've our own 
>>>>>>>>>>>> Accumulator
>>>>>>>>>>>> class that extends from AccumulatorV2. In this class we keep track 
>>>>>>>>>>>> of one
>>>>>>>>>>>> or more accumulators. Here's the definition:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> class CollectionLongAccumulator[T]
>>>>>>>>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
>>>>>>>>>>>>
>>>>>>>>>>>> When the job begins we register an instance of this class:
>>>>>>>>>>>>
>>>>>>>>>>>> spark.sparkContext.register(myAccumulator, "MyAccumulator")
>>>>>>>>>>>>
>>>>>>>>>>>> Is this working under Structured Streaming?
>>>>>>>>>>>>
>>>>>>>>>>>> I will keep looking for alternate approaches but any help would
>>>>>>>>>>>> be greatly appreciated. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
>>>>>>>>>>>> mailinglist...@gmail.com<mailto:mailinglist...@gmail.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> In my structured streaming job I am updating Spark Accumulators
>>>>>>>>>>>> in the updateAcrossEvents method but they are always 0 when I try 
>>>>>>>>>>>> to print
>>>>>>>>>>>> them in my StreamingListener. Here's the code:
>>>>>>>>>>>>
>>>>>>>>>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>>>>>>>>>> updateAcrossEvents
>>>>>>>>>>>>   )
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The accumulators get incremented in 'updateAcrossEvents'. I've
>>>>>>>>>>>> a StreamingListener which writes values of the accumulators in
>>>>>>>>>>>> 'onQueryProgress' method but in this method the Accumulators are 
>>>>>>>>>>>> ALWAYS
>>>>>>>>>>>> ZERO!
>>>>>>>>>>>>
>>>>>>>>>>>> When I added log statements in the updateAcrossEvents, I could
>>>>>>>>>>>> see that these accumulators are getting incremented as expected.
>>>>>>>>>>>>
>>>>>>>>>>>> This only happens when I run in the 'Cluster' mode. In Local
>>>>>>>>>>>> mode it works fine which implies that the Accumulators are not 
>>>>>>>>>>>> getting
>>>>>>>>>>>> distributed correctly - or something like that!
>>>>>>>>>>>>
>>>>>>>>>>>> Note: I've seen quite a few answers on the Web that tell me to
>>>>>>>>>>>> perform an "Action". That's not a solution here. This is a 
>>>>>>>>>>>> 'Stateful
>>>>>>>>>>>> Structured Streaming' job. Yes, I am also 'registering' them in
>>>>>>>>>>>> SparkContext.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>


Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Something Something
ent”
>>>>>>> on a yarn cluster(not local Mac) where I submit from master node. It 
>>>>>>> should
>>>>>>> work the same for cluster mode as well.
>>>>>>> Create accumulators like this:
>>>>>>> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 26, 2020 at 8:42 PM Something Something <
>>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hmm... how would they go to Graphana if they are not getting
>>>>>>>> computed in your code? I am talking about the Application Specific
>>>>>>>> Accumulators. The other standard counters such as
>>>>>>>> 'event.progress.inputRowsPerSecond' are getting populated correctly!
>>>>>>>>
>>>>>>>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
>>>>>>>>> LongAccumulator as well. Yes, it prints on my local but not on 
>>>>>>>>> cluster.
>>>>>>>>> But one consolation is that when I send metrics to Graphana, the
>>>>>>>>> values are coming there.
>>>>>>>>>
>>>>>>>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> No this is not working even if I use LongAccumulator.
>>>>>>>>>>
>>>>>>>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  wrote:
>>>>>>>>>>
>>>>>>>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
>>>>>>>>>>> should be atomic or thread safe. I'm wondering if the 
>>>>>>>>>>> implementation for
>>>>>>>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance to 
>>>>>>>>>>> replace
>>>>>>>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or 
>>>>>>>>>>> LongAccumulator[3]
>>>>>>>>>>> and test if the StreamingListener and other codes are able to work?
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> Cheers,
>>>>>>>>>>> -z
>>>>>>>>>>> [1]
>>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>>>>>>>>>> [2]
>>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>>>>>>>>>> [3]
>>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>>>>>>>>>>
>>>>>>>>>>> 
>>>>>>>>>>> From: Something Something 
>>>>>>>>>>> Sent: Saturday, May 16, 2020 0:38
>>>>>>>>>>> To: spark-user
>>>>>>>>>>> Subject: Re: Using Spark Accumulators with Structured Streaming
>>>>>>>>>>>
>>>>>>>>>>> Can someone from Spark Development team tell me if this
>>>>>>>>>>> functionality is supported and tested? I've spent a lot of time on 
>>>>>>>>>>> this but
>>>>>>>>>>> can't get it to work. Just to add more context, we've our own 
>>>>>>>>>>> Accumulator
>>>>>>>>>>> class that extends from AccumulatorV2. In this class we keep track 
>>>>>>>>>>> of one
>>>>>>>>>>> or more accumulators. Here's the definition:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> class CollectionLongAccumulator[T]
>>>>>>>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
>>>>>>>>>>>
>>>>>>>>>>> When the job begins we register an instance of this class:
>>>>>>>>>>>
>>>>>>>>>>> spark.sparkContext.register(myAccumulator, "MyAccumulator")
>>>>>>>>>>>
>>>>>>>>>>> Is this working under Structured Streaming?
>>>>>>>>>>>
>>>>>>>>>>> I will keep looking for alternate approaches but any help would
>>>>>>>>>>> be greatly appreciated. Thanks.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
>>>>>>>>>>> mailinglist...@gmail.com<mailto:mailinglist...@gmail.com>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> In my structured streaming job I am updating Spark Accumulators
>>>>>>>>>>> in the updateAcrossEvents method but they are always 0 when I try 
>>>>>>>>>>> to print
>>>>>>>>>>> them in my StreamingListener. Here's the code:
>>>>>>>>>>>
>>>>>>>>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>>>>>>>>> updateAcrossEvents
>>>>>>>>>>>   )
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The accumulators get incremented in 'updateAcrossEvents'. I've a
>>>>>>>>>>> StreamingListener which writes values of the accumulators in
>>>>>>>>>>> 'onQueryProgress' method but in this method the Accumulators are 
>>>>>>>>>>> ALWAYS
>>>>>>>>>>> ZERO!
>>>>>>>>>>>
>>>>>>>>>>> When I added log statements in the updateAcrossEvents, I could
>>>>>>>>>>> see that these accumulators are getting incremented as expected.
>>>>>>>>>>>
>>>>>>>>>>> This only happens when I run in the 'Cluster' mode. In Local
>>>>>>>>>>> mode it works fine which implies that the Accumulators are not 
>>>>>>>>>>> getting
>>>>>>>>>>> distributed correctly - or something like that!
>>>>>>>>>>>
>>>>>>>>>>> Note: I've seen quite a few answers on the Web that tell me to
>>>>>>>>>>> perform an "Action". That's not a solution here. This is a 'Stateful
>>>>>>>>>>> Structured Streaming' job. Yes, I am also 'registering' them in
>>>>>>>>>>> SparkContext.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>


Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Srinivas V
'event.progress.inputRowsPerSecond' are getting populated correctly!
>>>>>>>
>>>>>>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
>>>>>>>> LongAccumulator as well. Yes, it prints on my local but not on cluster.
>>>>>>>> But one consolation is that when I send metrics to Graphana, the
>>>>>>>> values are coming there.
>>>>>>>>
>>>>>>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> No this is not working even if I use LongAccumulator.
>>>>>>>>>
>>>>>>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  wrote:
>>>>>>>>>
>>>>>>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
>>>>>>>>>> should be atomic or thread safe. I'm wondering if the implementation 
>>>>>>>>>> for
>>>>>>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance to 
>>>>>>>>>> replace
>>>>>>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or 
>>>>>>>>>> LongAccumulator[3]
>>>>>>>>>> and test if the StreamingListener and other codes are able to work?
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Cheers,
>>>>>>>>>> -z
>>>>>>>>>> [1]
>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>>>>>>>>> [2]
>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>>>>>>>>> [3]
>>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>>>>>>>>>
>>>>>>>>>> 
>>>>>>>>>> From: Something Something 
>>>>>>>>>> Sent: Saturday, May 16, 2020 0:38
>>>>>>>>>> To: spark-user
>>>>>>>>>> Subject: Re: Using Spark Accumulators with Structured Streaming
>>>>>>>>>>
>>>>>>>>>> Can someone from Spark Development team tell me if this
>>>>>>>>>> functionality is supported and tested? I've spent a lot of time on 
>>>>>>>>>> this but
>>>>>>>>>> can't get it to work. Just to add more context, we've our own 
>>>>>>>>>> Accumulator
>>>>>>>>>> class that extends from AccumulatorV2. In this class we keep track 
>>>>>>>>>> of one
>>>>>>>>>> or more accumulators. Here's the definition:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> class CollectionLongAccumulator[T]
>>>>>>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
>>>>>>>>>>
>>>>>>>>>> When the job begins we register an instance of this class:
>>>>>>>>>>
>>>>>>>>>> spark.sparkContext.register(myAccumulator, "MyAccumulator")
>>>>>>>>>>
>>>>>>>>>> Is this working under Structured Streaming?
>>>>>>>>>>
>>>>>>>>>> I will keep looking for alternate approaches but any help would
>>>>>>>>>> be greatly appreciated. Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
>>>>>>>>>> mailinglist...@gmail.com<mailto:mailinglist...@gmail.com>> wrote:
>>>>>>>>>>
>>>>>>>>>> In my structured streaming job I am updating Spark Accumulators
>>>>>>>>>> in the updateAcrossEvents method but they are always 0 when I try to 
>>>>>>>>>> print
>>>>>>>>>> them in my StreamingListener. Here's the code:
>>>>>>>>>>
>>>>>>>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>>>>>>>> updateAcrossEvents
>>>>>>>>>>   )
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The accumulators get incremented in 'updateAcrossEvents'. I've a
>>>>>>>>>> StreamingListener which writes values of the accumulators in
>>>>>>>>>> 'onQueryProgress' method but in this method the Accumulators are 
>>>>>>>>>> ALWAYS
>>>>>>>>>> ZERO!
>>>>>>>>>>
>>>>>>>>>> When I added log statements in the updateAcrossEvents, I could
>>>>>>>>>> see that these accumulators are getting incremented as expected.
>>>>>>>>>>
>>>>>>>>>> This only happens when I run in the 'Cluster' mode. In Local mode
>>>>>>>>>> it works fine which implies that the Accumulators are not getting
>>>>>>>>>> distributed correctly - or something like that!
>>>>>>>>>>
>>>>>>>>>> Note: I've seen quite a few answers on the Web that tell me to
>>>>>>>>>> perform an "Action". That's not a solution here. This is a 'Stateful
>>>>>>>>>> Structured Streaming' job. Yes, I am also 'registering' them in
>>>>>>>>>> SparkContext.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>


Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Something Something
>>>>>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>>
>>>>>>>> No this is not working even if I use LongAccumulator.
>>>>>>>>
>>>>>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  wrote:
>>>>>>>>
>>>>>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
>>>>>>>>> should be atomic or thread safe. I'm wondering if the implementation 
>>>>>>>>> for
>>>>>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance to 
>>>>>>>>> replace
>>>>>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or 
>>>>>>>>> LongAccumulator[3]
>>>>>>>>> and test if the StreamingListener and other codes are able to work?
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Cheers,
>>>>>>>>> -z
>>>>>>>>> [1]
>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>>>>>>>> [2]
>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>>>>>>>> [3]
>>>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>> From: Something Something 
>>>>>>>>> Sent: Saturday, May 16, 2020 0:38
>>>>>>>>> To: spark-user
>>>>>>>>> Subject: Re: Using Spark Accumulators with Structured Streaming
>>>>>>>>>
>>>>>>>>> Can someone from Spark Development team tell me if this
>>>>>>>>> functionality is supported and tested? I've spent a lot of time on 
>>>>>>>>> this but
>>>>>>>>> can't get it to work. Just to add more context, we've our own 
>>>>>>>>> Accumulator
>>>>>>>>> class that extends from AccumulatorV2. In this class we keep track of 
>>>>>>>>> one
>>>>>>>>> or more accumulators. Here's the definition:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> class CollectionLongAccumulator[T]
>>>>>>>>> extends AccumulatorV2[T, java.util.Map[T, Long]]
>>>>>>>>>
>>>>>>>>> When the job begins we register an instance of this class:
>>>>>>>>>
>>>>>>>>> spark.sparkContext.register(myAccumulator, "MyAccumulator")
>>>>>>>>>
>>>>>>>>> Is this working under Structured Streaming?
>>>>>>>>>
>>>>>>>>> I will keep looking for alternate approaches but any help would be
>>>>>>>>> greatly appreciated. Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 14, 2020 at 2:36 PM Something Something <
>>>>>>>>> mailinglist...@gmail.com<mailto:mailinglist...@gmail.com>> wrote:
>>>>>>>>>
>>>>>>>>> In my structured streaming job I am updating Spark Accumulators in
>>>>>>>>> the updateAcrossEvents method but they are always 0 when I try to 
>>>>>>>>> print
>>>>>>>>> them in my StreamingListener. Here's the code:
>>>>>>>>>
>>>>>>>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>>>>>>>> updateAcrossEvents
>>>>>>>>>   )
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The accumulators get incremented in 'updateAcrossEvents'. I've a
>>>>>>>>> StreamingListener which writes values of the accumulators in
>>>>>>>>> 'onQueryProgress' method but in this method the Accumulators are 
>>>>>>>>> ALWAYS
>>>>>>>>> ZERO!
>>>>>>>>>
>>>>>>>>> When I added log statements in the updateAcrossEvents, I could see
>>>>>>>>> that these accumulators are getting incremented as expected.
>>>>>>>>>
>>>>>>>>> This only happens when I run in the 'Cluster' mode. In Local mode
>>>>>>>>> it works fine which implies that the Accumulators are not getting
>>>>>>>>> distributed correctly - or something like that!
>>>>>>>>>
>>>>>>>>> Note: I've seen quite a few answers on the Web that tell me to
>>>>>>>>> perform an "Action". That's not a solution here. This is a 'Stateful
>>>>>>>>> Structured Streaming' job. Yes, I am also 'registering' them in
>>>>>>>>> SparkContext.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
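
For readers following the thread: below is a minimal sketch of the kind of thread-safe accumulator that ZHANG Wei's quoted reply points toward -- the same per-key counting idea as the CollectionLongAccumulator under discussion, but with a ConcurrentHashMap-backed OUT type. The class and names are illustrative, not the original application's code, and the sketch does not by itself settle the cluster-mode visibility question debated here.

```
import java.util.concurrent.ConcurrentHashMap;
import org.apache.spark.util.AccumulatorV2;

// Per-key event counter with a thread-safe value type.
public class ConcurrentCountAccumulator
        extends AccumulatorV2<String, ConcurrentHashMap<String, Long>> {

    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    @Override public boolean isZero() { return counts.isEmpty(); }

    @Override public AccumulatorV2<String, ConcurrentHashMap<String, Long>> copy() {
        ConcurrentCountAccumulator that = new ConcurrentCountAccumulator();
        that.counts.putAll(this.counts);
        return that;
    }

    @Override public void reset() { counts.clear(); }

    @Override public void add(String key) {
        // Atomic per-key increment, safe when tasks in one executor share the instance.
        counts.merge(key, 1L, Long::sum);
    }

    @Override public void merge(AccumulatorV2<String, ConcurrentHashMap<String, Long>> other) {
        other.value().forEach((k, v) -> counts.merge(k, v, Long::sum));
    }

    @Override public ConcurrentHashMap<String, Long> value() { return counts; }
}
```

It would be registered once on the driver, for example with spark.sparkContext().register(acc, "MyAccumulator"), and incremented with acc.add(key) from the tasks.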


Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Something Something
> > > ```
> > >
> > > And the accumulator value updated can be found from driver stdout.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > On Thu, 28 May 2020 17:12:48 +0530
> > > Srinivas V  wrote:
> > >
> > > > yes, I am using stateful structured streaming. Yes similar to what
> you
> > > do.
> > > > This is in Java
> > > > I do it this way:
> > > > Dataset productUpdates = watermarkedDS
> > > > .groupByKey(
> > > > (MapFunction) event
> ->
> > > > event.getId(), Encoders.STRING())
> > > > .mapGroupsWithState(
> > > > new
> > > >
> > >
> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
> > > > appConfig, accumulators),
> > > > Encoders.bean(ModelStateInfo.class),
> > > > Encoders.bean(ModelUpdate.class),
> > > > GroupStateTimeout.ProcessingTimeTimeout());
> > > >
> > > > StateUpdateTask contains the update method.
> > > >
> > > > On Thu, May 28, 2020 at 4:41 AM Something Something <
> > > > mailinglist...@gmail.com> wrote:
> > > >
> > > > > Yes, that's exactly how I am creating them.
> > > > >
> > > > > Question... Are you using 'Stateful Structured Streaming' in which
> > > you've
> > > > > something like this?
> > > > >
> > > > > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
> > > > > updateAcrossEvents
> > > > >   )
> > > > >
> > > > > And updating the Accumulator inside 'updateAcrossEvents'? We're
> > > experiencing this only under 'Stateful Structured Streaming'. In other
> > > streaming applications it works as expected.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V 
> > > wrote:
> > > > >
> > > > >> Yes, I am talking about Application specific Accumulators.
> Actually I
> > > am
> > > > >> getting the values printed in my driver log as well as sent to
> > > Grafana. Not
> > > > >> sure where and when I saw 0 before. My deploy mode is “client” on
> a
> > > yarn
> > > > >> cluster (not local Mac) where I submit from the master node. It should
> > > work the
> > > > >> same for cluster mode as well.
> > > > >> Create accumulators like this:
> > > > >> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
> > > > >>
> > > > >>
> > > > >> On Tue, May 26, 2020 at 8:42 PM Something Something <
> > > > >> mailinglist...@gmail.com> wrote:
> > > > >>
> > > > >>> Hmm... how would they go to Grafana if they are not getting
> > > computed in
> > > > >>> your code? I am talking about the Application Specific
> Accumulators.
> > > The
> > > > >>> other standard counters such as
> 'event.progress.inputRowsPerSecond'
> > > are
> > > > >>> getting populated correctly!
> > > > >>>
> > > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
> > > wrote:
> > > > >>>
> > > > >>>> Hello,
> > > > >>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
> > > > >>>> LongAccumulator as well. Yes, it prints on my local but not on
> > > cluster.
> > > > >>>> But one consolation is that when I send metrics to Grafana, the
> > > values
> > > > >>>> are coming there.
> > > > >>>>
> > > > >>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
> > > > >>>> mailinglist...@gmail.com> wrote:
> > > > >>>>
> > > > >>>>> No this is not working even if I use LongAccumulator.
> > > > >>>>>
> > > > >>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  >
> > > wrote:
> > > > >>>>>
> > > 
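
Since much of this thread is about reading accumulator values from the driver while a structured streaming query runs, here is a minimal Scala sketch of that wiring, using the built-in LongAccumulator for simplicity. The names (AccumulatorProgressListener, recordCounter) are illustrative only and do not come from the code under discussion.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.apache.spark.util.LongAccumulator

// Attach a listener that prints an accumulator's value after each micro-batch.
object AccumulatorProgressListener {
  def attach(spark: SparkSession, recordCounter: LongAccumulator): Unit = {
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()

      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        // Accumulator updates reach the driver only when tasks complete and are
        // merged back, so the value printed here reflects finished work.
        println(s"batch=${event.progress.batchId} counted=${recordCounter.value}")
      }

      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    })
  }
}

// Driver-side usage (names assumed):
//   val recordCounter = spark.sparkContext.longAccumulator("recordCounter")
//   AccumulatorProgressListener.attach(spark, recordCounter)
```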

Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Srinivas V
Yes, it is an application-specific class; this is how Java Spark functions work.
You can refer to this code in the documentation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java

public class StateUpdateTask implements MapGroupsWithStateFunction {

@Override
public ModelUpdate call(String productId, Iterator
eventsIterator, GroupState state) {
}
}
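
For comparison, here is a rough Scala sketch of the same pattern: the accumulator is handed to the state-update class on the driver and incremented inside the update method on the executors. All of the types and names below (Event, ModelStateInfo, ModelUpdate, eventsSeen, the timeout handling) are placeholders, since the real classes are not shown in this thread.

```
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import org.apache.spark.util.LongAccumulator

// Placeholder case classes standing in for the beans used in the Java snippet above.
case class Event(id: String, payload: String)
case class ModelStateInfo(count: Long)
case class ModelUpdate(id: String, count: Long)

// The accumulator is created and registered on the driver, captured here, and
// updated on the executors each time the state function runs.
class StateUpdateTask(stateTimeoutMs: Long, eventsSeen: LongAccumulator)
  extends Serializable {

  def update(id: String,
             events: Iterator[Event],
             state: GroupState[ModelStateInfo]): ModelUpdate = {
    val batch = events.size                    // consumes the iterator once
    eventsSeen.add(batch.toLong)               // executor-side accumulator update
    val total = state.getOption.map(_.count).getOrElse(0L) + batch
    state.update(ModelStateInfo(total))
    state.setTimeoutDuration(stateTimeoutMs)   // valid with ProcessingTimeTimeout
    ModelUpdate(id, total)
  }
}

// Wiring (requires `import spark.implicits._` for the encoders):
//   val eventsSeen = spark.sparkContext.longAccumulator("eventsSeen")
//   val task = new StateUpdateTask(stateTimeoutMs, eventsSeen)
//   val updates = watermarkedDS
//     .groupByKey(_.id)
//     .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(task.update _)
```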

On Thu, May 28, 2020 at 10:59 PM Something Something <
mailinglist...@gmail.com> wrote:

> I am assuming StateUpdateTask is your application specific class. Does it
> have 'updateState' method or something? I googled but couldn't find any
> documentation about doing it this way. Can you please direct me to some
> documentation. Thanks.
>
> On Thu, May 28, 2020 at 4:43 AM Srinivas V  wrote:
>
>> yes, I am using stateful structured streaming. Yes similar to what you
>> do. This is in Java
>> I do it this way:
>> Dataset productUpdates = watermarkedDS
>> .groupByKey(
>> (MapFunction) event ->
>> event.getId(), Encoders.STRING())
>> .mapGroupsWithState(
>> new
>> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
>> appConfig, accumulators),
>> Encoders.bean(ModelStateInfo.class),
>> Encoders.bean(ModelUpdate.class),
>> GroupStateTimeout.ProcessingTimeTimeout());
>>
>> StateUpdateTask contains the update method.
>>
>> On Thu, May 28, 2020 at 4:41 AM Something Something <
>> mailinglist...@gmail.com> wrote:
>>
>>> Yes, that's exactly how I am creating them.
>>>
>>> Question... Are you using 'Stateful Structured Streaming' in which
>>> you've something like this?
>>>
>>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>>> updateAcrossEvents
>>>   )
>>>
>>> And updating the Accumulator inside 'updateAcrossEvents'? We're 
>>> experiencing this only under 'Stateful Structured Streaming'. In other 
>>> streaming applications it works as expected.
>>>
>>>
>>>
>>> On Wed, May 27, 2020 at 9:01 AM Srinivas V  wrote:
>>>
>>>> Yes, I am talking about Application specific Accumulators. Actually I
>>>> am getting the values printed in my driver log as well as sent to Grafana.
>>>> Not sure where and when I saw 0 before. My deploy mode is “client” on a
>>>> yarn cluster (not local Mac) where I submit from the master node. It should work
>>>> the same for cluster mode as well.
>>>> Create accumulators like this:
>>>> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
>>>>
>>>>
>>>> On Tue, May 26, 2020 at 8:42 PM Something Something <
>>>> mailinglist...@gmail.com> wrote:
>>>>
>>>>> Hmm... how would they go to Grafana if they are not getting computed
>>>>> in your code? I am talking about the Application Specific Accumulators. 
>>>>> The
>>>>> other standard counters such as 'event.progress.inputRowsPerSecond' are
>>>>> getting populated correctly!
>>>>>
>>>>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
>>>>>> LongAccumulator as well. Yes, it prints on my local but not on cluster.
>>>>>> But one consolation is that when I send metrics to Grafana, the
>>>>>> values are coming there.
>>>>>>
>>>>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>>>> mailinglist...@gmail.com> wrote:
>>>>>>
>>>>>>> No this is not working even if I use LongAccumulator.
>>>>>>>
>>>>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  wrote:
>>>>>>>
>>>>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
>>>>>>>> should be atomic or thread safe. I'm wondering if the implementation 
>>>>>>>> for
>>>>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance to 
>>>>>>>> replace
>>>>>>>> 

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread ZHANG Wei
  .mapGroupsWithState(
> > > new
> > >
> > StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
> > > appConfig, accumulators),
> > > Encoders.bean(ModelStateInfo.class),
> > > Encoders.bean(ModelUpdate.class),
> > > GroupStateTimeout.ProcessingTimeTimeout());
> > >
> > > StateUpdateTask contains the update method.
> > >
> > > On Thu, May 28, 2020 at 4:41 AM Something Something <
> > > mailinglist...@gmail.com> wrote:
> > >
> > > > Yes, that's exactly how I am creating them.
> > > >
> > > > Question... Are you using 'Stateful Structured Streaming' in which
> > you've
> > > > something like this?
> > > >
> > > > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
> > > > updateAcrossEvents
> > > >   )
> > > >
> > > > And updating the Accumulator inside 'updateAcrossEvents'? We're
> > experiencing this only under 'Stateful Structured Streaming'. In other
> > streaming applications it works as expected.
> > > >
> > > >
> > > >
> > > > On Wed, May 27, 2020 at 9:01 AM Srinivas V 
> > wrote:
> > > >
> > > >> Yes, I am talking about Application specific Accumulators. Actually I
> > am
> > > >> getting the values printed in my driver log as well as sent to
> > Grafana. Not
> > > >> sure where and when I saw 0 before. My deploy mode is “client” on a
> > yarn
> > > >> cluster (not local Mac) where I submit from the master node. It should
> > work the
> > > >> same for cluster mode as well.
> > > >> Create accumulators like this:
> > > >> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
> > > >>
> > > >>
> > > >> On Tue, May 26, 2020 at 8:42 PM Something Something <
> > > >> mailinglist...@gmail.com> wrote:
> > > >>
> > > >>> Hmm... how would they go to Grafana if they are not getting
> > computed in
> > > >>> your code? I am talking about the Application Specific Accumulators.
> > The
> > > >>> other standard counters such as 'event.progress.inputRowsPerSecond'
> > are
> > > >>> getting populated correctly!
> > > >>>
> > > >>> On Mon, May 25, 2020 at 8:39 PM Srinivas V 
> > wrote:
> > > >>>
> > > >>>> Hello,
> > > >>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
> > > >>>> LongAccumulator as well. Yes, it prints on my local but not on
> > cluster.
> > > >>>> But one consolation is that when I send metrics to Grafana, the
> > values
> > > >>>> are coming there.
> > > >>>>
> > > >>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
> > > >>>> mailinglist...@gmail.com> wrote:
> > > >>>>
> > > >>>>> No this is not working even if I use LongAccumulator.
> > > >>>>>
> > > >>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei 
> > wrote:
> > > >>>>>
> > > >>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type
> > should
> > > >>>>>> be atomic or thread safe. I'm wondering if the implementation for
> > > >>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance
> > to replace
> > > >>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or
> > LongAccumulator[3]
> > > >>>>>> and test if the StreamingListener and other codes are able to
> > work?
> > > >>>>>>
> > > >>>>>> ---
> > > >>>>>> Cheers,
> > > >>>>>> -z
> > > >>>>>> [1]
> > > >>>>>>
> > > >>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
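
In concrete terms, the swap being suggested is small. A minimal sketch (names are illustrative) of creating the built-in accumulators instead of the custom subclass:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.{CollectionAccumulator, LongAccumulator}

// The built-in accumulators come with thread-safe OUT types, which makes them a
// good A/B test against the custom CollectionLongAccumulator discussed above.
object BuiltInAccumulators {
  def create(spark: SparkSession): (LongAccumulator, CollectionAccumulator[String]) = {
    val longAcc = spark.sparkContext.longAccumulator("MyLongAccumulator")
    val keysAcc = spark.sparkContext.collectionAccumulator[String]("MyKeysAccumulator")
    (longAcc, keysAcc)
  }
}

// Executor side (inside the stateful update function):  longAcc.add(1); keysAcc.add(key)
// Driver side (e.g. in onQueryProgress):                longAcc.value and keysAcc.value
```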

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread Something Something
I am assuming StateUpdateTask is your application-specific class. Does it
have an 'updateState' method or something similar? I googled but couldn't find
any documentation about doing it this way. Can you please point me to some
documentation? Thanks.

On Thu, May 28, 2020 at 4:43 AM Srinivas V  wrote:

> yes, I am using stateful structured streaming. Yes similar to what you do.
> This is in Java
> I do it this way:
> Dataset productUpdates = watermarkedDS
> .groupByKey(
> (MapFunction) event ->
> event.getId(), Encoders.STRING())
> .mapGroupsWithState(
> new
> StateUpdateTask(Long.parseLong(appConfig.getSparkStructuredStreamingConfig().STATE_TIMEOUT),
> appConfig, accumulators),
> Encoders.bean(ModelStateInfo.class),
> Encoders.bean(ModelUpdate.class),
> GroupStateTimeout.ProcessingTimeTimeout());
>
> StateUpdateTask contains the update method.
>
> On Thu, May 28, 2020 at 4:41 AM Something Something <
> mailinglist...@gmail.com> wrote:
>
>> Yes, that's exactly how I am creating them.
>>
>> Question... Are you using 'Stateful Structured Streaming' in which you've
>> something like this?
>>
>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
>> updateAcrossEvents
>>   )
>>
>> And updating the Accumulator inside 'updateAcrossEvents'? We're experiencing 
>> this only under 'Stateful Structured Streaming'. In other streaming 
>> applications it works as expected.
>>
>>
>>
>> On Wed, May 27, 2020 at 9:01 AM Srinivas V  wrote:
>>
>>> Yes, I am talking about Application specific Accumulators. Actually I am
>>> getting the values printed in my driver log as well as sent to Grafana. Not
>>> sure where and when I saw 0 before. My deploy mode is “client” on a yarn
>>> cluster(not local Mac) where I submit from master node. It should work the
>>> same for cluster mode as well.
>>> Create accumulators like this:
>>> AccumulatorV2 accumulator = sparkContext.longAccumulator(name);
>>>
>>>
>>> On Tue, May 26, 2020 at 8:42 PM Something Something <
>>> mailinglist...@gmail.com> wrote:
>>>
>>>> Hmm... how would they go to Grafana if they are not getting computed
>>>> in your code? I am talking about the Application Specific Accumulators. The
>>>> other standard counters such as 'event.progress.inputRowsPerSecond' are
>>>> getting populated correctly!
>>>>
>>>> On Mon, May 25, 2020 at 8:39 PM Srinivas V  wrote:
>>>>
>>>>> Hello,
>>>>> Even for me it comes as 0 when I print in OnQueryProgress. I use
>>>>> LongAccumulator as well. Yes, it prints on my local but not on cluster.
>>>>> But one consolation is that when I send metrics to Grafana, the
>>>>> values are coming there.
>>>>>
>>>>> On Tue, May 26, 2020 at 3:10 AM Something Something <
>>>>> mailinglist...@gmail.com> wrote:
>>>>>
>>>>>> No this is not working even if I use LongAccumulator.
>>>>>>
>>>>>> On Fri, May 15, 2020 at 9:54 PM ZHANG Wei  wrote:
>>>>>>
>>>>>>> There is a restriction in AccumulatorV2 API [1], the OUT type should
>>>>>>> be atomic or thread safe. I'm wondering if the implementation for
>>>>>>> `java.util.Map[T, Long]` can meet it or not. Is there any chance to 
>>>>>>> replace
>>>>>>> CollectionLongAccumulator by CollectionAccumulator[2] or 
>>>>>>> LongAccumulator[3]
>>>>>>> and test if the StreamingListener and other codes are able to work?
>>>>>>>
>>>>>>> ---
>>>>>>> Cheers,
>>>>>>> -z
>>>>>>> [1]
>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2
>>>>>>> [2]
>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator
>>>>>>> [3]
>>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator
>>>>>>>
>>>>>>> 
>>>>>>> From: Something Somet
