Re: Spark Not Connecting

2023-07-12 Thread Artemis User
12, 2023, 6:00 PM Artemis User wrote: The error screenshot doesn't tell much.  Maybe your job wasn't submitted properly.  Make sure your IP/port numbers were defined correctly.  Take a look at the Spark server UI to see what errors occur. On 7/12/23 6:11 AM, timi ayoade

Re: 回复:Re: Build SPARK from source with SBT failed

2023-03-07 Thread Artemis User
Looks like Maven build did find the javac, just can't run it.  So it's not a path problem but a compatibility problem.  Are you doing this on a Mac with M1/M2?  I don't think that Zulu JDK supports Apple silicon.   Your best option would be to use homebrew to install the dev tools (including

Re: Help needed regarding error with 5 node Spark cluster (shuffle error)- Comcast

2023-01-30 Thread Artemis User
Not sure where you get the property name "spark.memory.offHeap.use". The correct one should be "spark.memory.offHeap.enabled".  See https://spark.apache.org/docs/latest/configuration.html#spark-properties for details. On 1/30/23 10:12 AM, Jain, Sanchi wrote: I am not sure if this is the
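For illustration, a minimal sketch (not from the original reply) of setting the corrected property when building a session; the 2g size is purely illustrative:

    import org.apache.spark.sql.SparkSession

    // Hypothetical sketch: enable off-heap memory with the correct property name.
    val spark = SparkSession.builder()
      .appName("offheap-example")
      .config("spark.memory.offHeap.enabled", "true")  // not "spark.memory.offHeap.use"
      .config("spark.memory.offHeap.size", "2g")       // a size > 0 is required once enabled
      .getOrCreate()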

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Artemis User
Try this one:  "select country, city, max(population) from your_table group by country" Please note this returns a table of three columns, instead of two. This is a standard SQL query, and supported by Spark as well. On 12/20/22 3:35 PM, Oliver Ruebenacker wrote: Hello,   Let's say
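A hedged sketch of a window-function alternative that keeps the whole best row per group (df and the column names are assumptions based on the thread, not taken from the original reply):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Keep the single most populous city per country.
    val w = Window.partitionBy("country").orderBy(col("population").desc)
    val best = df.withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")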

Re: Can we upload a csv dataset into Hive using SparkSQL?

2022-12-13 Thread Artemis User
Your DDL statement doesn't look right.  You may want to check the Spark SQL Reference online for how to create table in Hive format (https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-hiveformat.html). You should be able to populate the table directly using CREATE by
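A minimal DDL sketch along the lines of the linked syntax page (the database, table, column and file names are assumptions; Hive support must be enabled on the session):

    // Hypothetical Hive-format table plus a CSV load, run through spark.sql:
    spark.sql("""
      CREATE TABLE IF NOT EXISTS my_db.my_table (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
    """)
    spark.sql("LOAD DATA INPATH '/data/my_dataset.csv' INTO TABLE my_db.my_table")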

Re: Increasing Spark history resources

2022-12-09 Thread Artemis User
If you didn't have performance issues before with the history server, it may not be a threading or RAM problem.  You may want to check on the disk space availability for the event logs... On 12/8/22 8:00 PM, Nikhil Goyal wrote: Hi folks, We are experiencing slowness in Spark history server,

Re: [PySpark] Join using condition where each record may be joined multiple times

2022-11-27 Thread Artemis User
What if you just do a join with the first condition (equal chromosome) and append a select with the rest of the conditions after join?  This will allow you to test your query step by step, maybe with a visual inspection to figure out what the problem is. It may be a data quality problem as

Re: Pyspark ML model Save Error

2022-11-16 Thread Artemis User
What problems did you encounter?  Most likely your problem is related to saving the model object in different partitions.  If that's the case, just apply the dataframe's coalesce(1) method before saving the model to a shared disk drive... On 11/16/22 1:51 AM, Vajiha Begum S A wrote: Hi,

Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Artemis User
. So, the question is how do I limit the stage resources to 20 GPUs total? Thanks again, Shay *From:* Artemis User *Sent:* Thursday, November 3, 2022 5:23 PM *To:* user@spark.apache.org *Subject:* [EXTERNAL] Re: Re: Stage

Re: [EXTERNAL] Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Artemis User
tracking for dynamic allocation, anyhow. The question is how we can limit the *number of executors *when building a new ResourceProfile, directly (API) or indirectly (some advanced workaround). Thanks, Shay *From:* Artemis

Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-02 Thread Artemis User
Are you using Rapids for GPU support in Spark?  Couple of options you may want to try: 1. In addition to dynamic allocation turned on, you may also need to turn on external shuffling service. 2. Sounds like you are using Kubernetes.  In that case, you may also need to turn on shuffle
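A hedged config sketch of options 1 and 2 (the values are illustrative; on Kubernetes the shuffle-tracking flag usually stands in for the external shuffle service):

    import org.apache.spark.sql.SparkSession

    // Hypothetical sketch: dynamic allocation with shuffle tracking enabled.
    val spark = SparkSession.builder()
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  // K8s-friendly substitute for the shuffle service
      .config("spark.dynamicAllocation.maxExecutors", "20")               // illustrative cap
      .getOrCreate()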

Re: How to find final status (Driver's) for an application

2022-10-28 Thread Artemis User
The master UI doesn't return much detail; it's not designed for this purpose.  You need to use the application-level/driver UI instead (on port 4040/4041...).  Please see the online monitoring and instrumentation doc for details (https://spark.apache.org/docs/latest/monitoring.html#rest-api). On

Re: Dynamic Scaling without Kubernetes

2022-10-26 Thread Artemis User
, Holden Karau wrote: So Spark can dynamically scale on YARN, but standalone mode becomes a bit complicated — where do you envision Spark gets the extra resources from? On Wed, Oct 26, 2022 at 12:18 PM Artemis User wrote: Has anyone tried to make a Spark cluster dynamically scalable, i.e

Dynamic Scaling without Kubernetes

2022-10-26 Thread Artemis User
Has anyone tried to make a Spark cluster dynamically scalable, i.e., adding a new worker node automatically to the cluster when no more executors are available upon a new job submitted?  We need to make the whole cluster on-prem and really lightweight, so standalone mode is preferred and no

Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Artemis User
Are these Cloudera-specific acronyms?  Not sure how Cloudera configures Spark differently, but obviously the number of nodes is too small, considering each app only uses a small number of cores and a small amount of RAM.  So you may consider increasing the number of nodes.  When all these apps jam on a few

Re: pyspark connect to spark thrift server port

2022-10-21 Thread Artemis User
to connect using pyspark. The port 9083 is open to anyone without any authentication feature. The only way pyspark is able to connect to Hive is through 9083 and not through port 1. On Friday, October 21, 2022 at 04:06:38 AM GMT+8, Artemis User wrote: By default, Spark uses Apache Derby (running

Re: pyspark connect to spark thrift server port

2022-10-20 Thread Artemis User
By default, Spark uses Apache Derby (running in embedded mode with store content defined in local files) for hosting the Hive metastore.  You can externalize the metastore on a JDBC-compliant database (e.g., PostgreSQL) and use the database authentication provided by the database.  The JDBC

Re: How to use neo4j cypher/opencypher to query spark RDD/graphdb

2022-10-16 Thread Artemis User
Spark doesn't offer a native graph database like Neo4j does since GraphX is still using the RDD tabular data structure.  Spark doesn't have a GQL or Cypher query engine either, but uses Google's Pregel API for graph processing.  Don't see any prospect that Spark is going to implement any types

Re: Apache Spark Operator for Kubernetes?

2022-10-14 Thread Artemis User
If you have the hardware resources, it isn't difficult to set up Spark in a kubernetes cluster.  The online doc describes everything you would need (https://spark.apache.org/docs/latest/running-on-kubernetes.html). You're right, both AWS EMR and Google's environment aren't flexible and not

Re: Efficiently updating running sums only on new data

2022-10-12 Thread Artemis User
Do you have to use SQL/window function for this? If I understand this correctly, you could just keep track of the last record of each "thing", then calculate the new sum by adding the current value of "thing" to the sum of last record when a new record is generated. Looks like your problem

Re: Reading too many files

2022-10-04 Thread Artemis User
Read by default can't be parallelized in a Spark job, and doing your own multi-threaded programming in a Spark program isn't a good idea.  Adding fast disk I/O and increasing RAM may speed things up, but won't help with parallelization.  You may have to be more creative here.  One option would

Re: Help with Shuffle Read performance

2022-09-30 Thread Artemis User
The reduce phase is always more resource-intensive than the map phase.  Couple of suggestions you may want to consider: 1. Setting the number of partitions to 18K may be way too high (the default number is only 200).  You may want to just use the default and the scheduler will
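A hedged sketch of dialling the partition count back to the default on a running session (the adaptive-execution lines are an addition beyond the original reply):

    // Assumes an existing SparkSession named spark.
    spark.conf.set("spark.sql.shuffle.partitions", "200")  // the default value
    // Optionally let adaptive query execution coalesce partitions at runtime (Spark 3.x):
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")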

Re: [SPARK STRUCTURED STREAMING] : Rocks DB uses off-heap usage

2022-09-12 Thread Artemis User
The off-heap memory isn't subject to GC.  So the obvious reason is that you have too many states to maintain in your streaming app, the GC couldn't keep up, and the app ends up with no resources left but to die. Are you using continuous processing or micro-batch in structured streaming?  You may want to

Re: Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread Artemis User
Not sure what you mean by offerts/offsets.  I assume you were using file-based instead of Kafka-based data sources.  Are the incoming data generated in mini-batch files or in a single large file?  Have you had this type of problem before? On 7/21/22 1:02 PM, KhajaAsmath Mohammed wrote:

Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread Artemis User
WAITFOR is part of Transact-SQL and is Microsoft SQL Server-specific, not supported by Spark SQL.  If you want to impose a delay in a Spark program, you may want to use the thread sleep function in Java or Scala.  Hope this helps... On 5/19/22 1:45 PM, K. N. Ramachandran wrote: Hi
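A one-line sketch of the suggested substitute in Scala (the 5-second value is arbitrary):

    // Pauses the driver thread for 5 seconds; executors are unaffected.
    Thread.sleep(5000)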

Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar jar used in spark.

2022-05-02 Thread Artemis User
What scanner did you use? Looks like all CVEs you listed for jackson-databind-xxx.jar are for older versions (2.9.10.x).  A quick search on NVD revealed that there is only one CVE (CVE-2020-36518) that affects your Spark versions.  This CVE (not on your scanned CVE list) is on jackson-databind

Re: how spark handle the abnormal values

2022-05-01 Thread Artemis User
Your test result just gave the verdict, so #2 is the answer - Spark ignores those non-numeric rows completely when aggregating the average. On 5/1/22 8:20 PM, wilson wrote: I did a small test as follows. scala> df.printSchema() root  |-- fruit: string (nullable = true)  |-- number: string
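A hedged sketch that makes the observed behaviour explicit by casting first (df and the column name come from the quoted test):

    import org.apache.spark.sql.functions.{avg, col}

    // Non-numeric strings become null after the cast, and avg() skips nulls,
    // which matches the result reported in the thread.
    df.select(avg(col("number").cast("double"))).show()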

Re: Dealing with large number of small files

2022-04-26 Thread Artemis User
Most likely your JSON files are not formatted correctly.  Please see the Spark doc on specific formatting requirement for JSON data. https://spark.apache.org/docs/latest/sql-data-sources-json.html. On 4/26/22 10:43 AM, Sid wrote: Hello, Can somebody help me with the below problem?

Problems with DataFrameReader in Structured Streaming

2022-04-13 Thread Artemis User
We have a single file directory that's being used by both the file generator/publisher and the Spark job consumer.  When using microbatch files in structured streaming, we encountered the following problems: 1. We would like to have a Spark streaming job consume only data files after a

Re: Continuous ML model training in stream mode

2022-03-18 Thread Artemis User
example in Spark. https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means On Tue, Mar 15, 2022, 3:46 PM Artemis User wrote: Has anyone done any experiments of training an ML model using stream data? especially for unsupervised models?   Any
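A hedged sketch of the streaming k-means referenced in the linked doc (trainingStream is an assumed DStream of mllib Vectors; the k, dimension and decay values are illustrative):

    import org.apache.spark.mllib.clustering.StreamingKMeans

    // Cluster centers are updated incrementally on every micro-batch that arrives.
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(dim = 5, weight = 0.0)
    model.trainOn(trainingStream)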

Re: Continuous ML model training in stream mode

2022-03-15 Thread Artemis User
://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means On Tue, Mar 15, 2022, 3:46 PM Artemis User wrote: Has anyone done any experiments of training an ML model using stream data? especially for unsupervised models?   Any suggestions/references are highly appreciated

Re: How Spark establishes connectivity to Hive

2022-03-15 Thread Artemis User
I guess it really depends on your configuration.  The Hive metastore provides just the metadata/schema for your database, not the actual data storage.  Hive runs on top of Hadoop. If you configure your Spark to run on the same Hadoop cluster using Yarn, your SQL dataframe in Spark

Continuous ML model training in stream mode

2022-03-15 Thread Artemis User
Has anyone done any experiments of training an ML model using stream data, especially for unsupervised models?  Any suggestions/references are highly appreciated...

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-11 Thread Artemis User
(and should not) know ahead of time which jobs will be executed, that's the job of the orchestration layer (and can be dynamic). I know I can specify multiple packages. Also not worried about memory. On Thu, 10 Mar 2022 at 13:54, Artemis User

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Artemis User
he orchestration layer (and can be dynamic). I know I can specify multiple packages. Also not worried about memory. On Thu, 10 Mar 2022 at 13:54, Artemis User wrote: If changing packages or jars isn't your concern, why not just specify ALL p

Re: Spark 3.1 with spark AVRO

2022-03-10 Thread Artemis User
It must be some misconfiguration in your environment.  Do you perhaps have a hardwired $SPARK_HOME env variable in your shell?  An easy test would be to place the spark-avro jar file you downloaded in the jars directory of Spark and run spark-shell again without the packages option.  This will

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-10 Thread Artemis User
karound, copy-pasting from the issue: ``` s: SparkSession = ... # Hard reset: s.stop() s._sc._gateway.shutdown() s._sc._gateway.proc.stdin.close() SparkContext._gateway = None SparkContext._jvm = None ``` Cheers - Rafal On 2022/03/09 15:39:58 Artemis User wrote: > This is indeed a JVM issue

Re: CPU usage from Event log

2022-03-09 Thread Artemis User
I am not sure what column/properties you are referring to.  But the event log in Spark deals with application-level "events", not JVM-level metrics.  To retrieve the JVM metrics, you need to use the REST API provided in Spark.  Please see https://spark.apache.org/docs/latest/monitoring.html

Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Artemis User
This is indeed a JVM issue, not a Spark issue.  You may want to ask yourself why it is necessary to change the jar packages during runtime.  Changing packages doesn't mean the classes get reloaded. There is no way to reload the same class unless you customize the classloader of Spark.  I also

Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread Artemis User
To be specific: 1. Check the log files on both master and worker and see if there are any errors. 2. If you are not running your browser on the same machine as the Spark cluster, please use the host's external IP instead of the localhost IP when launching the worker. Hope this helps... -- ND On 3/9/22

Non-Partition based Workload Distribution

2022-02-24 Thread Artemis User
We got a Spark program that iterates through a while loop on the same input DataFrame and produces different results per iteration. I see through the Spark UI that the workload is concentrated on a single core of the same worker.  Is there any way to distribute the workload to different

Re: Logging to determine why driver fails

2022-02-21 Thread Artemis User
Williams (SSI) wrote: Thank you. *From:* Artemis User [mailto:arte...@dtechspace.com] *Sent:* Monday, February 21, 2022 8:23 AM *To:* Michael Williams (SSI) *Subject:* Re: Logging to determine why driver fails Spark uses Log4j for logging.  There is a log4j properties template file located

Re: Logging to determine why driver fails

2022-02-21 Thread Artemis User
Spark uses log4j for logging.  There is a log4j properties template file in the conf directory.  Just remove the "template" extension and change the content of log4j.properties to meet your need.  More info on log4j can be found at logging.apache.org... On 2/21/22 9:15 AM, Michael Williams

Scala/Spark Kernel for Jupyter

2022-02-18 Thread Artemis User
Could someone recommend a Scala/Spark kernel for Jupyter/JupyterHub that supports the latest Spark version?  Thanks!

Re: Using Avro file format with SparkSQL

2022-02-17 Thread Artemis User
Please try these two corrections: 1. The --packages isn't the right command line argument for spark-submit.  Please use --conf spark.jars.packages=your-package to specify Maven packages or define your configuration parameters in the spark-defaults.conf file 2. Please check the version

Re: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Artemis User
There was a discussion on this issue couple of weeks ago.  Basically if you look at the CVE definition of Log4j, the vulnerability only affects certain versions of log4j 2.x, not 1.x.  Since Spark doesn't use any of the affected log4j versions, this shouldn't be a concern..

Re: JDBCConnectionProvider in Spark

2022-01-06 Thread Artemis User
such provider API is still needed? Is there any use cases for using the provider API instead of the dataframe reader/writer when dealing with JDBC?  Thanks! On 1/6/22 9:09 AM, Sean Owen wrote: There are 8 concrete implementations of it? OracleConnectionProvider, etc On Wed, Jan 5, 2022 at 9:26 PM Artemis

JDBCConnectionProvider in Spark

2022-01-05 Thread Artemis User
Could someone provide some insight/examples on the usage of this API? https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/jdbc/JdbcConnectionProvider.html Why is it needed since this is an abstract class and there isn't any concrete implementation of it?   Thanks a lot in

Re: pyspark

2022-01-05 Thread Artemis User
Did you install and configure the proper Spark kernel (SparkMagic) on your Jupyter Lab or Hub?  See https://github.com/jupyter/jupyter/wiki/Jupyter-kernels for more info... On 1/5/22 4:01 AM, 流年以东” wrote: In the process of using pyspark, there is no spark context when opening jupyter and

Re: Equivalent Function in ml for computeCost()

2021-11-29 Thread Artemis User
, pretty easily, in any event, either by just writing up a few lines of code or using the .mllib model inside the .ml model object anyway. On Mon, Nov 29, 2021 at 2:50 PM Artemis User wrote: The RDD-based org.apache.spark.mllib.clustering.KMeansModel class defines a method called

Equivalent Function in ml for computeCost()

2021-11-29 Thread Artemis User
The RDD-based org.apache.spark.mllib.clustering.KMeansModel class defines a method called computeCost that is used to calculate the WCSS error of K-Means clusters (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/clustering/KMeansModel.html). Is there an equivalent method
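A hedged sketch of the nearest DataFrame-API equivalents (trainingDF is an assumed DataFrame with a "features" column):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    val model = new KMeans().setK(3).fit(trainingDF)
    val wcss = model.summary.trainingCost          // WCSS on the training data
    val silhouette = new ClusteringEvaluator()     // alternative quality metric
      .evaluate(model.transform(trainingDF))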

Re: Spark for Image Processing Acceleration

2021-10-14 Thread Artemis User
Spark is good with SQL-type structured data, not image data, unless your algorithms don't require dealing with image data directly. I guess your best option would be to go with Tensorflow since it has image classification models built in and can integrate with NVidia GPUs out of the box.

Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Artemis User
Unfortunately the answer you got from the forum is true.  The current Spark-rapids package doesn't support RDD.  Please see https://nvidia.github.io/spark-rapids/docs/FAQ.html#what-parts-of-apache-spark-are-accelerated I guess to be able to use spark-rapids, one option you have would be to

Re: Processing Multiple Streams in a Single Job

2021-08-27 Thread Artemis User
mail's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Tue, 24 Aug 2021 at 23:37, Artemis User <mailto:arte...@dtechspace.com>> wrote: Is there a way to run multiple stre

Re: Processing Multiple Streams in a Single Job

2021-08-25 Thread Artemis User
API too. No, jobs can't communicate with each other. On Tue, Aug 24, 2021 at 9:51 PM Artemis User <mailto:arte...@dtechspace.com>> wrote: Thanks Daniel.  I guess you were suggesting using DStream/RDD.  Would it be possible to use structured streaming/DataFrames for mul

Re: Processing Multiple Streams in a Single Job

2021-08-24 Thread Artemis User
wrote: Yeah. Build up the streams as a collection and map that query to the start() invocation and map those results to awaitTermination() or whatever other blocking mechanism you’d like to use. On Tue, Aug 24, 2021 at 4:37 PM Artemis User <mailto:arte...@dtechspace.com>&
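A hedged sketch of the collection-of-queries suggestion above (writer1/writer2 are assumed DataStreamWriter definitions):

    // Start every query, then block until any one of them terminates.
    val queries = Seq(writer1, writer2).map(_.start())
    spark.streams.awaitAnyTermination()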

Processing Multiple Streams in a Single Job

2021-08-24 Thread Artemis User
Is there a way to run multiple streams in a single Spark job using Structured Streaming?  If not, is there an easy way to perform inter-job communications (e.g. referencing a dataframe among concurrent jobs) in Spark?  Thanks a lot in advance! -- ND

Re: Spark Thriftserver is failing for when submitting command from beeline

2021-08-20 Thread Artemis User
Looks like your problem is related to not setting up a hive-site.xml file properly.  The standard Spark distribution doesn't include a hive-site.xml template file in the conf directory.  You will have to create one yourself.  Please refer to the Spark user doc and Hive metastore config guide for

Re: spark-submit not running on macbook pro

2021-08-19 Thread Artemis User
Looks like PySpark can't initiate a JVM in the backend.  How did you set up Java and Spark on your machine?  Some suggestions that may help solve your issue: 1. Use OpenJDK instead of Apple JDK since Spark was developed using OpenJDK, not Apple's.  You can use homebrew to install OpenJDK (I

Re: error: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available

2021-08-07 Thread Artemis User
Without seeing the code and the whole stack trace, just a wild guess: did you set the config param for enabling Arrow (spark.sql.execution.arrow.pyspark.enabled)?  If not in your code, you would have to set it in spark-defaults.conf.  Please note that the parameter

Re: Convert timestamp to unix miliseconds

2021-08-07 Thread Artemis User
Apparently you were not using the right formatting string.  For sub-second formatting, use capital S instead of lower case s.  See Spark's doc at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. Hope this helps... -- ND On 8/4/21 4:42 PM, Tzahi File wrote: Hi All, I'm
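A hedged sketch of the capital-S pattern in use (df and the event_ts column are assumptions):

    import org.apache.spark.sql.functions.{col, date_format}

    // "S" is the fraction-of-second field; lower-case "s" is whole seconds.
    df.select(date_format(col("event_ts"), "yyyy-MM-dd HH:mm:ss.SSS").as("ts_with_millis"))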

Re: How can transform RDD[Seq[String]] to RDD[ROW]

2021-08-05 Thread Artemis User
I am not sure why you need to create an RDD first.  You can create a data frame directly from a csv file, for instance: spark.read.format("csv").option("header","true").schema(yourSchema).load(ftpUrl) -- ND On 8/5/21 3:14 AM, igyu wrote: val ftpUrl

Re: Reading the last line of each file in a set of text files

2021-08-03 Thread Artemis User
Assuming you are running Linux, an easy option would be just to use the Linux tail command to extract the last line (or last couple of lines) of a file and save them to a different file/directory, before feeding it to Spark.  It shouldn't be hard to write a shell script that executes tail on

Re: Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-30 Thread Artemis User
and supported by companies towards whom you are being so unkind Regards, Gourav Sengupta On Fri, Jul 30, 2021 at 4:02 PM Artemis User <mailto:arte...@dtechspace.com>> wrote: Thanks Gourav for the info.  Actually I am looking for concrete experiences and detailed best practices from p

Re: Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-30 Thread Artemis User
, but surely Ray also has to win as well and nothing better than to ride on the success of SPARK. But I may be wrong, and SPARK community may still be developing those integrations. Regards, Gourav Sengupta On Fri, Jul 30, 2021 at 2:46 AM Artemis User <mailto:arte...@dtechspace.com>&

Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-29 Thread Artemis User
Has anyone had any experience with running Spark-Rapids on a GPU-powered cluster (https://github.com/NVIDIA/spark-rapids)?  I am very interested in knowing: 1. What is the hardware/software platform and the type of Spark cluster you are using to run Spark-Rapids? 2. How easy was the

Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Artemis User
PySpark still uses the Spark dataframe underneath (it wraps Java code). Use PySpark when you have to deal with big data ETL and analytics so you can leverage the distributed architecture in Spark.  If your job is simple, the dataset is relatively small, and it doesn't require distributed processing, use

Re: Connection Reset by Peer : failed to remove cached rdd

2021-07-29 Thread Artemis User
Can you please post the error log/exception messages?  There is not enough info to help diagnose what the real problem is On 7/29/21 8:55 AM, Big data developer need help relat to spark gateway roles in 2.0 wrote: Hi Team , We are facing issue in production where we are getting frequent

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-19 Thread Artemis User
As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a Java client for instance) to access the Hive tables in Spark via the Thrift server interface. -- ND On 7/19/21 2:42 AM, Badrinath Patchikolla
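A minimal sketch of the saveAsTable route mentioned above (the database/table names are assumptions):

    // Writes the DataFrame straight into the metastore-managed table.
    df.write.mode("append").saveAsTable("my_db.my_table")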

Performance Improvement with Hive/Thrift Server

2021-07-12 Thread Artemis User
We are trying to switch from Postgres to the Spark's built-in Hive with Thrift server as the data sink to persist the ML result data, with the hope that Hive would improve the ML pipeline performance. However, it turned out that it took significantly longer for Hive to persist dataframes (via

Re: Issue with Running Spark in Jupyter Notebook

2021-06-24 Thread Artemis User
Looks like you didn't set up your environment properly.  I assume you are running this from a standalone python program instead of from the pyspark shell.  I would first run your code from the pyspark shell, then follow the spark python installation guide to set up your python environment

Re: Performance Problems Migrating to S3A Committers

2021-06-23 Thread Artemis User
Thanks Johnny for sharing your experience.  Have you tried to use S3A committer?  Looks like this one is introduced in the latest Hadoop for solving problems with other committers. https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html - ND On 6/22/21 6:41 PM,

Problem in Restoring ML Pipeline with UDF

2021-06-08 Thread Artemis User
We have a feature engineering transformer defined as a custom class with UDF as follows: class FeatureModder extends Transformer with DefaultParamsWritable with DefaultParamsReadable[FeatureModder] {     val uid: String = "FeatureModder"+randomUUID     final val inputCol: Param[String] = new

Re: Exception on Avro Schema Object Serialization

2021-02-02 Thread Artemis User
it to ensure that it isn't used in the function. On Tue, Feb 2, 2021 at 2:32 PM Artemis User <mailto:arte...@dtechspace.com>> wrote: We tried to standardize the SQL data source management using the Avro schema, but encountered some serialization exceptions when trying to use

Exception on Avro Schema Object Serialization

2021-02-02 Thread Artemis User
We tried to standardize the SQL data source management using the Avro schema, but encountered some serialization exceptions when trying to use the data.  The interesting part is that we didn't have any problems in reading the Avro schema JSON file and converting the Avro schema into a SQL

Persisting Customized Transformer

2021-01-19 Thread Artemis User
We are trying to create a customized transformer for a ML pipeline and also want to persist the trained pipeline and retrieve it for production.  To enable persistency, we will have to implement read/write functions.  However, this is not feasible in Scala since the read/write methods are

Customizing K-Means for Anomaly Detection

2021-01-12 Thread Artemis User
First some background: * We want to use the k-means model for anomaly detection against a multi-dimensional dataset.  The current k-means implementation in Spark is designed for clustering purposes, not exactly for anomaly detection.  Once a model is trained and pipeline is
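A hedged sketch of one common way to repurpose a fitted ml KMeansModel for anomaly detection (model, dataDF and threshold are assumptions, not part of the original post):

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.{col, udf}

    // Score each point by its distance to the assigned cluster centre,
    // then flag points beyond a chosen threshold as anomalies.
    val centers = model.clusterCenters
    val distToCenter = udf((features: Vector, cluster: Int) =>
      math.sqrt(Vectors.sqdist(features, centers(cluster))))
    val scored = model.transform(dataDF)
      .withColumn("dist", distToCenter(col("features"), col("prediction")))
    val anomalies = scored.filter(col("dist") > threshold)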

Re: Use case advice

2021-01-09 Thread Artemis User
Could you please clarify what you mean by 1)? The driver is only responsible for submitting the Spark job, not performing it. -- ND On 1/9/21 9:35 AM, András Kolbert wrote: Hi, I would like to get your advice on my use case. I have a few spark streaming applications where I need to keep updating a

Re: Integrating multiple streaming sources

2020-12-22 Thread Artemis User
Hmm, looks like Spark 2.3+ does support stream-to-stream join. But the online doc doesn't provide any examples.  If anyone could provide some concrete reference, I'd really appreciate it.  Thanks! -- ND On 12/22/20 9:57 AM, Artemis User wrote: Is there any way to integrate/fuse multiple streaming

Integrating multiple streaming sources

2020-12-22 Thread Artemis User
Is there any way to integrate/fuse multiple streaming sources into a single stream process?  In other words, the current structured streaming API dictates a single streaming source and sink.  We'd like to have a stream process that interfaces with multiple stream sources, performs a join and

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Artemis User
Wheel is used for package management and setting up your virtual environment, not used as a library package.  To run spark-submit in a virtual env, use the --py-files option instead.  Usage: --py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the

Re: mysql connector java issue

2020-12-11 Thread Artemis User
. Artemis User mailto:arte...@dtechspace.com>> wrote on Friday, December 11, 2020 at 5:21 AM: What happened was that you made the mysql jar file only available to the spark driver, not the executors.  Use the --jars parameter instead of driver-class-path to specify your third-party jar files, o

Re: mysql connector java issue

2020-12-10 Thread Artemis User
What happened was that you made the mysql jar file only available to the spark driver, not the executors.  Use the --jars parameter instead of driver-class-path to specify your third-party jar files, or copy the third-party jar files to the jars directory for Spark in your HDFS, and specify

How to Spawn Child Thread or Sub-jobs in a Spark Session

2020-12-04 Thread Artemis User
We have a Spark job that produces a result data frame, say DF-1, at the end of the pipeline (i.e. Proc-1).  From DF-1, we need to create two or more dataframes, say DF-2 and DF-3, via additional SQL or ML processes, i.e. Proc-2 and Proc-3.  Ideally, we would like to perform Proc-2 and Proc-3 in
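A hedged sketch of running Proc-2 and Proc-3 concurrently against a cached DF-1 with Scala futures (proc2/proc3 are assumed functions standing in for the SQL/ML steps):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    df1.cache()                          // avoid recomputing DF-1 for each branch
    val f2 = Future { proc2(df1) }       // e.g. the SQL step producing DF-2
    val f3 = Future { proc3(df1) }       // e.g. the ML step producing DF-3
    Await.result(Future.sequence(Seq(f2, f3)), Duration.Inf)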

Re: In windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-03 Thread Artemis User
ontent is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 2 Dec 2020 at 23:11, Artemis User <mailto:arte...@dtechspace.com>> wrote: Apparently this is a OS dynamic lib link error.  Make sure you

Re: In windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-02 Thread Artemis User
Apparently this is an OS dynamic lib link error.  Make sure you have the LD_LIBRARY_PATH (on Linux) or PATH (on Windows) set up properly for the right .so or .dll file... On 12/2/20 5:31 PM, Mich Talebzadeh wrote: Hi, I have a simple code that tries to create a Hive Derby database as follows:

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Artemis User
of these files   in executors can be accessed via SparkFiles.get(fileName). -- ND On 11/25/20 9:51 PM, Artemis User wrote: This is a typical file sharing problem in Spark.  Just setting up HDFS won't solve the problem unless you make your local machine as part
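A hedged sketch of the SparkFiles route mentioned above (the local path is an assumption):

    import org.apache.spark.SparkFiles

    // Ship the file with the job, then resolve its location on the executors.
    spark.sparkContext.addFile("/local/path/data.csv")
    val pathOnExecutors = SparkFiles.get("data.csv")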

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Artemis User
This is a typical file sharing problem in Spark.  Just setting up HDFS won't solve the problem unless you make your local machine as part of the cluster.  Spark server doesn't share files with your local machine without mounting drives to each other.  The best/easiest way to share the data

Re: Submitting extra jars on spark applications on yarn with cluster mode

2020-11-14 Thread Artemis User
I guess I misread your message.  The archive directory shall contain only jar files, not tar.gz files... On 11/14/20 10:11 AM, Artemis User wrote: Assuming you were using hadoop for your yarn cluster.  You can specify the spark parameters spark.yarn.archive or spark.yarn.jars to contain

Re: Submitting extra jars on spark applications on yarn with cluster mode

2020-11-14 Thread Artemis User
Assuming you were using hadoop for your yarn cluster.  You can specify the spark parameters spark.yarn.archive or spark.yarn.jars to contain the jar directory or jar files so that hadoop can find them by default.  See Spark online doc for details

Re: Debugging tools for Spark Structured Streaming

2020-10-30 Thread Artemis User
Spark distributes loads to executors, and the executors are usually pre-configured with the number of cores.  You may want to check with your Spark admin on how many executors (or slaves) your Spark cluster is configured with and how many cores are pre-configured for executors.  The debugging

Re: Apache Spark Connector for SQL Server and Azure SQL

2020-10-26 Thread Artemis User
The best option certainly would be to recompile the Spark Connector for MS SQL server using the Spark 3.0.1/Scala 2.12 dependencies, and just fix the compiler errors as you go. The code is open source on github (https://github.com/microsoft/sql-spark-connector).  Looks like this connector is

Re: Spark hive build and connectivity

2020-10-22 Thread Artemis User
By default Spark will build with Hive 2.3.7, according to the Spark build doc.  If you want to replace it with a different hive jar, you need to change the Maven pom.xml file. -- ND On 10/22/20 11:35 AM, Ravi Shankar wrote: Hello all, I am trying to understand how the Spark SQL integration

Client APIs for Accessing Spark Data Frames Directly

2020-10-21 Thread Artemis User
Is there any way to access the DataFrame content directly/interactively via some client access APIs?  Some background info: 1. We have a Java client application that uses the spark launcher to submit a spark job to a spark master. 2. The default spark launcher API has only a handle API that

Re: Spark Streaming Job is stucked

2020-10-18 Thread Artemis User
If it was running fine before and stopped working now, one thing I could think of may be that your disk was full.  Check your disk space; cleaning up your old log files might help... On 10/18/20 12:06 PM, rajat kumar wrote: Hello Everyone, My spark streaming job is running too slow, it is having

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
know if it helps. On Fri, Oct 16, 2020 at 10:37 AM Artemis User <mailto:arte...@dtechspace.com>> wrote: Thank you all for the responses.  Basically we were dealing with file source (not Kafka, therefore no topics involved) and dumping csv files (about 1000 lines, 300KB

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
in one node, and then run ML transformations in parallel *From: *Artemis User *Date: *Friday, October 16, 2020 at 3:52 PM *To: *"user@spark.apache.org" *Subject: *RE: [EXTERNAL] How to Scale Streaming Application to Multiple Workers *CAUTION*: This email originated fr

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
solutions. *From: *Artemis User *Date: *Friday, October 16, 2020 at 2:19 PM *Cc: *user *Subject: *RE: [EXTERNAL] How to Scale Streaming Application to Multiple Workers *CAUTION*: This email originated from outside of the organization. Do not click links or open attachments unless you can

Re: How to Scale Streaming Application to Multiple Workers

2020-10-16 Thread Artemis User
damages arising from such loss, damage or destruction. On Thu, 15 Oct 2020 at 20:02, Artemis User mailto:arte...@dtechspace.com>> wrote: Thanks for the input. What I am interested is how to have multiple workers to read and process the small files in parallel, and

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread Artemis User
wrote: Parallelism of streaming depends on the input source. If you are getting one small file per microbatch, then Spark will read it in one worker. You can always repartition your data frame after reading it to increase the parallelism. On 10/14/20, 11:26 PM, "Artemis User" wrote:
