Sara, Apache Spark is open source under Apache License 2.0
(https://github.com/apache/spark/blob/master/LICENSE). It is not under
export control of any country! Please feel free to use, reproduce, and
distribute it, as long as your use complies with the license.
Having said that, some c
On Wed, Jul 12, 2023, 6:00 PM Artemis User wrote:
The error screenshot doesn't tell much. Maybe your job wasn't
submitted properly. Make sure your IP/port numbers were defined
correctly. Take a look at the Spark server UI to see what errors occur.
On 7/12/23
Looks like the Maven build did find javac but just can't run it. So it's
not a path problem but a compatibility problem. Are you doing this on a
Mac with M1/M2? I don't think that Zulu JDK supports Apple silicon.
Your best option would be to use homebrew to install the dev tools
(including Op
Not sure where you got the property name "spark.memory.offHeap.use". The
correct one should be "spark.memory.offHeap.enabled". See
https://spark.apache.org/docs/latest/configuration.html#spark-properties
for details.
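For example, a minimal sketch (the 2g size is a placeholder; note that
spark.memory.offHeap.size must be set to a positive value whenever off-heap is enabled):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-example")
  .config("spark.memory.offHeap.enabled", "true")  // correct property name
  .config("spark.memory.offHeap.size", "2g")       // required when enabled
  .getOrCreate()
```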
On 1/30/23 10:12 AM, Jain, Sanchi wrote:
I am not sure if this is the inte
Try this one: "select country, city, max(population) from your_table
group by country"
Please note this returns a table of three columns instead of two. This
is a standard SQL query, and it is supported by Spark as well.
On 12/20/22 3:35 PM, Oliver Ruebenacker wrote:
Hello,
Let's say th
Your DDL statement doesn't look right. You may want to check the Spark
SQL reference online for how to create a table in Hive format
(https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-hiveformat.html).
You should be able to populate the table directly using CREATE by
providing
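A hedged sketch of the kind of statement the linked reference describes (the table
and column names are made up, and `spark` is assumed to be an active SparkSession):
```
// Create a Hive-format table and populate it in one CREATE ... AS SELECT statement
spark.sql("""
  CREATE TABLE student_copy
  STORED AS PARQUET
  AS SELECT id, name FROM student_staging
""")
```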
If you didn't have performance issues before with the history server, it
may not be a threading or RAM problem. You may want to check on the
disk space availability for the event logs...
On 12/8/22 8:00 PM, Nikhil Goyal wrote:
Hi folks,
We are experiencing slowness in Spark history server, he
What if you just do a join with the first condition (equal chromosome)
and append a select with the rest of the conditions after the join? This
will allow you to test your query step by step, maybe with a visual
inspection to figure out what the problem is. It may be a data quality
problem as well.
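A minimal sketch of that approach (the DataFrame and column names are
placeholders, not taken from the original query):
```
// Step 1: join on the equality condition only
val joined = left.join(right, left("chromosome") === right("chromosome"))
joined.show(20)  // visually inspect the intermediate result

// Step 2: apply the remaining conditions as a filter on the joined result
val result = joined
  .filter(left("start") <= right("end") && right("start") <= left("end"))
result.show(20)
```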
What problems did you encounter? Most likely your problem may be
related to saving the model object in different partitions. If that's the
case, just apply the dataframe's coalesce(1) method before saving the
model to a shared disk drive...
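A hedged sketch of that suggestion (the DataFrame name and output path are placeholders):
```
// Collapse the result to a single partition so only one output file is written
modelDF.coalesce(1)
  .write
  .mode("overwrite")
  .parquet("/mnt/shared/models/my_model")
```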
On 11/16/22 1:51 AM, Vajiha Begum S A wrote:
Hi,
Thi
1 GPU
per executor.
So, the question is how do I limit the stage resources to 20 GPUs total?
Thanks again,
Shay
--------
*From:* Artemis User
*Sent:* Thursday, November 3, 2022 5:23 PM
*To:* user@spark.apache.org
*Subject:* [EXT
----
*From:* Artemis User
*Sent:* Thursday, November 3, 2022 1:16 AM
*To:* user@spark.apache.org
*Subject:* [EXTERNAL] Re: Stage level scheduling - lower the number of
executors when using GPUs
Are you using Rapids for GPU support in Spark? A couple of options you
may want to try (a config sketch follows the list):
1. In addition to turning on dynamic allocation, you may also need to
turn on the external shuffle service.
2. Sounds like you are using Kubernetes. In that case, you may also
need to turn on shuffle track
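A hedged sketch of the settings mentioned above, applied at session creation
(which of the two shuffle options you need depends on the cluster manager; values are placeholders):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")                    // external shuffle service
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  // Kubernetes-friendly alternative
  .getOrCreate()
```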
The master UI doesn't return many details; it's not designed for this
purpose. You need to use the application-level/driver UI instead (on
port 4040/4041...). Please see the online monitoring and
instrumentation doc for details
(https://spark.apache.org/docs/latest/monitoring.html#rest-api).
On 10/26/22 3:20 PM, Holden Karau wrote:
So Spark can dynamically scale on YARN, but standalone mode becomes a
bit complicated — where do you envision Spark gets the extra resources
from?
On Wed, Oct 26, 2022 at 12:18 PM Artemis User
wrote:
Has anyone tried to make a Spark cluster dynamically scalable, i.e.,
adding a new worker node automatically to the cluster when no more
executors are available for a newly submitted job? We need to keep the
whole cluster on-prem and really lightweight, so standalone mode is
preferred and no k8s
Are these Cloudera-specific acronyms? Not sure how Cloudera configures
Spark differently, but obviously the number of nodes is too small,
considering each app only uses a small number of cores and RAM. So you
may consider increasing the number of nodes. When all these apps jam on
a few nodes,
anyone to connect using
pyspark. The port 9083 is open to anyone, without any authentication.
The only way pyspark is able to connect to Hive is through 9083
and not through port 1.
On Friday, October 21, 2022 at 04:06:38 AM GMT+8, Artemis User
wrote:
By default, Spark uses Apache Derby (running in embedded mode with store
content defined in local files) for hosting the Hive metastore. You can
externalize the metastore to a JDBC-compliant database (e.g.,
PostgreSQL) and use the authentication provided by the
database. The JDBC con
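A hedged sketch of the externalized-metastore setup (host, database, and credentials
are placeholders; these javax.jdo.* settings normally live in hive-site.xml, and the
spark.hadoop. prefix is one way to pass them through to the Hadoop/Hive configuration):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
          "jdbc:postgresql://db-host:5432/metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "*****")
  .getOrCreate()
```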
Spark doesn't offer a native graph database like Neo4j does since GraphX
is still using the RDD tabular data structure. Spark doesn't have a GQL
or Cypher query engine either, but uses Google's Pregel API for graph
processing. Don't see any prospect that Spark is going to implement any
types
If you have the hardware resources, it isn't difficult to set up Spark
in a Kubernetes cluster. The online doc describes everything you would
need (https://spark.apache.org/docs/latest/running-on-kubernetes.html).
You're right, both AWS EMR and Google's environment aren't flexible and
not che
Do you have to use a SQL window function for this? If I understand this
correctly, you could just keep track of the last record of each "thing",
then calculate the new sum by adding the current value of "thing" to the
sum of the last record when a new record is generated. Looks like your
problem will
Reads by default can't be parallelized in a Spark job, and doing your own
multi-threaded programming in a Spark program isn't a good idea. Adding
fast disk I/O and increasing RAM may speed things up, but won't help with
parallelization. You may have to be more creative here. One option
would be,
The reduce phase is always more resource-intensive than the map phase.
A couple of suggestions you may want to consider (a config sketch follows):
1. Setting the number of partitions to 18K may be way too high (the
default number is only 200). You may want to just use the default
and the scheduler will automaticall
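A minimal sketch of suggestion 1 (the exact value depends on data size and cluster resources):
```
// Dial the shuffle partition count back toward the default instead of 18K
spark.conf.set("spark.sql.shuffle.partitions", "200")
```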
The off-heap memory isn't subject to GC. So the obvious reason is
that you have too many states to maintain in your streaming app, the
GC couldn't keep up, and the job ended up running out of resources and dying. Are you
using continuous processing or microbatch in structured streaming? You
may want to lo
Not sure what you mean by offerts/offsets. I assume you were using
file-based instead of Kafka-based data sources. Is the incoming
data generated in mini-batch files or in a single large file? Have you
had this type of problem before?
On 7/21/22 1:02 PM, KhajaAsmath Mohammed wrote:
Hi,
WAITFOR is part of Transact-SQL and is Microsoft SQL Server
specific; it's not supported by Spark SQL. If you want to impose a delay in
a Spark program, you may want to use the thread sleep function in Java
or Scala. Hope this helps...
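A minimal sketch of that suggestion in Scala (the duration is a placeholder):
```
// Pause the current thread for 5 seconds before continuing
Thread.sleep(5000)  // milliseconds
```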
On 5/19/22 1:45 PM, K. N. Ramachandran wrote:
Hi Sean,
What scanner did you use? Looks like all CVEs you listed for
jackson-databind-xxx.jar are for older versions (2.9.10.x). A quick
search on NVD revealed that there is only one CVE (CVE-2020-36518) that
affects your Spark versions. This CVE (not on your scanned CVE list) is
on jackson-databind
Your test result just gave the verdict, so #2 is the answer: Spark
ignores those non-numeric rows completely when aggregating the average.
On 5/1/22 8:20 PM, wilson wrote:
I did a small test as follows.
scala> df.printSchema()
root
|-- fruit: string (nullable = true)
|-- number: string (null
Most likely your JSON files are not formatted correctly. Please see the
Spark doc on the specific formatting requirements for JSON data.
https://spark.apache.org/docs/latest/sql-data-sources-json.html.
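A hedged sketch (the path is a placeholder): Spark expects one JSON object per
line by default, so a pretty-printed or multi-record file needs the multiLine option:
```
val df = spark.read
  .option("multiLine", "true")
  .json("/path/to/data.json")
```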
On 4/26/22 10:43 AM, Sid wrote:
Hello,
Can somebody help me with the below problem?
https://st
We have a single file directory that's being used by both the file
generator/publisher and the Spark job consumer. When using microbatch
files in structured streaming, we encountered the following problems:
1. We would like to have a Spark streaming job consume only data files
after a prede
example in Spark.
https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
On Tue, Mar 15, 2022, 3:46 PM Artemis User wrote:
Has anyone done any experiments of training an ML model using stream
data? especially for unsupervised models? Any suggestions/references
are highly appreciated
I guess it really depends on your configuration. The Hive metastore
provides just the metadata/schema for your database, not the actual
data storage. Hive runs on top of Hadoop. If you configure your
Spark to run on the same Hadoop cluster using Yarn, your SQL dataframe
in Spark
Has anyone done any experiments of training an ML model using stream
data? especially for unsupervised models? Any suggestions/references
are highly appreciated...
On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła
wrote:
Because I can't (and should not) know ahead of time which jobs
will be executed, that's the job of the orchestration layer
(and can be dynamic). I know I can specify multiple packages.
Also not worried about memory.
On Thu, 10 Mar 2022 at 13:54, Artemis User
wrote:
If changing packages or jars isn'
It must be some misconfiguration in your environment. Do you perhaps
have a hardwired $SPARK_HOME env variable in your shell? An easy test
would be to place the spark-avro jar file you downloaded in the jars
directory of Spark and run spark-shell again without the packages
option. This will
the "hard-reset" workaround,
copy-pasting from the issue:
```
s: SparkSession = ...
# Hard reset:
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
```
Cheers - Rafal
On 2022/03/09 15:39:58 Artemis User wrote:
I am not sure what column/properties you are referring to. But the
event log in Spark deals with application-level "events", not JVM-level
metrics. To retrieve the JVM metrics, you need to use the REST API
provided in Spark. Please see
https://spark.apache.org/docs/latest/monitoring.html for
This is indeed a JVM issue, not a Spark issue. You may want to ask
yourself why it is necessary to change the jar packages during runtime.
Changing packages doesn't mean the classes get reloaded. There is no way to
reload the same class unless you customize the classloader of Spark. I
also don't
To be specific:
1. Check the log files on both the master and the worker and see if there are any errors.
2. If you are not running your browser on the same machine as the
Spark cluster, please use the host's external IP instead of the
localhost IP when launching the worker
Hope this helps...
-- ND
On 3/9/22
We got a Spark program that iterates through a while loop on the same
input DataFrame and produces different results per iteration. I see
through Spark UI that the workload is concentrated on a single core of
the same worker. Is there any way to distribute the workload to
different cores/worker
/22 9:37 AM, Michael Williams (SSI) wrote:
Thank you.
*From:* Artemis User [mailto:arte...@dtechspace.com]
*Sent:* Monday, February 21, 2022 8:23 AM
*To:* Michael Williams (SSI)
*Subject:* Re: Logging to determine why driver fails
Spark uses log4j for logging. There is a log4j properties template file
in the conf directory. Just remove the "template" extension and change
the content of log4j.properties to meet your needs. More info on log4j
can be found at logging.apache.org...
On 2/21/22 9:15 AM, Michael Williams (SS
Could someone recommend a Scala/Spark kernel for Jupyter/JupyterHub that
supports the latest Spark version? Thanks!
Please try these two corrections:
1. The --packages isn't the right command line argument for
spark-submit. Please use --conf spark.jars.packages=your-package to
specify Maven packages or define your configuration parameters in
the spark-defaults.conf file
2. Please check the version nu
There was a discussion on this issue a couple of weeks ago. Basically, if
you look at the CVE definition for Log4j, the vulnerability only affects
certain versions of log4j 2.x, not 1.x. Since Spark doesn't use any of
the affected log4j versions, this shouldn't be a concern.
https://lists.apach
provider API is still needed? Are there any
use cases for using the provider API instead of the dataframe
reader/writer when dealing with JDBC? Thanks!
On 1/6/22 9:09 AM, Sean Owen wrote:
There are 8 concrete implementations of it? OracleConnectionProvider, etc
On Wed, Jan 5, 2022 at 9:26 PM Artemis
Could someone provide some insight/examples on the usage of this API?
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/jdbc/JdbcConnectionProvider.html
Why is it needed since this is an abstract class and there isn't any
concrete implementation of it? Thanks a lot in advanc
Did you install and configure the proper Spark kernel (SparkMagic) on
your Jupyter Lab or Hub? See
https://github.com/jupyter/jupyter/wiki/Jupyter-kernels for more info...
On 1/5/22 4:01 AM, 流年以东” wrote:
In the process of using pyspark, there is no spark context when opening
jupyter and inp
y?
You can compute it directly, pretty easily, in any event, either by
just writing up a few lines of code or using the .mllib model inside
the .ml model object anyway.
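A hedged sketch of the "few lines of code" option (the fitted model, the input
DataFrame, and the default "features"/"prediction" column names are all assumptions):
```
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Within-cluster sum of squares: squared distance from each point to its assigned center
val centers = model.clusterCenters
val wcss = model.transform(data)          // adds the "prediction" column
  .select("features", "prediction")
  .rdd
  .map { row =>
    val point  = row.getAs[Vector]("features")
    val center = centers(row.getAs[Int]("prediction"))
    Vectors.sqdist(point, center)
  }
  .sum()
```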
On Mon, Nov 29, 2021 at 2:50 PM Artemis User
wrote:
The RDD-based org.apache.spark.mllib.clustering.KMeansModel class
defines a method called computeCost that is used to calculate the WCSS
error of K-Means clusters
(https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/clustering/KMeansModel.html).
Is there an equivalent method o
Spark is good with SQL-type structured data, not image data, unless
your algorithms don't require dealing with image data directly. I guess
your best option would be to go with TensorFlow since it has image
classification models built in and can integrate with NVIDIA GPUs out of
the box. Th
Unfortunately the answer you got from the forum is true. The current
Spark-rapids package doesn't support RDD. Please see
https://nvidia.github.io/spark-rapids/docs/FAQ.html#what-parts-of-apache-spark-are-accelerated
I guess to be able to use spark-rapids, one option you have would be to
con
On Tue, 24 Aug 2021 at 23:37, Artemis User <arte...@dtechspace.com> wrote:
Is th
Frame API too.
No, jobs can't communicate with each other.
On Tue, Aug 24, 2021 at 9:51 PM Artemis User <arte...@dtechspace.com> wrote:
Thanks Daniel. I guess you were suggesting using DStream/RDD.
Would it be possible to use structured streaming/DataFrames f
wrote:
Yeah. Build up the streams as a collection and map that query to the
start() invocation and map those results to awaitTermination() or
whatever other blocking mechanism you’d like to use.
On Tue, Aug 24, 2021 at 4:37 PM Artemis User <arte...@dtechspace.com> wrote:
I
Is there a way to run multiple streams in a single Spark job using
Structured Streaming? If not, is there an easy way to perform inter-job
communications (e.g. referencing a dataframe among concurrent jobs) in
Spark? Thanks a lot in advance!
-- ND
---
Looks like your problem is related to not setting up a hive-site.xml file
properly. The standard Spark distribution doesn't include a hive-site.xml
template file in the conf directory. You will have to create one by
yourself. Please refer to the Spark user doc and Hive metastore config
guide for detai
Looks like PySpark can't initiate a JVM in the backend. How did you set
up Java and Spark on your machine? Some suggestions that may help solve
your issue:
1. Use OpenJDK instead of Apple JDK since Spark was developed using
OpenJDK, not Apple's. You can use homebrew to install OpenJDK (I
Without seeing the code and the whole stack trace, just a wild guess: did
you set the config param for enabling Arrow
(spark.sql.execution.arrow.pyspark.enabled)? If it's not in your code, you
would have to set it in spark-defaults.conf. Please note that the
parameter spark.sql.execution.arrow.e
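A minimal sketch of setting that flag at runtime (it can equally go into spark-defaults.conf):
```
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```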
Apparently you were not using the right formatting string. For
sub-second formatting, use capital S instead of lowercase s. See
Spark's doc at
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. Hope
this helps...
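For example, a minimal sketch (the DataFrame and column names are placeholders):
```
import org.apache.spark.sql.functions.{col, date_format}

// Capital S is the fractional-second field; lowercase s is whole seconds
val formatted = df.withColumn(
  "ts_str",
  date_format(col("event_time"), "yyyy-MM-dd HH:mm:ss.SSS"))
```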
-- ND
On 8/4/21 4:42 PM, Tzahi File wrote:
Hi All,
I'm us
I am not sure why you need to create an RDD first. You can create a
data frame directly from a csv file, for instance:
spark.read.format("csv").option("header","true").schema(yourSchema).load(ftpUrl)
-- ND
On 8/5/21 3:14 AM, igyu wrote:
val ftpUrl ="ftp://test:test@ip:21/upload/test/_temporary/
Assuming you are running Linux, an easy option would be just to use the
Linux tail command to extract the last line (or last couple of lines) of
a file and save them to a different file/directory, before feeding it to
Spark. It shouldn't be hard to write a shell script that executes tail
on al
ell who are paid and supported by companies
towards whom you are being so unkind
Regards,
Gourav Sengupta
On Fri, Jul 30, 2021 at 4:02 PM Artemis User <arte...@dtechspace.com> wrote:
Thanks Gourav for the info. Actually I am looking for concrete
experiences and detailed b
community, but surely Ray also has to win as well and nothing better
than to ride on the success of SPARK. But I may be wrong, and SPARK
community may still be developing those integrations.
Regards,
Gourav Sengupta
On Fri, Jul 30, 2021 at 2:46 AM Artemis User
Has anyone had any experience with running Spark-Rapids on a GPU-powered
cluster (https://github.com/NVIDIA/spark-rapids)? I am very interested
in knowing:
1. What is the hardware/software platform and the type of Spark cluster
you are using to run Spark-Rapids?
2. How easy was the installa
PySpark still uses Spark dataframe underneath (it wraps java code). Use
PySpark when you have to deal with big data ETL and analytics so you can
leverage the distributed architecture in Spark. If your job is simple, the
dataset is relatively small, and doesn't require distributed processing,
use Pa
Can you please post the error log/exception messages? There is not
enough info to help diagnose what the real problem is.
On 7/29/21 8:55 AM, Big data developer need help relat to spark gateway
roles in 2.0 wrote:
Hi Team ,
We are facing issue in production where we are getting frequent
As Mich mentioned, there is no need to use the jdbc API; using the DataFrameWriter's
saveAsTable method is the way to go. The JDBC driver is for a JDBC client
(a Java client, for instance) to access the Hive tables in Spark via the
Thrift server interface.
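A hedged sketch of that approach (the DataFrame and table names are placeholders):
```
// Write the result DataFrame straight into a Hive table via DataFrameWriter
resultDF.write
  .mode("overwrite")
  .saveAsTable("ml_results")
```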
-- ND
On 7/19/21 2:42 AM, Badrinath Patchikolla wrot
We are trying to switch from Postgres to Spark's built-in Hive with
Thrift server as the data sink to persist the ML result data, with the
hope that Hive would improve the ML pipeline performance. However, it
turned out that it took significantly longer for Hive to persist
dataframes (via t
Looks like you didn't set up your environment properly. I assume you
are running this from a standalone python program instead of from the
pyspark shell. I would first run your code from the pyspark shell, then
follow the spark python installation guide to set up your python
environment prope
Thanks Johnny for sharing your experience. Have you tried to use the S3A
committer? Looks like this one was introduced in the latest Hadoop for
solving problems with other committers.
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
- ND
On 6/22/21 6:41 PM, Johnn
We have a feature engineering transformer defined as a custom class with
UDF as follows:
class FeatureModder extends Transformer with DefaultParamsWritable with
DefaultParamsReadable[FeatureModder] {
val uid: String = "FeatureModder"+randomUUID
final val inputCol: Param[String] = new
rewriting it to ensure that it isn't used
in the function.
On Tue, Feb 2, 2021 at 2:32 PM Artemis User <arte...@dtechspace.com> wrote:
We tried to standardize the SQL data source management using the Avro
schema, but encountered some serialization exceptions when trying to use
the data. The interesting part is that we didn't have any problems in
reading the Avro schema JSON file and converting the Avro schema into a
SQL Struc
We are trying to create a customized transformer for a ML pipeline and
also want to persist the trained pipeline and retrieve it for
production. To enable persistence, we will have to implement read/write
functions. However, this is not feasible in Scala since the read/write
methods are priva
First some background:
* We want to use the k-means model for anomaly detection against a
multi-dimensional dataset. The current k-means implementation in
Spark is designed for clustering purposes, not exactly for anomaly
detection. Once a model is trained and pipeline is instantiated,
Could you please clarify what you mean by 1)? The driver is only
responsible for submitting the Spark job, not performing it.
-- ND
On 1/9/21 9:35 AM, András Kolbert wrote:
Hi,
I would like to get your advice on my use case.
I have a few spark streaming applications where I need to keep
updating a da
Hmm, looks like Spark 2.3+ does support stream-to-stream join. But the
online doc doesn't provide any examples. If anyone could provide some
concrete reference, I'd really appreciate it. Thanks! -- ND
On 12/22/20 9:57 AM, Artemis User wrote:
Is there any way to integrate/fuse multiple streaming sources into a
single stream process? In other words, the current structured streaming
API dictates a single streaming source and sink. We'd like to have a
stream process that interfaces with multiple stream sources, performs a
join and di
Wheel is used for package management and setting up your virtual
environment, not as a library package. To run spark-submit in a
virtual env, use the --py-files option instead. Usage:
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py
files to place on the PYTHONPAT
E/lib.
Artemis User <arte...@dtechspace.com> wrote on Friday, December 11, 2020 at 5:21 AM:
What happened was that you made the mysql jar file only available to the
spark driver, not the executors. Use the --jars parameter instead of
driver-class-path to specify your third-party jar files, or copy the
third-party jar files to the jars directory for Spark in your HDFS, and
specify the
We have a Spark job that produces a result data frame, say DF-1 at the
end of the pipeline (i.e. Proc-1). From DF-1, we need to create two or
more dataframes, say DF-2 and DF-3, via additional SQL or ML processes,
i.e. Proc-2 and Proc-3. Ideally, we would like to perform Proc-2 and
Proc-3 in
On Wed, 2 Dec 2020 at 23:11, Artemis User <arte...@dtechspace.com>
Apparently this is an OS dynamic lib link error. Make sure you have the
LD_LIBRARY_PATH (in Linux) or PATH (Windows) set up properly for the
right .so or .dll file...
On 12/2/20 5:31 PM, Mich Talebzadeh wrote:
Hi,
I have a simple code that tries to create Hive derby database as follows:
from
of
these files
in executors can be accessed via
SparkFiles.get(fileName).
-- ND
On 11/25/20 9:51 PM, Artemis User wrote:
This is a typical file sharing problem in Spark. Just setting up HDFS
won't solve the problem unless you make your local machine part of
the cluster. The Spark server doesn't share files with your local machine
without mounting drives to each other. The best/easiest way to share
the data betw
I guess I misread your message. The archive directory should contain
only jar files, not tar.gz files...
On 11/14/20 10:11 AM, Artemis User wrote:
Assuming you are using Hadoop for your YARN cluster, you can specify
the Spark parameters spark.yarn.archive or spark.yarn.jars to contain
the jar directory or jar files so that Hadoop can find them by default.
See Spark online doc for details
(http://spark.apache.org/docs/latest/running-on-
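A hedged sketch of the corresponding spark-defaults.conf entries (the HDFS paths
are placeholders; use either the archive or the jars setting, not both):
```
spark.yarn.archive   hdfs:///spark/jars/spark-libs.zip
# or, alternatively:
# spark.yarn.jars    hdfs:///spark/jars/*.jar
```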
Spark distributes loads to executors, and the executors are usually
pre-configured with the number of cores. You may want to check with
your Spark admin on how many executors (or slaves) your Spark cluster is
configured with and how many cores are pre-configured for executors.
The debugging too
The best option certainly would be to recompile the Spark Connector for
MS SQL server using the Spark 3.0.1/Scala 2.12 dependencies, and just
fix the compiler errors as you go. The code is open source on github
(https://github.com/microsoft/sql-spark-connector). Looks like this
connector is us
By default Spark will build with Hive 2.3.7, according to the Spark
build doc. If you want to replace it with a different hive jar, you
need to change the Maven pom.xml file.
-- ND
On 10/22/20 11:35 AM, Ravi Shankar wrote:
Hello all,
I am trying to understand how the Spark SQL integration wi
Is there any way to access the DataFrame content directly/interactively
via some client access APIs? Some background info:
1. We have a Java client application that uses spark launcher to submit
a spark job to a spark master.
2. The default spark launcher API has only a handle API that prov
If it was running fine before and stops working now, one thing I could
think of is that your disk may be full. Checking your disk space and cleaning up
your old log files might help...
On 10/18/20 12:06 PM, rajat kumar wrote:
Hello Everyone,
My spark streaming job is running too slow, it is having bat
and let me know if it helps.
On Fri, Oct 16, 2020 at 10:37 AM Artemis User <arte...@dtechspace.com> wrote:
Thank you all for the responses. Basically we were dealing with
file source (not Kafka, therefore no topics involved) and dumping
csv files (about 1000 lines, 3
your data in one node, and then run ML
transformations in parallel
*From: *Artemis User
*Date: *Friday, October 16, 2020 at 3:52 PM
*To: *"user@spark.apache.org"
*Subject: *RE: [EXTERNAL] How to Scale Streaming Application to
Multiple Workers
mpler
solutions.
*From: *Artemis User
*Date: *Friday, October 16, 2020 at 2:19 PM
*Cc: *user
*Subject: *RE: [EXTERNAL] How to Scale Streaming Application to
Multiple Workers
On Thu, 15 Oct 2020 at 20:02, Artemis User <arte...@dtechspace.com> wrote:
Thanks for the input. What I am interested is how to have
multiple
workers to read and process the small files in pa