Fwd: Spark-submit doesn't load all app classes in the classpath

2023-01-28 Thread Soheil Pourbafrani
Hello all, I'm using Oozie to manage a Spark application on a YARN cluster, in yarn-cluster mode. Recently I made some changes to the application in which the Hikari library is involved. Surprisingly, when I started the job, I got a ClassNotFoundException for the Hikari classes. I'm passing a shaded jar

Converting RelationalGroupedDataset to DataFrame

2021-02-06 Thread Soheil Pourbafrani
Hi, In my problem, I need to group the DataFrame, apply the business logic to each group, and finally emit a new DataFrame based on that. To describe it in detail, there is a device_dataframe that contains the timestamps of when the device was turned on (on) and turned off (off).
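
A hedged sketch (not from the thread) of one way to apply per-group business logic and emit a new DataFrame, using the typed groupByKey/flatMapGroups API. The Event schema, column names, and the on/off pairing logic are assumptions for illustration; spark and device_dataframe are assumed to exist.

    import spark.implicits._

    case class Event(deviceId: String, state: String, ts: Long)   // hypothetical schema

    val sessions = device_dataframe.as[Event]
      .groupByKey(_.deviceId)
      .flatMapGroups { (id, events) =>
        // arbitrary business logic per group; emit any number of output rows
        events.toSeq.sortBy(_.ts).sliding(2).collect {
          case Seq(on, off) if on.state == "on" && off.state == "off" => (id, on.ts, off.ts)
        }
      }
      .toDF("deviceId", "on_ts", "off_ts")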

Create Hive table from CSV file

2019-02-11 Thread Soheil Pourbafrani
Hi, Using the following code I create a Thrift Server, including a Hive table created from a CSV file, and I expect it to treat the first line as a header. But when I select data from that table, I see it treats the CSV header as a data row! It seems the line "TBLPROPERTIES(skip.header.line.count =
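
A hedged sketch of two common workarounds (not from the thread); the table name, schema, and paths are assumptions. The TBLPROPERTIES key and value normally need to be quoted, and some Spark versions have not honored skip.header.line.count for Hive text tables at all, in which case Spark's own CSV reader is the simpler route.

    spark.sql(
      """CREATE TABLE IF NOT EXISTS people (name STRING, age INT)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |STORED AS TEXTFILE
        |LOCATION '/data/people'
        |TBLPROPERTIES ('skip.header.line.count' = '1')""".stripMargin)

    // Alternative: let Spark's CSV reader drop the header itself.
    val df = spark.read.option("header", "true").csv("/data/people.csv")
    df.createOrReplaceTempView("people")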

Customizing Spark ThriftServer

2019-01-23 Thread Soheil Pourbafrani
Hi, I want to create a Thrift server that has some Hive tables predefined and listens on a port for user queries. Here is my code: val spark = SparkSession.builder() .config("hive.server2.thrift.port", "1") .config("spark.sql.hive.thriftServer.singleSession", "true")
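
A hedged sketch (not from the thread) of how a predefined table and an in-process Thrift server are usually wired together; the port, CSV path, and view name are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val spark = SparkSession.builder()
      .config("hive.server2.thrift.port", "10001")                  // port is an assumption
      .config("spark.sql.hive.thriftServer.singleSession", "true")
      .enableHiveSupport()
      .getOrCreate()

    // Register whatever tables should be visible to clients before starting the server.
    spark.read.option("header", "true").csv("/data/some.csv")
      .createOrReplaceTempView("predefined_table")

    HiveThriftServer2.startWithContext(spark.sqlContext)            // listens for JDBC/ODBC queries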

How to query Cassandra and load the results into a Spark dataframe

2019-01-22 Thread Soheil Pourbafrani
Hi, Using the command val table = spark .read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> "A", "keyspace" -> "B")) .load one can load a whole table into a dataframe. Instead, I want to run a query in Cassandra and load just the result into a dataframe (not
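
A hedged sketch (not from the thread): the connector loads lazily, so applying a filter on the loaded DataFrame lets supported predicates be pushed down to Cassandra rather than fetching the whole table; the column name and value are assumptions.

    import org.apache.spark.sql.functions.col

    val result = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "A", "keyspace" -> "B"))
      .load()
      .filter(col("event_time") > "2019-01-01")   // pushed down to Cassandra when possible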

Re: How to sleep Spark job

2019-01-22 Thread Soheil Pourbafrani
aming endpoint. That decision really depends more on >> your use case. >> >> On Tue, Jan 22, 2019 at 11:56 PM Soheil Pourbafrani < >> soheil.i...@gmail.com> wrote: >> >>> Hi, >>> >>> I want to submit a job in YARN cluster to read data

How to sleep Spark job

2019-01-22 Thread Soheil Pourbafrani
Hi, I want to submit a job to a YARN cluster that reads data from Cassandra and writes it to HDFS every hour, for example. Is it possible to make the Spark application sleep in a while-true loop and wake up every hour to process data?
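
A hedged sketch of the loop idea from the question (not from the thread); the table names and output path are assumptions. Whether this beats an external scheduler (cron, Oozie, Airflow) depends on the use case, since the application keeps its YARN resources while it sleeps.

    while (true) {
      val df = spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "A", "keyspace" -> "B"))
        .load()

      df.write.mode("append").parquet("hdfs:///data/cassandra_dump")

      Thread.sleep(60 * 60 * 1000L)   // wake up roughly once an hour
    }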

How thriftserver loads data

2019-01-16 Thread Soheil Pourbafrani
Hi, I want to write an application that loads data from HDFS into tables, creates a ThriftServer, and submits it to the YARN cluster. The question is how Spark actually loads the data. Does Spark load data into memory as soon as the application starts, or does it wait for a query and just load data according
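
A hedged sketch (not from the thread): registering a table does not read the data by itself; it is read lazily when a query arrives, unless it is explicitly cached up front. The path and table name are assumptions.

    // Lazy: nothing is read from HDFS until a query actually touches the view.
    spark.read.parquet("hdfs:///warehouse/events").createOrReplaceTempView("events")

    // Eager: CACHE TABLE materializes the data in memory right away.
    spark.sql("CACHE TABLE events")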

SparkSQL: listen on a port and process queries

2019-01-15 Thread Soheil Pourbafrani
Hi, In my problem, data is stored in both a database and HDFS. I want to create an application in which, depending on the query, Spark loads the data, processes the query, and returns the answer. I'm looking for a service that receives SQL queries and returns the answers (like a database command line). Is there a way that my

Spark app writes nothing to HDFS

2018-12-17 Thread Soheil Pourbafrani
Hi, I submit an app on a Spark 2 cluster using the standalone scheduler in client mode. The app saves the results of the processing to HDFS. There is no error in the output logs and the app finishes successfully. But the problem is it just creates a _SUCCESS file and an empty part-0 file in the saving

Using the cosineSimilarity method to get pairwise document similarity

2018-11-15 Thread Soheil Pourbafrani
Hi, I computed the TF-IDF vectors for the documents, stored them in an RDD, and converted it into a RowMatrix: val mat = new RowMatrix(tweets_tfidf) Every element of the RDD is a sparse Vector corresponding to a document. The problem is that *cosineSimilarity* computes the similarity between columns. Is there any
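
A hedged sketch of one workaround (not from the thread): columnSimilarities(), the cosine-similarity method on RowMatrix, works column-wise, so the document-by-term matrix can be transposed first so that each document becomes a column; tweets_tfidf is assumed to be an RDD[Vector].

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    val indexed = new IndexedRowMatrix(
      tweets_tfidf.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })

    // Transpose so documents become columns, then compute pairwise cosine similarity.
    val docSims = indexed.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()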

Using columnSimilarities with a threshold results in values greater than one

2018-11-15 Thread Soheil Pourbafrani
Testing the *columnSimilarities* method in Spark, I create a *RowMatrix* object: val temp = sc.parallelize(Array((5.0, 1.0, 4.0), (2.0, 3.0, 8.0), (4.0, 5.0, 10.0), (1.0, 3.0, 6.0))) val rows = temp.map(line => { Vectors.dense(Array(line._1, line._2, line._3)) }) val mat = new RowMatrix(rows)
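
A hedged note in code form (not from the thread): the thresholded variant uses the approximate DIMSUM sampling scheme, so individual estimates can come out slightly above 1.0, whereas the no-argument call is exact.

    val exact  = mat.columnSimilarities()      // brute-force cosine similarities, always <= 1
    val approx = mat.columnSimilarities(0.5)   // DIMSUM estimates above the threshold; may exceed 1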

Scala: Utils is not accessible in def main

2018-11-11 Thread Soheil Pourbafrani
Hi, I want to use the org.apache.spark.util.Utils class in def main, but I get the error: Symbol Utils is not accessible from this place. Here is the code: val temp = tokens.map(word => Utils.nonNegativeMod(x, y)) How can I make it accessible?
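
A hedged sketch (not from the thread): org.apache.spark.util.Utils is marked private[spark], so application code cannot reference it; the method in question is small enough to reimplement locally. The body below is the standard non-negative modulo idiom, and the usage line (word.hashCode, numPartitions) is an assumption.

    def nonNegativeMod(x: Int, mod: Int): Int = {
      val rawMod = x % mod
      rawMod + (if (rawMod < 0) mod else 0)
    }

    val temp = tokens.map(word => nonNegativeMod(word.hashCode, numPartitions))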

What is BDV in the Spark sources

2018-11-09 Thread Soheil Pourbafrani
Hi, Checking the Spark sources, I came across a type BDV: breeze.linalg.{DenseVector => BDV}, and it is used in calculating IDF from term frequencies. What is it exactly?
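
For reference, a minimal illustration of what that import does: BDV is just a rename, so Breeze's DenseVector does not collide with Spark's own org.apache.spark.mllib.linalg.DenseVector.

    import breeze.linalg.{DenseVector => BDV}

    val v: BDV[Double] = BDV(1.0, 2.0, 3.0)   // a plain Breeze dense vector, used internally by MLlib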

Modifying pyspark sources

2018-11-05 Thread Soheil Pourbafrani
I want to apply a minor modification to PySpark MLlib, but checking the PySpark sources I found that it uses the Scala (Java) sources. Is there any way to make the modification using Python?

Re: Is there any Spark source in Java

2018-11-03 Thread Soheil Pourbafrani
u can find some classes implemented in Java: >> >> https://github.com/apache/spark/search?l=java >> >> Cheers, >> Jeyhun >> >> On Sat, Nov 3, 2018 at 6:42 PM Soheil Pourbafrani >> wrote: >> >>> Hi, I want to customize some part of Spark. I was wondering if there any >>> Spark source is written in Java language, or all the sources are in Scala >>> language? >>> >>

Is there any Spark source in Java

2018-11-03 Thread Soheil Pourbafrani
Hi, I want to customize some parts of Spark. I was wondering whether any Spark source is written in the Java language, or whether all the sources are in Scala?

Is it possible to customize Spark TF-IDF implementation

2018-11-02 Thread Soheil Pourbafrani
Hi, I want to know whether it is possible to customize the logic of TF-IDF in Apache Spark. In typical TF-IDF, the TF is computed for each word per document. For example, the TF of word "A" can differ between documents D1 and D2, but I want to see the TF as the term frequency across the whole
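
A hedged sketch of one way to get a corpus-wide term frequency outside the built-in HashingTF/IDF pipeline (not from the thread); docs is assumed to be an RDD[Seq[String]] of tokenized documents.

    val globalTf = docs
      .flatMap(tokens => tokens.map(term => (term, 1L)))
      .reduceByKey(_ + _)                      // term frequency across the whole corpus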

Multiplying a matrix by its transpose gives undesired output

2018-11-02 Thread Soheil Pourbafrani
Hi, I want to compute the cosine similarities of vectors using Apache Spark. In a simple example, I created a vector from each document using the built-in TF-IDF. Here is the code: hashingTF = HashingTF(inputCol="tokenized", outputCol="tf") tf = hashingTF.transform(df) idf = IDF(inputCol="tf",

Re: Pyspark create RDD of dictionary

2018-11-02 Thread Soheil Pourbafrani
Got it, thanks! On Fri, Nov 2, 2018 at 7:18 PM Eike von Seggern wrote: > Hi, > > Soheil Pourbafrani wrote on Fri., 2 Nov 2018 > at 15:43: > >> Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to >> transform every row to a dictionar

Pyspark create RDD of dictionary

2018-11-02 Thread Soheil Pourbafrani
Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to transform every row into a dictionary of the form a:(b, c, d, e). Here is my code, but it throws an error! map(lambda row : {row[0][0] : (row[1], row[0][1], row[0][2], row[0][3])) Is it possible to do such a transformation?

Number of rows divided by rowsPerBlock cannot exceed maximum integer

2018-10-28 Thread Soheil Pourbafrani
Hi, Doing a Cartesian multiplication against a matrix, I got the error: pyspark.sql.utils.IllegalArgumentException: requirement failed: Number of rows divided by rowsPerBlock cannot exceed maximum integer. Here is the code: normalizer = Normalizer(inputCol="feature", outputCol="norm") data =

Processing Flexibility Between RDD and Dataframe API

2018-10-28 Thread Soheil Pourbafrani
Hi, There are some functions, like map, flatMap, reduce, and so on, that constitute the basic data processing operations in big data (and Apache Spark). But Spark, in newer versions, introduces the high-level DataFrame API and recommends using it, even though there are no such functions in the DataFrame API
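
A hedged sketch (not from the thread): in Scala the typed Dataset side of the API still exposes map/flatMap, and a DataFrame can always be dropped to its underlying RDD when the functional operators are needed; the sample data is an assumption.

    import spark.implicits._

    val df = Seq("a b", "c d e").toDF("text")

    val words = df.as[String].flatMap(_.split(" "))   // Dataset[String], functional style
    val asRdd = df.rdd                                // RDD[Row], for the classic RDD operators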

How to access each line's file name when loading files with the textFile method

2018-09-24 Thread Soheil Pourbafrani
Hi, My text data are in the form of text files. In the processing logic, I need to know which file each word comes from. Actually, I need to tokenize the words and create pairs of (word, fileName). The naive solution is to call sc.textFile for each file and, having the fileName in a variable, create the pairs, but
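
A hedged sketch of two common alternatives (not from the thread); the paths are assumptions.

    import org.apache.spark.sql.functions.input_file_name

    // (fileName, fileContent) pairs, one record per file
    val pairs = sc.wholeTextFiles("hdfs:///data/texts/*").flatMap { case (file, content) =>
      content.split("\\s+").map(word => (word, file))
    }

    // Or with the DataFrame reader: attach the source file name to every line
    val lines = spark.read.textFile("hdfs:///data/texts/*")
      .withColumn("file", input_file_name())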

Is it possible to implement Vector Space Model using PySpark

2018-09-23 Thread Soheil Pourbafrani
Hi, I want to implement the Vector Space Model for texts using Spark. In the first step, I calculate the vector of the files (the dictionary) and make it a broadcast variable so it is accessible to all executors. Vector_of_Words = selected_data.select('full_text').rdd\ .map(lambda x :

Re: Use Shared Variable in PySpark Executors

2018-09-22 Thread Soheil Pourbafrani
Ok, I'll do that. Thanks On Sat, Sep 22, 2018 at 7:09 PM Jörn Franke wrote: > Do you want to calculate it and share it once with all other executors? > Then a broadcast variable may be interesting for you, > > > On 22. Sep 2018, at 16:33, Soheil Pourbafrani > wrote: > &

Use Shared Variable in PySpark Executors

2018-09-22 Thread Soheil Pourbafrani
Hi, I want to do some processing with PySpark and save the results in a variable of type tuple that should be shared among the executors for further processing. Actually, it's a text mining process and I want to use the Vector Space Model. So I want to calculate the vector of all words (that

Spark application completes its job successfully on a YARN cluster but YARN registers it as failed

2018-06-20 Thread Soheil Pourbafrani
Hi, I run a Spark application on a YARN cluster and it completes the process successfully, but at the end YARN prints in the console: client token: N/A diagnostics: Application application_1529485137783_0004 failed 4 times due to AM Container for appattempt_1529485137783_0004_04 exited with

How can I implement the following simple scenario in Spark

2018-06-19 Thread Soheil Pourbafrani
Hi, I have a JSON file with the following structure: | full_text | id | (a DataFrame with a full_text column and an id column). I want to tokenize each sentence into pairs of (word, id). For example, given the record ("Hi, How are you?", id)
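
A hedged sketch (not from the thread), assuming the JSON has been read into a DataFrame df with the full_text and id columns described above.

    import org.apache.spark.sql.functions.{col, explode, split}

    val pairs = df.select(explode(split(col("full_text"), "\\s+")).as("word"), col("id"))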

Error when fetching large amounts of data from Cassandra using SparkSQL

2018-05-28 Thread Soheil Pourbafrani
I tried to fetch some data from Cassandra using SparkSQL. For small tables, everything goes well, but when trying to fetch data from big tables I get the following error: java.lang.NoSuchMethodError:

Export a dataset in image format

2018-04-27 Thread Soheil Pourbafrani
Hi, using Spark 2.3 I read an image into a dataset using the image schema. Now, after some changes, I want to save the dataset as a new image. How can I achieve this in Spark?

Writing data to an HDFS high-availability cluster

2018-01-18 Thread Soheil Pourbafrani
I have an HDFS high-availability cluster with two namenodes, one active and one standby. When I want to write data to HDFS, I use the active namenode's address. Now, my question is: what happens if the active namenode fails while Spark is writing data? Is there any way to set both active
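
A hedged sketch (not from the thread): with HDFS HA the usual approach is to write to the logical nameservice URI rather than a specific namenode, letting the HDFS client handle failover. The nameservice id, hostnames, and properties below are illustrative assumptions; in practice they come from hdfs-site.xml/core-site.xml on the cluster.

    val hconf = spark.sparkContext.hadoopConfiguration
    hconf.set("fs.defaultFS", "hdfs://mycluster")
    hconf.set("dfs.nameservices", "mycluster")
    hconf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
    hconf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
    hconf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
    hconf.set("dfs.client.failover.proxy.provider.mycluster",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

    df.write.parquet("hdfs://mycluster/output/results")   // path uses the nameservice, not a namenode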

Spark application on yarn cluster clarification

2018-01-18 Thread Soheil Pourbafrani
I am setting up a YARN cluster to run Spark applications on it, but I'm a bit confused! Suppose I have a 4-node YARN cluster including one resource manager and 3 node managers, and Spark is installed on all 4 nodes. Now my question is, when I want to submit a Spark application to the YARN cluster, is

Re: Why this code is erroneous

2017-12-13 Thread Soheil Pourbafrani
arams) ); On Thu, Dec 14, 2017 at 10:30 AM, Soheil Pourbafrani <soheil.i...@gmail.com> wrote: > The following code is in SparkStreaming : > > JavaInputDStream results = stream.map(record -> > SparkTest.getTime(record.value()) + ":" &

Why this code is erroneous

2017-12-13 Thread Soheil Pourbafrani
The following code is in Spark Streaming: JavaInputDStream results = stream.map(record -> SparkTest.getTime(record.value()) + ":" + Long.toString(System.currentTimeMillis()) + ":" + Arrays.deepToString(SparkTest.finall(record.value())) + ":" +