Hello all,
I'm using Oozie to manage a Spark application on a YARN cluster, in
yarn-cluster mode.
Recently I made some changes to the application in which the HikariCP
library was involved. Surprisingly, when I started the job, I got a
ClassNotFoundException for the HikariCP classes. I'm passing a shaded jar
Hi,
In my problem, I need to group the DataFrame, apply the business logic to
each group, and finally emit a new DataFrame based on that. To describe it
in detail, there is a device_dataframe which contains the timestamps of
when the device was turned on (on) and turned off (off).
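A minimal sketch of one way to do this with the typed Dataset API, assuming a schema like (device_id, event, ts) where event is "on" or "off"; the column names and the session-pairing logic are placeholders, not the actual business logic:

import org.apache.spark.sql.{DataFrame, SparkSession}

case class DeviceEvent(device_id: String, event: String, ts: Long)
case class Session(device_id: String, onTs: Long, offTs: Long)

def sessions(spark: SparkSession, deviceDf: DataFrame): DataFrame = {
  import spark.implicits._
  deviceDf.as[DeviceEvent]
    .groupByKey(_.device_id)
    .flatMapGroups { (id, events) =>
      // sort each device's events by time and pair consecutive on/off events
      events.toSeq.sortBy(_.ts)
        .sliding(2, 2)
        .collect { case Seq(on, off) if on.event == "on" && off.event == "off" =>
          Session(id, on.ts, off.ts)
        }
    }
    .toDF()
}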
Hi, Using the following code I create a Thrift Server including a Hive
table from a CSV file, and I expect it to treat the first line as a header,
but when I select data from the so-called table, I see it treats the CSV
header as a data row! It seems the line "TBLPROPERTIES(skip.header.line.count =
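For reference, a sketch of two common workarounds; the table and path names are hypothetical. Note that Spark's native CSV handling has historically ignored skip.header.line.count in some versions, which may explain the behavior:

// 1) declare the header in the DDL and hope the serde honors it
spark.sql("""
  CREATE TABLE my_table (col1 STRING, col2 INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/my_csv'
  TBLPROPERTIES ('skip.header.line.count'='1')
""")

// 2) side-step the serde: let the DataFrame reader drop the header,
//    then register the result as a table
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/my_csv")
  .write.saveAsTable("my_table")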
Hi, I want to create a Thrift server that has some Hive tables predefined
and listens on a port for user queries. Here is my code:
val spark = SparkSession.builder()
.config("hive.server2.thrift.port", "1")
.config("spark.sql.hive.thriftServer.singleSession", "true")
Hi,
Using the command
val table = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "A", "keyspace" -> "B"))
.load
one can load a whole table into a DataFrame. Instead, I want to run a
query in Cassandra and load just the result into a DataFrame (not
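A sketch of the usual approach with the spark-cassandra-connector: filters applied to the DataFrame are pushed down to Cassandra where possible, so only the matching rows are transferred. The column name and value are hypothetical:

import org.apache.spark.sql.functions.col

val result = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "A", "keyspace" -> "B"))
  .load()
  .filter(col("partition_key") === "some_value") // pushed down as a CQL predicate

// result.explain(true) shows which predicates reached Cassandra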
...aming endpoint. That decision really depends more on your use case.
Hi,
I want to submit a job to the YARN cluster that reads data from Cassandra
and writes it to HDFS every hour, for example.
Is it possible to make the Spark application sleep in a while(true) loop
and wake up every hour to process the data?
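A minimal sketch of the while(true) approach; the interval and the read/write logic are placeholders. A scheduler (cron, Oozie, Airflow) is usually more robust, but the loop does work for a long-lived application:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hourly-export").getOrCreate()

while (true) {
  val df = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "A", "keyspace" -> "B"))
    .load()

  // write each batch to a new timestamped directory
  df.write.parquet(s"hdfs:///exports/ts=${System.currentTimeMillis()}")

  Thread.sleep(60 * 60 * 1000L) // sleep one hour between batches
}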
Hi, I want to write an application that loads data from HDFS into tables,
creates a ThriftServer, and submits it to the YARN cluster.
The question is how Spark actually loads the data. Does Spark load the data
into memory when the application starts, or does it wait for a query and
just load data according
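A sketch illustrating both behaviors; the path and table name are hypothetical. By default Spark is lazy: registering a table reads nothing until a query runs, while CACHE TABLE materializes the data eagerly:

spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

// nothing has been read yet; this first query triggers the actual scan
spark.sql("SELECT count(*) FROM events").show()

// to pre-load at startup instead, cache eagerly
spark.sql("CACHE TABLE events") // eager in Spark 2.x: scans and caches immediately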
Hi,
In my problem, data is stored in both a database and HDFS. I created an
application in which, according to the query, Spark loads the data,
processes the query, and returns the answer.
I'm looking for a service that accepts SQL queries and returns the answers
(like a database's command line). Is there a way that my
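For the client side, a sketch of querying such a service over JDBC, assuming the Spark Thrift Server is running and the hive-jdbc driver is on the classpath; host, port, user, and query are placeholders:

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM events")
while (rs.next()) println(rs.getLong(1))
conn.close()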
Hi, I submit an app to a Spark 2 cluster using the standalone scheduler in
client mode.
The app saves the results of the processing to HDFS. There is no error in
the output logs and the app finishes successfully.
But the problem is it just creates a _SUCCESS file and an empty part-00000
file in the saving
Hi,
I got the TF-IDF vectors for the documents, stored them in an RDD, and
converted it into a RowMatrix:
val mat = new RowMatrix(tweets_tfidf)
Every element of the RDD is a sparse Vector representing a document.
The problem is that columnSimilarities computes the similarity between
columns. Is there any
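A sketch of one workaround: since columnSimilarities() works column-wise, transpose the matrix so documents become columns. This goes through CoordinateMatrix and can be expensive for large data:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}

def rowSimilarities(mat: RowMatrix): CoordinateMatrix = {
  val entries = mat.rows.zipWithIndex.flatMap { case (vec, i) =>
    vec.toArray.zipWithIndex.collect { case (v, j) if v != 0.0 =>
      MatrixEntry(j, i, v) // swap the indices to transpose
    }
  }
  new CoordinateMatrix(entries)
    .toRowMatrix()        // documents are now columns
    .columnSimilarities() // cosine similarity between documents
}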
Testing the columnSimilarities method in Spark, I create a RowMatrix
object:
val temp = sc.parallelize(Array((5.0, 1.0, 4.0), (2.0, 3.0, 8.0),
  (4.0, 5.0, 10.0), (1.0, 3.0, 6.0)))
val rows = temp.map(line => Vectors.dense(Array(line._1, line._2, line._3)))
val mat = new RowMatrix(rows)
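To finish the test, a short sketch of calling it on the matrix above; columnSimilarities() returns a CoordinateMatrix holding the upper triangle of pairwise column cosine similarities:

val sims = mat.columnSimilarities()
sims.entries.collect().foreach { e =>
  println(s"cols ${e.i} and ${e.j}: cosine = ${e.value}")
}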
Hi,
I want to use the org.apache.spark.util.Utils class in def main but I got
the error:
Symbol Utils is not accessible from this place. Here is the code:
val temp = tokens.map(word => Utils.nonNegativeMod(x, y))
How can I make it accessible?
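Utils is private[spark], so it is not accessible from user code. One option is simply to reimplement the helper; a sketch that follows its documented behavior (a modulo that is never negative), with numBuckets standing in for the y above:

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val temp = tokens.map(word => nonNegativeMod(word.hashCode, numBuckets))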
Hi,
Checking the Spark sources, I came across a type BDV:
breeze.linalg.{DenseVector => BDV}
which is used in calculating IDF from term frequencies. What is it
exactly?
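BDV is just an import alias: Breeze's DenseVector renamed so it cannot be confused with Spark's own DenseVector in the same file. A minimal illustration:

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{DenseVector => SDV}

val breezeVec: BDV[Double] = BDV(1.0, 2.0, 3.0)       // Breeze vector, supports math ops
val sparkVec: SDV = new SDV(Array(1.0, 2.0, 3.0))     // Spark mllib vector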
I want to apply a minor modification to PySpark's mllib, but checking the
PySpark sources I found that it delegates to the Scala (Java) sources. Is
there any way to do the modification using Python?
...you can find some classes implemented in Java:
https://github.com/apache/spark/search?l=java
Cheers,
Jeyhun
Hi, I want to customize some parts of Spark. I was wondering whether any of
the Spark sources are written in Java, or whether all the sources are in
Scala.
Hi, I want to know whether it is possible to customize the logic of TF-IDF
in Apache Spark.
In typical TF-IDF, the TF is computed for each word with respect to its
document. For example, the TF of word "A" can differ between documents D1
and D2, but I want to see the TF as the term frequency across the whole
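A sketch of computing corpus-wide term frequencies by hand, instead of the per-document TF that HashingTF assumes; docs is assumed to be an RDD[Seq[String]] of tokenized documents:

val corpusTf = docs
  .flatMap(tokens => tokens.map(t => (t, 1L)))
  .reduceByKey(_ + _) // raw count of each term over the whole corpus

val total = corpusTf.values.sum()
val corpusTfNorm = corpusTf.mapValues(_ / total) // corpus-wide relative frequency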
Hi,
I want to compute the cosine similarities of vectors using Apache Spark. In
a simple example, I created a vector for each document using the built-in
TF-IDF. Here is the code:
hashingTF = HashingTF(inputCol="tokenized", outputCol="tf")
tf = hashingTF.transform(df)
idf = IDF(inputCol="tf",
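Once the TF-IDF vectors exist, a sketch (in Scala) of cosine similarity between two of them: the dot product divided by the product of the norms. This densifies sparse vectors, so it is only meant for illustration:

import org.apache.spark.ml.linalg.Vector

def cosine(a: Vector, b: Vector): Double = {
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.toArray.map(x => x * x).sum)
  val normB = math.sqrt(b.toArray.map(x => x * x).sum)
  dot / (normA * normB)
}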
Got it, thanks!
Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to
transform every row into a dictionary of the form a: (b, c, d, e).
Here is my code, but it throws an error!
map(lambda row: {row[0][0]: (row[0][1], row[0][2], row[0][3], row[1])})  # a: (b, c, d, e)
Is it possible to do such a transformation?
Hi,
Doing a cartesian multiplication against a matrix, I got the error:
pyspark.sql.utils.IllegalArgumentException: requirement failed: Number of
rows divided by rowsPerBlock cannot exceed maximum integer.
Here is the code:
normalizer = Normalizer(inputCol="feature", outputCol="norm")
data =
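That requirement is checked when a BlockMatrix is built; one thing to try is passing explicit block sizes instead of the defaults. A sketch, assuming an IndexedRowMatrix named mat; the block sizes are tuning guesses, not a known fix:

import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

// mat: IndexedRowMatrix built from the normalized vectors
val block = mat.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = 1024)
val product = block.multiply(block.transpose)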
Hi,
There are some functions like map, flatMap, reduce, etc., that form the
basic data-processing operations in big data (and Apache Spark). But Spark,
in recent versions, introduces the high-level DataFrame API and recommends
using it, while there seem to be no such functions in the DataFrame
API
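In fact the typed Dataset API keeps these functional operations; a sketch of the rough equivalences, assuming df is a DataFrame with a single string column named "value":

import org.apache.spark.sql.functions.length
import spark.implicits._

val words   = df.as[String]                     // typed view of the DataFrame
val lengths = words.map(_.length)               // map, as on RDDs
val tokens  = words.flatMap(_.split("\\s+"))    // flatMap
val longest = lengths.reduce((a, b) => a max b) // reduce

// the untyped DataFrame route goes through columns instead
val lengths2 = df.select(length($"value"))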
Hi, My text data are in the form of text files. In the processing logic, I
need to know which file each word came from. Actually, I need to tokenize
the words and create pairs of (word, fileName). The naive solution is to
call sc.textFile for each file and, having the fileName in a variable,
create the pairs, but
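A sketch of the usual alternative, wholeTextFiles, which yields (path, content) pairs in a single pass instead of one textFile call per file; the path and the whitespace tokenizer are placeholders:

val pairs = sc.wholeTextFiles("hdfs:///data/texts/*")
  .flatMap { case (fileName, content) =>
    content.split("\\s+").map(word => (word, fileName))
  }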
Hi, I want to implement the Vector Space Model for texts using Spark. In
the first step, I calculate the vector of the files (the dictionary) and
make it a broadcast variable so it is accessible to all executors.
Vector_of_Words = selected_data.select('full_text').rdd\
.map(lambda x :
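A sketch (in Scala) of the broadcast step: build the vocabulary once on the driver, then share it read-only with every executor. docs is assumed to be an RDD[Seq[String]] of tokenized documents:

val vocabulary: Map[String, Int] = docs
  .flatMap(identity)
  .distinct()
  .collect()
  .zipWithIndex
  .toMap

val bVocab = sc.broadcast(vocabulary)

// executors read bVocab.value, e.g. to map each document to term indices
val indexed = docs.map(tokens => tokens.flatMap(bVocab.value.get))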
Ok, I'll do that. Thanks
On Sat, Sep 22, 2018 at 7:09 PM Jörn Franke wrote:
> Do you want to calculate it and share it once with all other executors?
> Then a broadcast variable may be interesting for you.
Hi, I want to do some processing with PySpark and save the results in a
variable of type tuple that should be shared among the executors for
further processing.
Actually, it's a text-mining process and I want to use the Vector Space
Model. So I want to calculate the vector of all words (that
Hi,
I run a Spark application on a YARN cluster and it completes the process
successfully, but at the end YARN prints in the console:
client token: N/A
diagnostics: Application application_1529485137783_0004 failed 4 times due
to AM Container for appattempt_1529485137783_0004_04 exited with
Hi, I have a JSON file with the following structure:
+---------+---+
|full_text| id|
+---------+---+
I want to tokenize each sentence into pairs of (word, id).
For example, having the record ("Hi, How are you?", id)
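A sketch of the tokenization with split + explode; the input path and the output column name are assumptions, the input columns follow the schema above:

import org.apache.spark.sql.functions.{col, explode, split}

val pairs = spark.read.json("hdfs:///data/tweets.json")
  .select(explode(split(col("full_text"), "\\s+")).as("word"), col("id"))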
I tried to fetch some data from Cassandra using Spark SQL. For small
tables, everything goes well, but trying to fetch data from big tables I
got the following error:
java.lang.NoSuchMethodError:
Hi, using Spark 2.3 I read an image into a Dataset using ImageSchema. Now,
after some changes, I want to save the Dataset as a new image. How can I
achieve this in Spark?
I have an HDFS high-availability cluster with two namenodes, one active
and one standby. When I want to write data to HDFS I use the active
namenode's address. Now, my question is what happens if the active namenode
fails while Spark is writing the data. Is there any way to set both active
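A sketch of the usual answer: address the cluster by its HA nameservice ID rather than a specific namenode, so the HDFS client fails over by itself. These settings normally live in hdfs-site.xml rather than code; the nameservice name "mycluster", the hostnames, and the output path are placeholders:

val conf = spark.sparkContext.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://mycluster")
conf.set("dfs.nameservices", "mycluster")
conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
conf.set("dfs.client.failover.proxy.provider.mycluster",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

df.write.parquet("hdfs://mycluster/output") // no namenode hostname needed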
I am setting up a YARN cluster to run Spark applications on it, but I'm a
bit confused!
Consider that I have a 4-node YARN cluster consisting of one resource
manager and 3 node managers, and Spark is installed on all 4 nodes.
Now my question is, when I want to submit a Spark application to the YARN
cluster, is
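For reference, a typical submission needs Spark only on the machine you submit from, since the application jar and the Spark runtime are shipped to the YARN containers; the class and jar names below are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar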
The following code is in Spark Streaming:
JavaDStream<String> results = stream.map(record ->
    SparkTest.getTime(record.value()) + ":"
    + Long.toString(System.currentTimeMillis()) + ":"
    + Arrays.deepToString(SparkTest.finall(record.value()))
    + ":" +