the best tool to interact with Spark

2018-06-26 Thread Donni Khan
Hi all,

What is the best tool for interacting easily with Spark?

Thank you,
Donni


problem with saving RandomForestClassifier model - Spark Java

2018-05-22 Thread Donni Khan
Hi Spark users,

I built a Random Forest model using Spark 1.6 with Java. I'm getting the
following exception:

User class threw exception: java.lang.UnsupportedOperationException:
Pipeline write will fail on this Pipeline because it contains a stage which
does not implement Writable.


Does anyone know how I can fix it?
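
One way to pin down the failing stage, assuming the Spark 1.6 ML Pipeline
API (pipeline here stands for your Pipeline instance), is to check which
stages implement MLWritable before saving; in Spark 1.6 the ML
tree-ensemble models, random forest included, did not yet support
persistence, which arrived in later releases. A minimal sketch:

import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.util.MLWritable;

// Print every stage that cannot be written, so the offending one is explicit.
for (PipelineStage stage : pipeline.getStages()) {
    if (!(stage instanceof MLWritable)) {
        System.out.println("Stage without save support: " + stage.uid());
    }
}

If the random forest stage is the one reported, upgrading Spark is the
usual way out.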

Many thanks,
Donni


Tuning Resource Allocation during runtime

2018-04-27 Thread Donni Khan
Hi All,

Is there any way to change the number of executors/cores while a Spark job
is running?
My Spark job contains two tasks: the first task needs many executors to
run fast; the second task does a lot of input/output and shuffling, so it
needs only a few executors, otherwise it takes a long time to finish.
Does anyone know whether that is possible in YARN?
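
Dynamic allocation is the standard mechanism for this on YARN: Spark grows
and shrinks the executor set based on the backlog of pending tasks, which
fits a job whose phases need different amounts of parallelism. A minimal
configuration sketch (the min/max values are placeholders; the external
shuffle service must be enabled on the YARN NodeManagers):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true") // required with dynamic allocation on YARN
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "50");

For explicit control inside the job, SparkContext also exposes
requestExecutors(...) and killExecutors(...), but dynamic allocation is
usually the simpler route.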


Thank you.
Donni


run a huge number of queries in Spark

2018-04-04 Thread Donni Khan
Hi all,

I want to run a huge number of queries on a DataFrame in Spark. I have a
large collection of text documents; I loaded all of them into a Spark
DataFrame and created a temp table.

dataFrame.registerTempTable("table1");

I have more than 50,000 terms, and I want to get the document frequency
for each of them by querying "table1".

I use the following:

DataFrame df = sqlContext.sql("select count(ID) from table1 where text like '%" + term + "%'");

but this takes a very long time to finish because I have to run one query
from the Spark driver for each term.


Does anyone have an idea how I can run all the queries in a distributed way?
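
One distributed alternative, sketched below with hypothetical variable
names (jsc for the JavaSparkContext, terms for the term list), is to
broadcast the 50,000 terms and compute all document frequencies in a
single pass over the data instead of issuing one driver-side query per
term:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

final Broadcast<List<String>> bTerms = jsc.broadcast(terms);

JavaPairRDD<String, Long> docFreq = dataFrame.javaRDD()
    .flatMap(row -> {
        String text = row.getString(1); // index of the "text" column
        List<String> hits = new ArrayList<>();
        for (String term : bTerms.value()) {
            if (text.contains(term)) hits.add(term); // at most one hit per document
        }
        return hits; // Spark 1.6 flatMap expects an Iterable
    })
    .mapToPair(term -> new Tuple2<>(term, 1L))
    .reduceByKey((a, b) -> a + b);

Each document contributes at most one count per term, so the reduced
values are document frequencies. With 50,000 terms the inner contains()
loop is heavy; a multi-pattern matcher such as Aho-Corasick would cut that
down, but the shape of the job stays the same.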

Thank you && Best Regards,

Donni


Re: Calculate co-occurring terms

2018-03-27 Thread Donni Khan
Hi again,

I found an example in Scala
<https://stackoverflow.com/questions/43797758/calculate-co-occurrence-terms-with-spark-using-scala?rq=1>
but I don't have any experience with Scala. Can anyone convert it to Java,
please?
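
A rough Java equivalent of the pair-counting idea, assuming the
preProcessedDF from the original message in this thread (see below) and
whitespace-tokenized text; both are assumptions, not part of the linked
Scala answer:

import java.util.*;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<Tuple2<String, String>, Long> coOccur = preProcessedDF.javaRDD()
    .flatMapToPair(row -> {
        // Unique, sorted tokens so each unordered pair is emitted once per document.
        String[] tokens = row.getString(1).split("\\s+");
        String[] t = new TreeSet<>(Arrays.asList(tokens)).toArray(new String[0]);
        List<Tuple2<Tuple2<String, String>, Long>> pairs = new ArrayList<>();
        for (int i = 0; i < t.length; i++)
            for (int j = i + 1; j < t.length; j++)
                pairs.add(new Tuple2<>(new Tuple2<>(t[i], t[j]), 1L));
        return pairs; // Spark 1.6 flatMapToPair expects an Iterable
    })
    .reduceByKey((a, b) -> a + b);

The result maps each (term_i, term_j) pair to the number of documents in
which both occur; filtering tokens against the extracted significant-term
list before pairing keeps the pair explosion manageable.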

Thank you,
Donni


Calculate co-occurring terms

2018-03-23 Thread Donni Khan
Hi,

I have a collection of text documents, and I extracted the list of
significant terms from that collection. I want to calculate a
co-occurrence matrix for the extracted terms using Spark.

I actually stored the collection of text documents in a DataFrame:

StructType schema = new StructType(new StructField[] {
    new StructField("ID", DataTypes.StringType, false, Metadata.empty()),
    new StructField("text", DataTypes.StringType, false, Metadata.empty())
});

// Create a DataFrame with the new schema
DataFrame preProcessedDF = sqlContext.createDataFrame(jrdd, schema);

I can extract the list of terms from "preProcessedDF" into a List, an RDD,
or a DataFrame. For each pair (term_i, term_j) I want to calculate the
related frequency from the original dataset "preProcessedDF".

Does anyone have a scalable solution?

thank you,
Donni


high TFIDF value terms

2018-02-05 Thread Donni Khan
Hi,

Does anyone know how I can get the terms with the highest TF-IDF values using Spark (Java)?

IDF idf = new IDF().setInputCol("TF").setOutputCol("IDF");
IDFModel idfModel = idf.fit(featurizedData);
DataFrame tfidf = idfModel.transform(featurizedData);
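
If the "TF" column was produced by a CountVectorizer rather than a
HashingTF (an assumption; hashed indices are not reversible), the vector
indices can be mapped back to terms. A sketch that ranks the vocabulary by
its corpus-wide IDF weight, with cvModel as the fitted
CountVectorizerModel and idfModel.idf() as the IDF weight vector:

import java.util.ArrayList;
import java.util.List;
import scala.Tuple2;

String[] vocab = cvModel.vocabulary();       // index -> term
double[] weights = idfModel.idf().toArray(); // index -> IDF weight

List<Tuple2<String, Double>> ranked = new ArrayList<>();
for (int i = 0; i < vocab.length; i++)
    ranked.add(new Tuple2<>(vocab[i], weights[i]));
ranked.sort((a, b) -> Double.compare(b._2(), a._2())); // highest weight first

Per-document TF-IDF scores live in the "IDF" output column of tfidf; the
same vocab lookup turns the largest entries of each row's vector into
terms.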


Thanks;

Donni


Singular Value Decomposition (SVD) in Spark Java

2018-01-31 Thread Donni Khan
Hi,

I would like to use Singular Value Decomposition (SVD) to extract the
important concepts from a collection of text documents. I applied the
whole preprocessing pipeline (Tokenizer, IDFModel, Matrix, ...).

Then I applied SVD:

SingularValueDecomposition<RowMatrix, Matrix> svd =
    rowMatrix.computeSVD(5, true, 1.0E-9d);
Matrix V = svd.V();


I actually want to map the results of the SVD back to the text (the
related features).

Does anyone know how I can get the original features from the Spark Java
Matrix?
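
If the columns of the RowMatrix were produced by a CountVectorizer (an
assumption; cvModel below is a hypothetical fitted CountVectorizerModel),
each row of V corresponds to one vocabulary term, since V has numFeatures
rows and k concept columns after computeSVD. A sketch that maps the
strongest loading of each concept back to a word:

String[] vocab = cvModel.vocabulary();
Matrix V = svd.V(); // numFeatures rows, k concept columns

for (int concept = 0; concept < V.numCols(); concept++) {
    int best = 0;
    for (int f = 1; f < V.numRows(); f++)
        if (Math.abs(V.apply(f, concept)) > Math.abs(V.apply(best, concept)))
            best = f; // feature with the strongest loading on this concept
    System.out.println("Concept " + concept + ": " + vocab[best]);
}

Sorting all loadings per column, instead of taking the single maximum,
gives the top-n terms per concept.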


Thank you.
Donni


cosine similarity implementation in Java Spark

2017-12-14 Thread Donni Khan
Hi all,
Is there any implementation of cosine similarity that supports Java?
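
For a single pair of vectors no special API is needed; a small helper over
mllib Vectors is enough (a sketch; for all-pairs similarity over a whole
matrix, RowMatrix.columnSimilarities() is the built-in route, as in the
threads below):

import org.apache.spark.mllib.linalg.Vector;

// Cosine similarity of two mllib vectors (assumes neither is all zeros).
static double cosine(Vector a, Vector b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    double[] x = a.toArray(), y = b.toArray();
    for (int i = 0; i < x.length; i++) {
        dot += x[i] * y[i];
        na += x[i] * x[i];
        nb += y[i] * y[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}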

Thanks,
Donni


Cosine Similarity between documents - Rows

2017-11-27 Thread Donni Khan
I have a Spark job to compute the similarity between text documents:

RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowSimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD<MatrixEntry> entries = rowSimilarity.entries().toJavaRDD();
List<MatrixEntry> list = entries.collect();
for (MatrixEntry s : list) System.out.println(s);

The MatrixEntry(i, j, value) represents the similarity between columns
(say, the features of the documents).
But how can I show the similarity between rows?
Suppose I have five documents, Doc1 through Doc5; I would like to show the
similarity between all of those documents. How do I get that? Any help?
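
columnSimilarities() only compares columns, so one way to compare
documents is to transpose the matrix first so that the documents become
the columns; a sketch, assuming vectorsRDD is a JavaRDD<Vector>:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.*;

JavaRDD<IndexedRow> indexedRows = vectorsRDD
    .zipWithIndex()
    .map(t -> new IndexedRow(t._2(), t._1())); // attach the document index to each row

CoordinateMatrix transposed = new IndexedRowMatrix(indexedRows.rdd())
    .toCoordinateMatrix()
    .transpose();

// Documents are now columns, so each MatrixEntry(i, j, value) is the
// cosine similarity between document i and document j.
CoordinateMatrix docSims = transposed.toRowMatrix().columnSimilarities(0.5);

Note that columnSimilarities() scales with the number of columns, so with
documents as columns this is only practical for modest collection sizes.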

Thank you
Donni


cosine similarity between rows

2017-10-27 Thread Donni Khan
I have a Spark job to compute the similarity between text documents:

RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowSimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD<MatrixEntry> entries = rowSimilarity.entries().toJavaRDD();
List<MatrixEntry> list = entries.collect();
for (MatrixEntry s : list) System.out.println(s);

The MatrixEntry(i, j, value) represents the similarity between columns
(say, the features of the documents). But how can I show the similarity
between rows? Suppose I have five documents, Doc1 through Doc5; I would
like to show the similarity between all of those documents. How do I get
that? Any help?


text processing in Spark (Spark job gets stuck for several minutes)

2017-10-26 Thread Donni Khan
Hi,
I'm applying preprocessing methods to a large collection of text documents
using Spark Java. I created my own NLP pipeline as plain Java code and
call it in the map function like this:

myRDD.map(row -> /* run the NLP pipeline for each row */)

I run my job on a cluster of 14 machines (32 cores and about 140G of
memory each). The job runs correctly and distributes the documents across
the executors, but it gets stuck on the last task for several minutes.
Looking at the job details, I found that most of the documents are
processed across several executors, but a single task hangs on a small
number of documents; it looks as if the task is waiting for something, and
after 10-20 minutes it continues processing the remaining documents and
finishes.

I have also tried different configurations, but the result is the same.
Any help?
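
This pattern usually points to data skew (a few very large or pathological
documents landing in one partition) or to a straggling executor. Two
mitigations worth trying, sketched with placeholder values:

import org.apache.spark.SparkConf;

// 1) Rebalance so that no single task carries all the expensive documents.
JavaRDD<String> balanced = myRDD.repartition(myRDD.getNumPartitions() * 4);

// 2) Enable speculative execution so Spark re-launches slow tasks elsewhere.
SparkConf conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.multiplier", "2"); // re-launch tasks 2x slower than the median

If a single document alone takes 10-20 minutes in the NLP pipeline,
neither helps; logging per-document processing time inside the map
function will show whether a specific input is the culprit.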

thanks,
Donni