the best tool to interact with Spark
Hi all, what is the best tool to interact easily with Spark? Thank you, Donni
problem with saving RandomForestClassifier model - Spark Java
Hi Spark users, I built a random forest model using Spark 1.6 with Java. I'm getting the following exception:

User class threw exception: java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable.

Does anyone know how I can fix it? Many thanks, Donni
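One first step (a minimal sketch, not from the thread; `pipeline` is assumed to be the Pipeline being saved) is to find out which stage lacks persistence support, since in Spark 1.6 only some stages implement MLWritable:

import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.util.MLWritable;

// List the stages that do not implement MLWritable and therefore block
// Pipeline.write(). `pipeline` is an assumed variable, not from the post.
for (PipelineStage stage : pipeline.getStages()) {
    if (!(stage instanceof MLWritable)) {
        System.out.println("Stage not writable: " + stage.getClass().getName());
    }
}

If the random forest stage itself turns out to be the one that is not writable, saving it may simply be unsupported in 1.6; as far as I know, persistence support for more ML stages, including the tree ensembles, arrived in later Spark releases, so upgrading is worth checking.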
Tuning Resource Allocation during runtime
Hi All, Is there any way to change the number of executors/cores while a Spark job is running? I have a Spark job containing two tasks: the first task needs many executors to run fast; the second task has many input and output operations and shuffling, so it needs few executors, otherwise it takes a long time to finish. Does anyone know if that is possible on YARN? Thank you. Donni
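Spark's dynamic allocation on YARN lets the executor count grow and shrink at runtime based on the backlog of pending tasks. A sketch of the relevant settings (the values are illustrative, and the external shuffle service must also be running on each YARN NodeManager):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")               // required for dynamic allocation
    .set("spark.dynamicAllocation.minExecutors", "2")           // floor during quiet phases
    .set("spark.dynamicAllocation.maxExecutors", "50")          // ceiling during busy phases
    .set("spark.dynamicAllocation.executorIdleTimeout", "60s"); // release idle executors

With this enabled, YARN grows the executor count while many tasks are pending and releases executors when they sit idle, which matches the two-phase job described above. As far as I know, there is no stable per-stage knob in Spark 1.6 for setting executor counts manually mid-job.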
run a huge number of queries in Spark
Hi all, I want to run a huge number of queries on a DataFrame in Spark. I have a big collection of text documents; I loaded all the documents into a Spark DataFrame and created a temp table:

dataFrame.registerTempTable("table1");

I have more than 50,000 terms, and I want to get the document frequency for each of them using "table1". I use the following:

DataFrame df = sqlContext.sql("select count(ID) from table1 where text like '%" + term + "%'");

but this approach takes a long time to finish because I have to run it from the Spark driver for each term. Does anyone have an idea how I can run all the queries in a distributed way? Thank you && Best Regards, Donni
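One way to avoid 50,000 separate driver-side queries (a sketch, not from the thread; `jsc` is an assumed JavaSparkContext and `terms` the assumed term list) is to broadcast the terms and count matches in a single pass over the documents:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;
import scala.Tuple2;

Broadcast<List<String>> bTerms = jsc.broadcast(terms);

// One pass over all documents: emit each term that appears in a document,
// then sum per term to get the document frequency.
JavaPairRDD<String, Integer> docFreq = dataFrame.javaRDD()
    .flatMap((Row row) -> {
        String text = row.<String>getAs("text");
        List<String> hits = new ArrayList<>();
        for (String term : bTerms.value()) {
            if (text.contains(term)) {
                hits.add(term); // count each term at most once per document
            }
        }
        return hits; // Spark 1.6 FlatMapFunction returns an Iterable
    })
    .mapToPair(term -> new Tuple2<>(term, 1))
    .reduceByKey((a, b) -> a + b);

Note this does substring matching, the same as the original LIKE '%term%'; matching on whole tokens would need a different membership test inside the loop.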
Re: Calculate co-occurring terms
Hi again, I found an example in Scala <https://stackoverflow.com/questions/43797758/calculate-co-occurrence-terms-with-spark-using-scala?rq=1>, but I don't have any experience with Scala. Can anyone convert it to Java, please? Thank you, Donni

On Fri, Mar 23, 2018 at 8:57 AM, Donni Khan <prince.don...@googlemail.com> wrote:
> Hi,
>
> I have a collection of text documents and extracted the list of significant
> terms from that collection. I want to calculate a co-occurrence matrix for
> the extracted terms using Spark.
>
> I stored the collection of text documents in a DataFrame:
>
> StructType schema = new StructType(new StructField[] {
>     new StructField("ID", DataTypes.StringType, false, Metadata.empty()),
>     new StructField("text", DataTypes.StringType, false, Metadata.empty()) });
>
> // Create a DataFrame with respect to the new schema
> DataFrame preProcessedDF = sqlContext.createDataFrame(jrdd, schema);
>
> I can extract the list of terms from "preProcessedDF" into a List, RDD, or
> DataFrame. For each pair (term_i, term_j) I want to calculate the related
> frequency from the original dataset "preProcessedDF".
>
> Does anyone have a scalable solution?
>
> Thank you,
> Donni
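Not a line-by-line port of the linked Scala answer, but a Java sketch of the same idea (untested; `jsc` and `terms` are assumed, and `preProcessedDF` has the (ID, text) schema from the quoted post):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;
import scala.Tuple2;

Broadcast<List<String>> bTerms = jsc.broadcast(terms);

// For each document, find which significant terms occur in it, emit every
// unordered pair of those terms, and sum the pair counts across documents.
JavaPairRDD<Tuple2<String, String>, Integer> cooccurrence = preProcessedDF.javaRDD()
    .flatMapToPair((Row row) -> {
        String text = row.<String>getAs("text");
        List<String> present = new ArrayList<>();
        for (String t : bTerms.value()) {
            if (text.contains(t)) present.add(t);
        }
        List<Tuple2<Tuple2<String, String>, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < present.size(); i++) {
            for (int j = i + 1; j < present.size(); j++) {
                pairs.add(new Tuple2<>(new Tuple2<>(present.get(i), present.get(j)), 1));
            }
        }
        return pairs; // Spark 1.6 PairFlatMapFunction returns an Iterable
    })
    .reduceByKey((a, b) -> a + b);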
Calculate co-occurring terms
Hi, I have a collection of text documents and extracted the list of significant terms from that collection. I want to calculate a co-occurrence matrix for the extracted terms using Spark. I stored the collection of text documents in a DataFrame:

StructType schema = new StructType(new StructField[] {
    new StructField("ID", DataTypes.StringType, false, Metadata.empty()),
    new StructField("text", DataTypes.StringType, false, Metadata.empty()) });

// Create a DataFrame with respect to the new schema
DataFrame preProcessedDF = sqlContext.createDataFrame(jrdd, schema);

I can extract the list of terms from "preProcessedDF" into a List, RDD, or DataFrame. For each pair (term_i, term_j) I want to calculate the related frequency from the original dataset "preProcessedDF". Does anyone have a scalable solution? Thank you, Donni
high TFIDF value terms
Hi, does anyone know how I can get the terms with the highest TF-IDF values using Spark (Java)?

IDF idf = new IDF().setInputCol("TF").setOutputCol("IDF");
IDFModel idfModel = idf.fit(featurizedData);
DataFrame tfidf = idfModel.transform(featurizedData);

Thanks, Donni
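One sketch, under an assumption the post doesn't state: that the "IDF" column holds mllib SparseVectors, as the ml IDF produced in Spark 1.6. Sort each document's vector entries by weight to find its top-weighted indices. Note that if "TF" came from HashingTF there is no built-in way to map an index back to a term; CountVectorizer keeps a vocabulary if you need the actual words.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.mllib.linalg.SparseVector;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Print the ten highest-weighted (index, tfidf) entries per document.
tfidf.select("IDF").javaRDD().foreach((Row row) -> {
    SparseVector v = (SparseVector) row.get(0); // assumed sparse, see note above
    List<Tuple2<Integer, Double>> weighted = new ArrayList<>();
    for (int i = 0; i < v.indices().length; i++) {
        weighted.add(new Tuple2<>(v.indices()[i], v.values()[i]));
    }
    weighted.sort((a, b) -> Double.compare(b._2(), a._2())); // descending by weight
    System.out.println(weighted.subList(0, Math.min(10, weighted.size())));
});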
Singular Value Decomposition (SVD) in Spark Java
Hi, I would like to use Singular Value Decomposition (SVD) to extract the important concepts from a collection of text documents. I applied the full preprocessing pipeline (Tokenizer, IDFModel, Matrix, ...), then applied SVD:

SingularValueDecomposition<RowMatrix, Matrix> svd = rowMatrix.computeSVD(5, true, 1.0E-9d);
Matrix V = svd.V();

I actually want to map the results of the SVD back to the text (the related features). Does anyone know how I can get the original features from the Java Spark Matrix? Thank you. Donni
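One way to do it, sketched under an assumption the post doesn't state: that the feature vectors came from a CountVectorizerModel (`cvModel` below, hypothetical), whose vocabulary() array maps a row index of V back to a term. With HashingTF this mapping does not exist.

import org.apache.spark.mllib.linalg.Matrix;

// V is numFeatures x k: each row is a feature (term), each column a concept.
String[] vocab = cvModel.vocabulary(); // assumed CountVectorizerModel
Matrix V = svd.V();
for (int concept = 0; concept < V.numCols(); concept++) {
    System.out.println("Concept " + concept + ":");
    for (int feature = 0; feature < V.numRows(); feature++) {
        double weight = V.apply(feature, concept);
        if (Math.abs(weight) > 0.1) { // illustrative threshold
            System.out.println("  " + vocab[feature] + " -> " + weight);
        }
    }
}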
cosine similarity implementation in Java Spark
Hi all, is there an implementation of cosine similarity that supports Java? Thanks, Donni
Cosine Similarity between documents - Rows
I have a Spark job to compute the similarity between text documents:

RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowSimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD<MatrixEntry> entries = rowSimilarity.entries().toJavaRDD();
List<MatrixEntry> list = entries.collect();
for (MatrixEntry s : list) System.out.println(s);

Each MatrixEntry(i, j, value) represents the similarity between columns (that is, between the features of the documents). But how can I show the similarity between rows? Suppose I have five documents, Doc1 through Doc5; I would like to show the similarity between all of those documents. How do I get that? Any help? Thank you Donni
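One common workaround (a sketch, not from the thread; it assumes `vectorsRDD` is a JavaRDD<Vector> with one vector per document) is to transpose the matrix so documents become columns; columnSimilarities() then yields document-to-document cosine similarity:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

// Attach row indices, transpose via CoordinateMatrix, then compare columns
// (which are now the documents).
JavaRDD<IndexedRow> indexedRows = vectorsRDD.zipWithIndex()
    .map(t -> new IndexedRow(t._2(), t._1()));
RowMatrix docsAsColumns = new IndexedRowMatrix(indexedRows.rdd())
    .toCoordinateMatrix()
    .transpose()
    .toRowMatrix();
CoordinateMatrix docSimilarity = docsAsColumns.columnSimilarities(0.5);
// MatrixEntry(i, j, value) now refers to documents i and j.

Keep in mind that columnSimilarities with a threshold uses approximate (DIMSUM) sampling, so very small similarities may be dropped.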
text processing in Spark (Spark job gets stuck for several minutes)
Hi, I'm applying preprocessing methods to a large collection of text using Spark (Java). I created my own NLP pipeline as normal Java code and call it inside the map function, like this:

MyRDD.map(call the NLP pipeline for each row)

I run my job on a cluster of 14 machines (32 cores and about 140 GB each). The job runs correctly and distributes the documents across the executors, but it gets stuck on the last task for several minutes. Looking at the job details, I found that most of the documents are processed by several executors, but one task gets stuck on a small number of documents; it looks like the task is waiting for something, and after 10-20 minutes it continues processing the remaining documents and finishes. I also tried different configurations, but the behavior is the same. Any help? Thanks, Donni
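A single long straggler at the end of a stage usually points at skewed partitions (a few very large documents, or many documents hashed into one partition). One thing to try, sketched with illustrative numbers and hypothetical names (`nlpPipeline.process` stands in for the custom pipeline call, and the element type is assumed to be String):

import org.apache.spark.api.java.JavaRDD;

// Spread the documents over a small multiple of the total core count
// (14 machines x 32 cores here) so no single partition dominates the last wave.
JavaRDD<String> balanced = MyRDD.repartition(14 * 32 * 3);
JavaRDD<String> processed = balanced.map(doc -> nlpPipeline.process(doc));

If one document is pathologically large, repartitioning won't help; logging each document's size inside the map is a quick way to check for that.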