Exception: JDK-8154035 when using the wholeTextFiles API

2017-07-05 Thread Reth RM
Hi, using sc.wholeTextFiles to read a WARC file (example file here), Spark reports an error with the stack trace pasted here: https://pastebin.com/qfmM2eKk. It looks like the same bug reported here:
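For context, a minimal sketch of the call in question (the directory path is a hypothetical stand-in, and a local-mode SparkContext is assumed):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("warc-read").setMaster("local[*]"))
// wholeTextFiles returns (filePath, fileContent) pairs; the stack trace above
// is thrown while reading the WARC file through this API.
val warcs = sc.wholeTextFiles("/data/warc/")
warcs.mapValues(_.length).collect().foreach(println)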

Re: Spark job profiler results showing high TCP CPU time

2017-06-28 Thread Reth RM
Marcelo Vanzin <van...@cloudera.com> wrote: > That thread looks like the connection between the Spark process and jvisualvm. It's expected to show high up when doing sampling if the app is not doing much else. > On Fri, Jun 23, 2017 at 10:46 AM, R

Spark job profiler results showing high TCP CPU time

2017-06-23 Thread Reth RM
Running a Spark job on a local machine, and profiler results indicate that the highest time is spent in *sun.rmi.transport.tcp.TCPTransport$ConnectionHandler*. A screenshot of the profiler result can be seen here: https://jpst.it/10i-V. The Spark job (program) is performing IO (the sc.wholeTextFiles method of the Spark API),

KMeans clustering resulting in skewed clusters

2017-03-24 Thread Reth RM
Hi, I am using Spark k-means for clustering records that consist of news documents; vectors are created by applying TF-IDF. The dataset I am using for testing right now is the ground-truth classified http://qwone.com/~jason/20Newsgroups/. The issue is that all the documents are getting assigned to the same
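A common remedy worth trying (an assumption on my part, not something stated in the thread) is to L2-normalize the TF-IDF vectors before clustering, so that document length does not dominate the Euclidean distances k-means uses; a minimal sketch:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
import org.apache.spark.rdd.RDD

// docs: RDD[Seq[String]] of tokenized news documents (assumed already built)
def clusterDocs(docs: RDD[Seq[String]], k: Int): Unit = {
  val tf = new HashingTF().transform(docs)
  tf.cache()                                   // IDF.fit and transform both traverse tf
  val tfidf = new IDF(minDocFreq = 2).fit(tf).transform(tf)
  val normalized = new Normalizer().transform(tfidf) // unit-length vectors
  val model = KMeans.train(normalized, k, 20)
  // inspect cluster sizes to check whether the skew persists
  model.predict(normalized).countByValue().toSeq.sortBy(_._1).foreach(println)
}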

MLlib: TF-IDF Computation Improvement

2016-12-14 Thread Reth RM
Hi, is my understanding correct that, right now, TF-IDF is computed in 3 steps? 1) Apply HashingTF on records and generate TF vectors. 2) An IDF model is then created from the input TF vectors, which calculates the DF (document frequency) of each term. 3) Finally, the TF vectors are transformed to
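For reference, a minimal sketch of those 3 steps with the MLlib RDD API (records is assumed to be an RDD of already-tokenized documents):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def tfidf(records: RDD[Seq[String]]): RDD[Vector] = {
  val tf = new HashingTF().transform(records) // 1) hash terms into TF vectors
  tf.cache()                                  // fit() makes its own pass over tf
  val idfModel = new IDF().fit(tf)            // 2) compute DF, then IDF, per term
  idfModel.transform(tf)                      // 3) scale TF vectors by IDF weights
}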

Optimization for Processing Millions of HTML Files

2016-12-12 Thread Reth RM
Hi, I have millions of HTML files in a directory and am using the "wholeTextFiles" API to load them and process further. Right now, testing with 40k records, it waits a minimum of 8-9 minutes at the time of loading the files (wholeTextFiles). What are some recommended optimizations? Should I consider any
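One knob to check (a sketch under the assumption that default partitioning is part of the bottleneck): wholeTextFiles accepts a minPartitions argument, which spreads many small files across more tasks instead of a few large input splits.

// Hypothetical path; tune minPartitions to the cluster's core count.
val pages = sc.wholeTextFiles("/data/html/", minPartitions = 200)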

Mapping KMeans training data back to the respective records

2016-11-23 Thread Reth RM
I am using the wholeTextFiles API to load a bunch of text files, caching this object, mapping the text content to TF-IDF vectors, and then applying k-means on these vectors. After training, the k-means model predicts the clusterId of the training data by taking a list of training data; the question is how to map
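A minimal sketch of one way to keep that mapping (assuming the (path, text) pairs from wholeTextFiles are carried alongside the vectors they produce):

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// docsWithVectors: RDD[(String, Vector)] pairing each file path with its
// TF-IDF vector (assumed to be built earlier in the pipeline)
def assignClusters(model: KMeansModel,
                   docsWithVectors: RDD[(String, Vector)]): RDD[(Int, String)] =
  docsWithVectors.map { case (path, vec) => (model.predict(vec), path) }

Predicting per record this way avoids relying on the order of a predicted list lining up with the order of the training input.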

Re: Clustering Webpages using KMeans and Spark APIs: GC limit exceeded

2016-11-04 Thread Reth RM
at 11:13 AM, Reth RM <reth.ik...@gmail.com> wrote: > Hi, can you please guide me through parallelizing the task of extracting webpage text, converting text to doc vectors, and finally applying k-means? I get a "GC overhead limit exceeded at java.util.Arrays.copyOfR

Clustering Webpages using KMeans and Spark APIs: GC limit exceeded

2016-11-04 Thread Reth RM
Hi, can you please guide me through parallelizing the task of extracting webpage text, converting text to doc vectors, and finally applying k-means? I get a "GC overhead limit exceeded at java.util.Arrays.copyOfRange" at task 3 below. Detailed stack trace: https://jpst.it/P33P. Right now, webpage
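Two mitigations that often help with this error (assumptions on my part, not from the thread): store the cached page text serialized rather than as heap strings, and split the load across more partitions so no single task buffers too much; a sketch:

import org.apache.spark.storage.StorageLevel

// Hypothetical path; MEMORY_ONLY_SER trades CPU for a smaller heap footprint.
val pages = sc.wholeTextFiles("/data/webpages/", minPartitions = 400)
  .persist(StorageLevel.MEMORY_ONLY_SER)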

Importing the org.apache.spark.Logging class

2016-10-27 Thread Reth RM
Updated Spark to version 2.0.0 and have an issue with importing org.apache.spark.Logging. Any suggested fix for this issue?
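org.apache.spark.Logging was made private to Spark in 2.0, so application code can no longer import it; one common workaround is to log through SLF4J directly (a sketch, assuming SLF4J is on the classpath, as it is in Spark's own distribution):

import org.slf4j.{Logger, LoggerFactory}

// A drop-in replacement trait for application code.
trait Logging {
  @transient lazy val log: Logger = LoggerFactory.getLogger(getClass)
}

class MyJob extends Logging {
  def run(): Unit = log.info("job started")
}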

Re: K-Means: Retrieving Cluster Members

2016-10-17 Thread Reth RM
On Mon, Oct 17, 2016 at 10:56 AM, Reth RM <reth.ik...@gmail.com> wrote: > Could you please point me to sample code to retrieve the cluster members of k-means? The below code prints cluster centers. *I need the cluster members belonging to each center.*

K-Means: Retrieving Cluster Members

2016-10-17 Thread Reth RM
Could you please point me to sample code to retrieve the cluster members of k-means? The code below prints cluster centers. *I need the cluster members belonging to each center.*

val clusters = KMeans.train(parsedData, numClusters, numIterations)
clusters.clusterCenters.foreach(println)
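A minimal sketch of one way to get the members (assuming parsedData: RDD[Vector] and the clusters model from the snippet above): predict each point's cluster id and group by it.

val membersByCluster = parsedData
  .map(v => (clusters.predict(v), v)) // assign each point to its nearest center
  .groupByKey()                       // collect the members per cluster id

membersByCluster.collect().foreach { case (id, members) =>
  println(s"cluster $id: ${members.size} members")
}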

Re: Newbie Q: Issue connecting to a standalone Spark master from a Scala app

2016-09-27 Thread Reth RM
what you are trying? It is probably an IntelliJ issue. On Tue, Sep 27, 2016 at 3:59 PM, Reth RM <reth.ik...@gmail.com> wrote: > Hi, I have an issue connecting to the Spark master, receiving a RuntimeException: java.io.InvalidClassException: org.apache.spar

Newbie Q: Issue connecting to a standalone Spark master from a Scala app

2016-09-26 Thread Reth RM
Hi, I have an issue connecting to the Spark master, receiving a RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage. I followed the steps mentioned below. Can you please point me to where I am going wrong? 1. Downloaded Spark (version spark-2.0.0-bin-hadoop2.7) 2.
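For what it's worth, java.io.InvalidClassException on connect usually means the Spark version on the application's classpath differs from the one the standalone master is running; a sketch of keeping them aligned (the master URL is a hypothetical stand-in):

import org.apache.spark.{SparkConf, SparkContext}

// build.sbt should pin the same version as the running cluster, e.g.:
//   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
val conf = new SparkConf()
  .setAppName("standalone-connect-test")
  .setMaster("spark://master-host:7077") // use the URL shown on the master web UI

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).sum()) // quick sanity check of the connection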