Hi,
I am having a hard time running an outer join operation on two parquet
datasets. The datasets are large, ~500GB, with a lot of columns, in the
tune of 1,000.
As per the administrator-imposed YARN queue limits, I can have a total of 20
vcores and 8GB of memory per executor.
I specified memory overhead an
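For context on the 8GB-per-executor limit: on YARN, the container Spark requests is the executor memory plus an overhead, which in Spark 1.x defaults to 10% of executor memory with a 384MB floor (assuming `spark.yarn.executor.memoryOverhead` is left unset; the constants below mirror those defaults and are a sketch, not Spark source):

```python
# Sketch of the YARN container sizing arithmetic (Spark 1.x defaults,
# assuming spark.yarn.executor.memoryOverhead is not set explicitly).
MEMORY_OVERHEAD_FACTOR = 0.10  # fraction of executor memory added as overhead
MEMORY_OVERHEAD_MIN_MB = 384   # floor on the overhead

def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Total memory YARN must grant for one executor container."""
    if overhead_mb is None:
        overhead_mb = max(int(MEMORY_OVERHEAD_FACTOR * executor_memory_mb),
                          MEMORY_OVERHEAD_MIN_MB)
    return executor_memory_mb + overhead_mb

# With an 8GB executor, the container actually needs more than 8GB:
print(yarn_container_mb(8192))  # 8192 + 819 = 9011 MB
```

So an 8GB executor will not fit in a queue slot that caps containers at exactly 8GB; either lower the executor memory or raise the queue limit.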
ra/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22>
> so AFAIK no, only random forests and GBTs using entropy or Gini for
> information gain are supported.
>
> On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet wrote:
>
>> Hi,
>>
>> I wish to know if MLli
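For reference, the two impurity measures the reply mentions (entropy and Gini) are simple to state; a plain-Python illustration of the formulas, not MLlib's implementation:

```python
import math

def gini(labels):
    """Gini impurity: 1 - sum(p_c^2) over class frequencies."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    """Shannon entropy: -sum(p_c * log2 p_c) over class frequencies."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(gini([0, 0, 1, 1]))     # 0.5 (maximally impure two-class split)
print(entropy([0, 0, 1, 1]))  # 1.0
```

A tree split is chosen to maximize the drop in impurity (the information gain) between the parent node and its children.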
Hi,
I wish to know if MLlib supports CHAID regression and classification trees.
If yes, how can I build them in spark?
Thanks,
Jatin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I am getting very high GC time in my jobs. For smaller/real-time load, this
becomes a real problem.
Below are the details of a task I just ran. What could be the cause of such
skewed GC times?
36 | 26010 | SUCCESS | PROCESS_LOCAL | 2 / Slave1 | 2015/03/17 11:18:44 | 20 s | 11
er saying there are no actual records). You need
> to ensure your data is more evenly distributed before this step.
>
> On Thu, Feb 19, 2015 at 10:53 AM, jatinpreet wrote:
>
>> Hi,
>>
>> I am running Spark 1.2.1 for compute-intensive jobs comprising multiple
&g
Hi,
I am running Spark 1.2.1 for compute-intensive jobs comprising multiple
tasks. I have observed that most tasks complete very quickly, but there are
always one or two tasks that take a lot of time to complete thereby
increasing the overall stage time. What could be the reason for this?
Foll
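A common cause of such stragglers is key skew: a handful of heavy keys funnel most records into one or two tasks. The reply in the thread suggests distributing the data more evenly; one frequently suggested trick is to "salt" hot keys so their records spread across partitions. The sketch below is plain Python simulating a hash partitioner (the helper names and `salt_buckets` knob are illustrative, not Spark API):

```python
import zlib
from collections import Counter

def part(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(repr(key).encode()) % num_partitions

def partition_counts(keys, num_partitions, salt_buckets=1):
    """Count records per partition; salting turns one hot key into
    `salt_buckets` distinct shuffle keys, spreading its records around."""
    counts = Counter()
    for i, k in enumerate(keys):
        counts[part((k, i % salt_buckets), num_partitions)] += 1
    return counts

keys = ["hot"] * 9000 + ["k%d" % i for i in range(1000)]  # one dominant key

no_salt = partition_counts(keys, 8, salt_buckets=1)
salted = partition_counts(keys, 8, salt_buckets=32)
print(max(no_salt.values()))  # one partition holds all 9000 "hot" rows
print(max(salted.values()))   # the load is spread far more evenly
```

With a salted key, an aggregation has to run in two passes (aggregate per salted key, then merge per original key), which is the usual price of this workaround.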
Hi,
I am using MLlib's Naive Bayes classifier to classify textual data. I am
accessing the posterior probabilities for each class through a hack.
Once I have trained the model, I want to remove documents whose confidence
of classification is low. Say for a document, if the highest class
probabi
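For reference, the "hack" amounts to computing, per class, the score log-prior + theta·x and normalizing with a softmax to get posteriors; a plain-Python sketch with hypothetical model values (not the MLlib API):

```python
import math

def posterior(log_prior, log_cond, x):
    """Per-class posterior for multinomial Naive Bayes:
    softmax over (log pi_c + theta_c . x)."""
    scores = [lp + sum(t * xi for t, xi in zip(row, x))
              for lp, row in zip(log_prior, log_cond)]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 2-class model over 3 term features.
log_prior = [math.log(0.5), math.log(0.5)]
log_cond = [[math.log(0.7), math.log(0.2), math.log(0.1)],
            [math.log(0.1), math.log(0.2), math.log(0.7)]]
p = posterior(log_prior, log_cond, [3, 0, 0])
print(max(p))  # confidence of the argmax class, ~0.997 here
```

Dropping a document then just means comparing `max(p)` against a chosen threshold.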
Hi,
I am using Spark version 1.1 in standalone mode on the cluster. Sometimes,
during Naive Bayes training, I get an OptionalDataException at line,
map at NaiveBayes.scala:109
I am getting following exception on the console,
java.io.OptionalDataException:
java.io.ObjectInputStream.readO
Hi,
I wish to cluster a set of textual documents into an undefined number of
classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
me to give a pre-defined number of classes.
Is there any algorithm which is intelligent enough to identify how many
classes should be made based on
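MLlib's K-means indeed needs k up front; a common workaround is the elbow method: run the clustering for several candidate values of k and pick the point where the within-cluster cost stops dropping sharply. A toy 1-D sketch of the idea (illustrative only, not MLlib code):

```python
def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means; returns the total within-cluster
    squared distance for a given k."""
    # Spread the initial centers across the data order.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]  # three obvious groups
costs = {k: kmeans_1d(data, k) for k in (1, 2, 3, 4)}
# Cost drops sharply up to k=3, then flattens: the "elbow" suggests k=3.
```

Algorithms that infer the number of clusters themselves (e.g. DBSCAN or Dirichlet-process mixtures) were not in MLlib at the time, so this kind of sweep over k is the pragmatic route.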
where
> the dot product comes from.
>
> Your bug is probably something lower-level and simple. I'd debug the
> Spark example and print exactly its values for the log priors and
> conditional probabilities, and the matrix operations, and yours too,
> and see where the diffe
Hi,
I have been running into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category. But the predicted category is mostly different from
the prediction done by MLlib.
I am fetching iterators of the pi
Hi Sean,
The values brzPi and brzTheta are of the form
breeze.linalg.DenseVector. So would I have to convert them back to
simple vectors and use a library to perform addition/multiplication?
If yes, can you please point me to the conversion logic and vector operation
library for Java?
Thanks,
Ja
Hi,
I am trying to access the posterior probability of Naive Bayes prediction
with MLlib using Java. As the member variables brzPi and brzTheta are
private, I applied a hack to access the values through reflection.
I am using Java and couldn't find a way to use the Breeze library from it.
If I
Thanks Arush! Your example is nice and easy to understand. I am implementing
it through Java though.
Jatin
-
Novice Big Data Programmer
--
Thanks Sean, I was actually using instances created elsewhere inside my RDD
transformations, which as I understand is against the Spark programming model. I
was referred to a talk about UIMA and Spark integration from this year's
Spark summit, which had a workaround for this problem. I just had to make
Hi,
I am planning to use the UIMA library to process data in my RDDs. I have had
bad experiences using third-party libraries inside worker tasks: the system
gets plagued with serialization issues. And as UIMA classes are not
necessarily Serializable, I am not sure it will work.
Please expla
I believe assuming uniform priors is the way to go for my use case.
I am not sure about how to 'drop the prior term' with MLlib. I am just
providing the samples as they come after creating term vectors for each
sample. But I guess I can Google that information.
I appreciate all the help. Spark
Sean,
My last sentence didn't come out right. Let me try to explain my question
again.
For instance, I have two categories, C1 and C2. I have trained 100 samples
for C1 and 10 samples for C2.
Now, I predict two samples one each of C1 and C2, namely S1 and S2
respectively. I get the following pre
Thanks a lot Sean. You are correct in assuming that my examples fall under a
single category.
It is interesting to see that the posterior probability can actually be
treated as something that is stable enough to have a constant threshold
value on per class basis. It would, I assume, keep changing
I have been trying the Naive Bayes implementation of Spark's MLlib. During
the testing phase, I wish to eliminate data with low confidence of prediction.
My data set primarily consists of form based documents like reports and
application forms. They contain key-value pair type text and hence I assume
Thanks, I will try it out and raise a request for making the variables
accessible.
An unrelated question, do you think the probability value thus calculated
will be a good measure of confidence in prediction? I have been reading
mixed opinions about the same.
Jatin
-
Novice Big Data Progra
Thanks for the answer. The variables brzPi and brzTheta are declared private.
I am writing my code in Java; otherwise I could have replicated the Scala
class and performed the desired computation, which is, as I observed, a
multiplication of brzTheta with the test vector, adding this value to brzPi.
A
Hi,
Is there a way to get the confidence value of a prediction with MLlib's
implementation of Naive Bayes classification? I wish to eliminate the
samples that were classified with low confidence.
Thanks,
Jatin
-
Novice Big Data Programmer
--
Great! Thanks for the information. I will try it out.
-
Novice Big Data Programmer
--
--
Hi,
I am running a small 6-node Spark cluster for testing purposes. Recently,
one of the nodes' disks was filled up by temporary files and there was no
space left on the device. Due to this my Spark jobs started failing, even
though on the Spark Web UI the node was shown as 'Alive'. Once I logged o
Hi,
I am trying to persist the files generated as a result of Naive Bayes
training with MLlib. These comprise the model file, a label index (own
class), and a term dictionary (own class). I need to save them to an HDFS
location and then deserialize when needed for prediction.
How can I do the same wi
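Spark 1.1's NaiveBayesModel has no built-in save method (model save/load only arrived in later releases), so a common workaround is to serialize the model objects yourself to a file on HDFS and deserialize them before prediction. The sketch below illustrates the round trip in Python with pickle on a hypothetical stand-in bundle; with the JVM classes, standard Java serialization to an HDFS output stream plays the same role:

```python
import os
import pickle
import tempfile

# Stand-in for what the mail describes: model parameters plus the
# label index and term dictionary (user-defined classes there).
# All values are made up for illustration.
bundle = {
    "labels": [0.0, 1.0],
    "pi": [-0.69, -0.69],                   # log priors
    "theta": [[-0.5, -1.2], [-1.2, -0.5]],  # log conditional probs
    "label_index": {"sports": 0, "politics": 1},
    "term_dictionary": {"goal": 0, "vote": 1},
}

path = os.path.join(tempfile.mkdtemp(), "nb_model.bin")
with open(path, "wb") as f:
    pickle.dump(bundle, f)        # serialize once, after training

with open(path, "rb") as f:
    restored = pickle.load(f)     # deserialize when prediction is needed
print(restored["label_index"]["politics"])  # 1
```

The one requirement this imposes on the JVM side is that the label-index and dictionary classes implement Serializable.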
Hi,
I was able to get the training running in local mode with default settings;
there was a problem with the document labels, which were quite large (not 20
as suggested earlier).
I am currently training 175000 documents on a single node with 2GB of
executor memory and 5GB of driver memory successfull
Xiangrui, Thanks for replying.
I am using the subset of newsgroup20 data. I will send you the vectorized
data for analysis shortly.
I have tried running in local mode as well but I get the same OOM exception.
I started with 4GB of data but then moved to smaller set to verify that
everything was
I get the following stacktrace if it is of any help.
14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7:
List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7
(MapPartitionsRDD[24] at combineByK
Xiangrui,
Yes, the total number of terms is 43839. I have also tried running it using
different values of parallelism ranging from 1/core to 10/core. I also used
multiple configurations like setting spark.storage.memoryFraction and
spark.shuffle.memoryFraction to default values. The point to note
Hi,
I have been facing an unusual issue with Naive Bayes training. I run out of
heap space even with limited data during the training phase. I am trying to
run the same on a rudimentary cluster of two development machines in
standalone mode. I am reading data from an HBase table, converting them in
Thanks Xiangrui and RJ for the responses.
RJ, I have created a Jira for the same. It would be great if you could look
into this. Following is the link to the improvement task,
https://issues.apache.org/jira/browse/SPARK-3614
Let me know if I can be of any help and please keep me posted!
Thanks,
J
Hi,
I have been running into memory overflow issues while creating TFIDF vectors
to be used in document classification with MLlib's Naive Bayes
implementation.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
Memory overfl
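One way to keep TF-IDF vector construction from blowing up memory is to fix the dimensionality up front with feature hashing, which is the idea behind MLlib's HashingTF. A plain-Python sketch of the term-frequency half (illustrative, not MLlib code; the helper name and default size are assumptions):

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=1 << 12):
    """Term-frequency vector of fixed size: each term is hashed into one
    of `num_features` buckets, so memory stays bounded no matter how
    large the vocabulary grows (the idea behind MLlib's HashingTF)."""
    tf = Counter(zlib.crc32(t.encode()) % num_features for t in tokens)
    return dict(tf)  # sparse representation: bucket index -> count

doc = "spark mllib naive bayes spark".split()
vec = hashing_tf(doc)
print(sum(vec.values()))  # 5: total term count is preserved
```

The trade-off is hash collisions (two terms sharing a bucket), which in practice matters little for classification at reasonable `num_features`; the IDF weighting step is then computed over these fixed-size vectors.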
Hi,
I have been able to get the same accuracy with MLlib as Mahout's. The
pre-processing phase of Mahout was the reason behind the accuracy mismatch.
After studying and applying the same logic in my code, it worked like a
charm.
Thanks,
Jatin
-
Novice Big Data Programmer
--
I have also run some tests on the other algorithms available with MLlib but
got dismal accuracy. Is the method of creating LabeledPoint RDD different
for other algorithms such as, LinearRegressionWithSGD?
Any help is appreciated.
-
Novice Big Data Programmer
--
Thanks for the information Xiangrui. I am using the following example to
classify documents.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
I am not sure if this is the best way to convert textual data into vectors.
Can you please confirm
Hi,
I tried running the classification program on the famous newsgroup data.
This had an even more drastic effect on the accuracy, as it dropped from
~82% in Mahout to ~72% in Spark MLlib.
Please help me in this regard as I have to use Spark in a production system
very soon and this is a blocker
Hi,
I had been using Mahout's Naive Bayes algorithm to classify document data.
For a specific train and test set, I was getting accuracy in the range of
86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity
of 82%.
I am using same version of Lucene and logic to generate TFIDF
Hi,
I am contemplating the use of Hadoop with Java 8 in a production system. I
will be using Apache Spark for doing most of the computations on data stored
in HBase.
Although Hadoop seems to support JDK 8 with some tweaks, the official HBase
site states the following for version 0.98,
Running wi