Spark SQL : Join operation failure

2017-02-21 Thread jatinpreet
Hi, I am having a hard time running an outer join operation on two Parquet datasets. The dataset size is large, ~500GB, with a lot of columns, to the tune of 1000. As per YARN administrator-imposed limits in the queue, I can have a total of 20 vcores and 8GB memory per executor. I specified memory overhead
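A common remedy in this situation is to leave explicit off-heap headroom so YARN does not kill executors during the shuffle. Below is a sketch of the kind of settings one might pass to spark-submit; the specific values are illustrative assumptions, not taken from the thread. Note that Spark 1.x/2.0–2.2 on YARN uses `spark.yarn.executor.memoryOverhead` (in MB), renamed to `spark.executor.memoryOverhead` in Spark 2.3+.

```python
# Illustrative --conf settings for a shuffle-heavy outer join under a
# tight YARN cap (8 GB total per executor). Values are assumptions.
conf = {
    "spark.executor.memory": "6g",                 # heap, leaving headroom under 8 GB
    "spark.yarn.executor.memoryOverhead": "2048",  # off-heap headroom, in MB
    "spark.sql.shuffle.partitions": "2000",        # smaller tasks for a ~500 GB join
}

# Render as spark-submit arguments
submit_args = " ".join(f"--conf {k}={v}" for k, v in conf.items())
```

The idea is that heap plus overhead stays within the 8GB-per-executor limit, while a higher shuffle partition count keeps individual join tasks small enough to fit.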

CHAID Decision Trees

2015-08-25 Thread jatinpreet
Hi, I wish to know if MLlib supports CHAID regression and classification trees. If yes, how can I build them in Spark? Thanks, Jatin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html Sent from the Apache Spark User List

Re: CHAID Decision Trees

2015-08-25 Thread Jatinpreet Singh
://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22 so AFAIK no, only random forests and GBTs using entropy or Gini for information gain are supported. On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I wish to know if MLlib supports CHAID

High GC time

2015-03-17 Thread jatinpreet
Hi, I am getting very high GC time in my jobs. For smaller/real-time load, this becomes a real problem. Below are the details of a task I just ran. What could be the cause of such skewed GC times? 36 26010 SUCCESS PROCESS_LOCAL 2 / Slave1 2015/03/17 11:18:44 20 s
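For GC-dominated tasks like this, the usual first steps are switching the executor JVMs to a lower-pause collector and reducing how much heap is reserved for caching. The flags and values below are illustrative assumptions, not recommendations from the thread; `spark.storage.memoryFraction` is the Spark 1.x setting (superseded by unified memory management in later versions).

```python
# Illustrative executor JVM options for diagnosing and reducing GC time.
gc_opts = " ".join([
    "-XX:+UseG1GC",           # lower pause times than the default ParallelGC, often
    "-XX:+PrintGCDetails",    # surface per-collection logs in executor stderr
    "-XX:+PrintGCTimeStamps",
])
conf = {
    "spark.executor.extraJavaOptions": gc_opts,
    # Spark 1.x: cache less aggressively so tasks have more working heap
    "spark.storage.memoryFraction": "0.3",
}
```

With GC logs enabled, comparing GC pause totals against task duration (20 s here) shows whether collection pressure, rather than compute, dominates the skewed tasks.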

Re: Some tasks taking too much time to complete in a stage

2015-02-19 Thread Jatinpreet Singh
some header saying there are no actual records). You need to ensure your data is more evenly distributed before this step. On Thu, Feb 19, 2015 at 10:53 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am running Spark 1.2.1 for compute intensive jobs comprising of multiple tasks. I have

OptionalDataException during Naive Bayes Training

2015-01-09 Thread jatinpreet
Hi, I am using Spark version 1.1 in standalone mode in the cluster. Sometimes, during Naive Bayes training, I get an OptionalDataException at line, map at NaiveBayes.scala:109 I am getting the following exception on the console, java.io.OptionalDataException:

Clustering text data with MLlib

2014-12-29 Thread jatinpreet
Hi, I wish to cluster a set of textual documents into an undefined number of classes. The clustering algorithm provided in MLlib, i.e. K-means, requires me to give a pre-defined number of classes. Is there any algorithm which is intelligent enough to identify how many classes should be made based on
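One common workaround when K-means is the only option is to run it for several values of k and pick the "elbow" where the within-cluster error stops improving. The sketch below implements that heuristic in pure Python on 1-D points; the stopping threshold and data are assumptions for illustration, not anything MLlib provides.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny 1-D k-means; returns (centers, total squared error)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[i].append(p)
        # Keep the old center if a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse

def pick_k(points, max_k=6, min_gain=0.05):
    """Stop when adding a cluster improves SSE by < min_gain of the k=1 SSE."""
    sses = []
    for k in range(1, max_k + 1):
        _, sse = kmeans(points, k)
        sses.append(sse)
        if k >= 2 and (sses[-2] - sses[-1]) / sses[0] < min_gain:
            return k - 1
    return max_k

best_k = pick_k([1.0, 1.1, 0.9, 10.0, 10.1, 9.9])
```

Algorithms that infer the number of clusters directly (e.g. DBSCAN or hierarchical clustering with a distance cutoff) were not in MLlib at the time, so this wrap-around-K-means approach is the pragmatic route.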

Re: Accessing posterior probability of Naive Bayes prediction

2014-11-28 Thread jatinpreet
is probably something lower-level and simple. I'd debug the Spark example and print exactly its values for the log priors and conditional probabilities, and the matrix operations, and yours too, and see where the difference is. On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet [hidden email] http

Re: Accessing posterior probability of Naive Bayes prediction

2014-11-27 Thread jatinpreet
Hi, I have been running into some trouble while converting the code to Java. I have done the matrix operations as directed and tried to find the maximum score for each category. But the predicted category is mostly different from the prediction done by MLlib. I am fetching iterators of the

Re: Accessing posterior probability of Naive Bayes prediction

2014-11-26 Thread jatinpreet
Hi Sean, The values brzPi and brzTheta are of the form breeze.linalg.DenseVector[Double]. So would I have to convert them back to simple vectors and use a library to perform addition/multiplication? If yes, can you please point me to the conversion logic and vector operation library for Java?

Accessing posterior probability of Naive Bayes prediction

2014-11-25 Thread jatinpreet
Hi, I am trying to access the posterior probability of a Naive Bayes prediction with MLlib using Java. As the member variables brzPi and brzTheta are private, I applied a hack to access the values through reflection. I am using Java and couldn't find a way to use the Breeze library with Java. If
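The computation being discussed in this thread is small: for multinomial Naive Bayes the per-class score is the log prior (brzPi) plus the dot product of the log conditional probabilities (brzTheta) with the term-count vector, and normalizing those scores gives the posterior. A minimal sketch in pure Python, with made-up two-class parameters standing in for the private MLlib vectors:

```python
import math

# Stand-ins for brzPi (log class priors) and brzTheta (log P(term | class));
# the numbers are hypothetical, purely for illustration.
log_prior = [math.log(0.6), math.log(0.4)]
log_theta = [
    [math.log(0.7), math.log(0.3)],   # class 0
    [math.log(0.2), math.log(0.8)],   # class 1
]

def posterior(x):
    """Posterior per class from score_c = pi_c + theta_c . x, normalized."""
    scores = [lp + sum(t * xi for t, xi in zip(row, x))
              for lp, row in zip(log_prior, log_theta)]
    m = max(scores)                   # log-sum-exp for numerical stability
    z = sum(math.exp(s - m) for s in scores)
    return [math.exp(s - m) / z for s in scores]

probs = posterior([3.0, 1.0])         # term-count vector for one document
pred = max(range(len(probs)), key=probs.__getitem__)
```

The argmax of the unnormalized scores matches MLlib's `predict`; the normalized values are the posterior probabilities the thread is after.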

Re: Spark serialization issues with third-party libraries

2014-11-24 Thread jatinpreet
Thanks Arush! Your example is nice and easy to understand. I am implementing it through Java though. Jatin - Novice Big Data Programmer -- View this message in context:

Re: Spark serialization issues with third-party libraries

2014-11-23 Thread jatinpreet
Thanks Sean, I was actually using instances created elsewhere inside my RDD transformations which, as I understand it, is against the Spark programming model. I was referred to a talk about UIMA and Spark integration from this year's Spark Summit, which had a workaround for this problem. I just had to make
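The usual workaround for non-serializable helpers like a UIMA pipeline is to construct the object lazily on each worker instead of shipping it from the driver. The sketch below uses plain Python in place of the RDD API, with a trivial placeholder standing in for the expensive, unserializable analyzer; the names are hypothetical.

```python
# Lazy per-worker initialization: the analyzer is built on first use in the
# worker process and never serialized across the driver/executor boundary.
_analyzer = None

def get_analyzer():
    """Constructed once per worker; str.upper stands in for a UIMA pipeline."""
    global _analyzer
    if _analyzer is None:
        _analyzer = str.upper
    return _analyzer

def process_partition(records):
    analyzer = get_analyzer()        # first call constructs it locally
    return [analyzer(r) for r in records]

out = process_partition(["a", "b"])
```

In Spark terms this is the `mapPartitions` pattern (or a `@transient lazy val` in Scala): only the cheap, serializable closure travels, and the heavyweight object lives entirely on the executor.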

Re: Naive Bayes classification confidence

2014-11-20 Thread jatinpreet
Thanks a lot Sean. You are correct in assuming that my examples fall under a single category. It is interesting to see that the posterior probability can actually be treated as something stable enough to have a constant threshold value on a per-class basis. It would, I assume, keep changing

Re: Naive Bayes classification confidence

2014-11-20 Thread jatinpreet
Sean, My last sentence didn't come out right. Let me try to explain my question again. For instance, I have two categories, C1 and C2. I have trained 100 samples for C1 and 10 samples for C2. Now, I predict two samples one each of C1 and C2, namely S1 and S2 respectively. I get the following

Re: Naive Bayes classification confidence

2014-11-20 Thread jatinpreet
I believe assuming uniform priors is the way to go for my use case. I am not sure about how to 'drop the prior term' with MLlib. I am just providing the samples as they come after creating term vectors for each sample. But I guess I can Google that information. I appreciate all the help. Spark
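The effect of dropping the prior is easy to see numerically. With the thread's imbalanced counts (100 samples for C1, 10 for C2), the learned prior tilts every score toward C1; replacing it with a uniform prior can flip a borderline prediction. The likelihood values below are hypothetical, chosen only to illustrate the flip:

```python
import math

counts = {"C1": 100, "C2": 10}               # training counts from the thread
total = sum(counts.values())
log_likelihood = {"C1": -4.0, "C2": -3.5}    # hypothetical per-sample scores

def score(cls, uniform_prior=False):
    """Naive Bayes log score: log prior + log likelihood."""
    prior = 1 / len(counts) if uniform_prior else counts[cls] / total
    return math.log(prior) + log_likelihood[cls]

learned_pred = max(counts, key=score)
uniform_pred = max(counts, key=lambda c: score(c, uniform_prior=True))
```

MLlib itself learns the prior from class frequencies, so "dropping the prior term" in practice means either balancing the training set or re-scoring manually as above with `uniform_prior=True`.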

Spark serialization issues with third-party libraries

2014-11-20 Thread jatinpreet
Hi, I am planning to use UIMA library to process data in my RDDs. I have had bad experiences while using third party libraries inside worker tasks. The system gets plagued with Serialization issues. But as UIMA classes are not necessarily Serializable, I am not sure if it will work. Please

Naive Bayes classification confidence

2014-11-19 Thread jatinpreet
I have been trying the Naive Bayes implementation in Spark's MLlib. During the testing phase, I wish to eliminate data with low confidence of prediction. My data set primarily consists of form-based documents like reports and application forms. They contain key-value pair type text and hence I assume

Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks for the answer. The variables brzPi and brzTheta are declared private. I am writing my code in Java; otherwise I could have replicated the Scala class and performed the desired computation, which is, as I observed, a multiplication of brzTheta with the test vector, adding this value to brzPi.

Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks, I will try it out and raise a request for making the variables accessible. An unrelated question, do you think the probability value thus calculated will be a good measure of confidence in prediction? I have been reading mixed opinions about the same. Jatin - Novice Big Data

Re: Spark cluster stability

2014-11-03 Thread jatinpreet
Great! Thanks for the information. I will try it out. - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-cluster-stability-tp17929p17956.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark cluster stability

2014-11-02 Thread jatinpreet
Hi, I am running a small 6-node Spark cluster for testing purposes. Recently, one of the nodes' physical memory was filled up by temporary files and there was no space left on the disk. Due to this, my Spark jobs started failing even though on the Spark Web UI the node was shown as 'Alive'. Once I logged

Serialize/deserialize Naive Bayes model and index files

2014-10-15 Thread jatinpreet
Hi, I am trying to persist the files generated as a result of Naive Bayes training with MLlib. These comprise the model file, label index (own class) and term dictionary (own class). I need to save them to an HDFS location and then deserialize them when needed for prediction. How can I do the same
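The round trip itself is just serializing the three artifacts together and reading them back. The sketch below uses Python's pickle and a local temp file purely to illustrate the shape of the solution; in the actual Java/Scala job one would use Java serialization or `saveAsObjectFile` against the HDFS path, and all names and values here are hypothetical stand-ins.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the three artifacts mentioned in the thread
model = {"log_prior": [-0.5, -1.0], "log_theta": [[-0.3], [-1.2]]}
label_index = {0: "invoice", 1: "report"}
term_dictionary = {"total": 0}

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "nb_model.pkl")   # an HDFS path in the real job

# Serialize all three together so they stay in sync as one unit
with open(path, "wb") as f:
    pickle.dump((model, label_index, term_dictionary), f)

# Deserialize later, at prediction time
with open(path, "rb") as f:
    model2, labels2, terms2 = pickle.load(f)
```

Bundling the model with its label index and term dictionary in one file avoids the failure mode where a model is loaded against a stale dictionary.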

Re: Out of memory exception in MLlib's Naive Bayes classification training

2014-09-24 Thread jatinpreet
Hi, I was able to get the training running in local mode with default settings, there was a problem with document labels which were quite large (not 20 as suggested earlier). I am currently training 175,000 documents on a single node with 2GB of executor memory and 5GB of driver memory

Re: Out of memory exception in MLlib's Naive Bayes classification training

2014-09-23 Thread jatinpreet
Xiangrui, Yes, the total number of terms is 43839. I have also tried running it using different values of parallelism ranging from 1/core to 10/core. I also used multiple configurations like setting spark.storage.memoryFraction and spark.shuffle.memoryFraction to default values. The point to note

Re: Out of memory exception in MLlib's Naive Bayes classification training

2014-09-23 Thread jatinpreet
I get the following stacktrace if it is of any help. 14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set() 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List() 14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7 (MapPartitionsRDD[24] at

Re: Out of memory exception in MLlib's Naive Bayes classification training

2014-09-23 Thread jatinpreet
Xiangrui, Thanks for replying. I am using the subset of newsgroup20 data. I will send you the vectorized data for analysis shortly. I have tried running in local mode as well but I get the same OOM exception. I started with 4GB of data but then moved to smaller set to verify that everything

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-20 Thread jatinpreet
Thanks Xiangrui and RJ for the responses. RJ, I have created a Jira for the same. It would be great if you could look into this. Following is the link to the improvement task, https://issues.apache.org/jira/browse/SPARK-3614 Let me know if I can be of any help and please keep me posted! Thanks,

New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread jatinpreet
Hi, I have been running into memory overflow issues while creating TF-IDF vectors to be used in document classification using MLlib's Naive Bayes classification implementation. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ Memory
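For reference, the TF-IDF computation at the center of this thread is compact. The sketch below is pure Python over toy documents; the smoothed IDF formula `log((n + 1) / (df + 1))` matches what Spark 1.1's new `HashingTF`/`IDF` API computes, but everything else (the documents, the dense dict representation) is illustrative only — MLlib hashes terms into sparse vectors precisely to avoid the memory blow-up described here.

```python
import math

docs = [["spark", "mllib", "naive", "bayes"],
        ["spark", "sql", "join"],
        ["naive", "bayes", "classifier"]]

def tfidf(docs):
    """Return one {term: tf * idf} dict per document."""
    n = len(docs)
    df = {}
    for d in docs:
        for t in set(d):               # document frequency counts each doc once
            df[t] = df.get(t, 0) + 1
    # Smoothed IDF, as in Spark 1.1's IDF: log((n + 1) / (df + 1))
    idf = {t: math.log((n + 1) / (c + 1)) for t, c in df.items()}
    return [{t: d.count(t) * idf[t] for t in set(d)} for d in docs]

vecs = tfidf(docs)
```

Rare terms ("mllib", df = 1) score higher than common ones ("spark", df = 2), which is the weighting the classifier relies on.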

Re: Accuracy hit in classification with Spark

2014-09-15 Thread jatinpreet
Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the same logic in my code, it worked like a charm. Thanks, Jatin - Novice Big Data Programmer -- View this

Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi, I had been using Mahout's Naive Bayes algorithm to classify document data. For a specific train and test set, I was getting accuracy in the range of 86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity of 82%. I am using the same version of Lucene and logic to generate

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi, I tried running the classification program on the famous newsgroup data. This had an even more drastic effect on the accuracy, as it dropped from ~82% in Mahout to ~72% in Spark MLlib. Please help me in this regard as I have to use Spark in a production system very soon and this is a blocker

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Thanks for the information Xiangrui. I am using the following example to classify documents. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ I am not sure if this is the best way to convert textual data into vectors. Can you please confirm

Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
I have also run some tests on the other algorithms available in MLlib but got dismal accuracy. Is the method of creating a LabeledPoint RDD different for other algorithms such as LinearRegressionWithSGD? Any help is appreciated. - Novice Big Data Programmer -- View this message in

Spark on Hadoop with Java 8

2014-08-27 Thread jatinpreet
Hi, I am contemplating the use of Hadoop with Java 8 in a production system. I will be using Apache Spark for doing most of the computations on data stored in HBase. Although Hadoop seems to support JDK 8 with some tweaks, the official HBase site states the following for version 0.98, Running