Hi,
I am having a hard time running an outer join operation on two parquet
datasets. The datasets are large, ~500GB, with a lot of columns, in the
tune of 1,000.
As per the administrator-imposed YARN queue limits, I can have a total of 20
vcores and 8GB of memory per executor.
I specified memory overhead an
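For context on the 8GB-per-executor limit: on YARN, the container Spark requests is the executor memory plus an overhead, which in Spark 1.x defaults to 10% of executor memory with a 384MB floor (assuming `spark.yarn.executor.memoryOverhead` is left unset; the constants below mirror those defaults and are a sketch, not Spark source):

```python
# Sketch of the YARN container sizing arithmetic (Spark 1.x defaults,
# assuming spark.yarn.executor.memoryOverhead is not set explicitly).
MEMORY_OVERHEAD_FACTOR = 0.10  # fraction of executor memory added as overhead
MEMORY_OVERHEAD_MIN_MB = 384   # floor on the overhead

def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Total memory YARN must grant for one executor container."""
    if overhead_mb is None:
        overhead_mb = max(int(MEMORY_OVERHEAD_FACTOR * executor_memory_mb),
                          MEMORY_OVERHEAD_MIN_MB)
    return executor_memory_mb + overhead_mb

# With an 8GB executor, the container actually needs more than 8GB:
print(yarn_container_mb(8192))  # 8192 + 819 = 9011 MB
```

So an 8GB executor will not fit in a queue slot that caps containers at exactly 8GB; either lower the executor memory or raise the queue limit.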
ra/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22>
> so AFAIK no, only random forests and GBTs using entropy or Gini for
> information gain are supported.
>
> On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet wrote:
>
>> Hi,
>>
>> I wish to know if MLli
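For reference, the two impurity measures the reply mentions (entropy and Gini) are simple to state; a plain-Python illustration of the formulas, not MLlib's implementation:

```python
import math

def gini(labels):
    """Gini impurity: 1 - sum(p_c^2) over class frequencies."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    """Shannon entropy: -sum(p_c * log2 p_c) over class frequencies."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(gini([0, 0, 1, 1]))     # 0.5 (maximally impure two-class split)
print(entropy([0, 0, 1, 1]))  # 1.0
```

A tree split is chosen to maximize the drop in impurity (the information gain) between the parent node and its children.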
Hi,
I wish to know if MLlib supports CHAID regression and classification trees.
If yes, how can I build them in spark?
Thanks,
Jatin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I am getting very high GC time in my jobs. For smaller/real-time load, this
becomes a real problem.
Below are the details of a task I just ran. What could be the cause of such
skewed GC times?
36 | 26010 | SUCCESS | PROCESS_LOCAL | 2 / Slave1 | 2015/03/17 11:18:44 | 20 s | 11
er saying there are no actual records). You need
> to ensure your data is more evenly distributed before this step.
>
> On Thu, Feb 19, 2015 at 10:53 AM, jatinpreet wrote:
>
>> Hi,
>>
>> I am running Spark 1.2.1 for compute-intensive jobs comprising multiple
&g
Hi,
I am running Spark 1.2.1 for compute-intensive jobs comprising multiple
tasks. I have observed that most tasks complete very quickly, but there are
always one or two tasks that take a lot of time to complete thereby
increasing the overall stage time. What could be the reason for this?
Foll
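A common cause of such stragglers is key skew: a handful of heavy keys funnel most records into one or two tasks. The reply in the thread suggests distributing the data more evenly; one frequently suggested trick is to "salt" hot keys so their records spread across partitions. The sketch below is plain Python simulating a hash partitioner (the helper names and `salt_buckets` knob are illustrative, not Spark API):

```python
import zlib
from collections import Counter

def part(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(repr(key).encode()) % num_partitions

def partition_counts(keys, num_partitions, salt_buckets=1):
    """Count records per partition; salting turns one hot key into
    `salt_buckets` distinct shuffle keys, spreading its records around."""
    counts = Counter()
    for i, k in enumerate(keys):
        counts[part((k, i % salt_buckets), num_partitions)] += 1
    return counts

keys = ["hot"] * 9000 + ["k%d" % i for i in range(1000)]  # one dominant key

no_salt = partition_counts(keys, 8, salt_buckets=1)
salted = partition_counts(keys, 8, salt_buckets=32)
print(max(no_salt.values()))  # one partition holds all 9000 "hot" rows
print(max(salted.values()))   # the load is spread far more evenly
```

With a salted key, an aggregation has to run in two passes (aggregate per salted key, then merge per original key), which is the usual price of this workaround.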
Hi,
I am using MLlib's Naive Bayes classifier to classify textual data. I am
accessing the posterior probabilities for each class through a hack.
Once I have trained the model, I want to remove documents whose confidence
of classification is low. Say for a document, if the highest class
probabi
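For reference, the "hack" amounts to computing, per class, the score log-prior + theta·x and normalizing with a softmax to get posteriors; a plain-Python sketch with hypothetical model values (not the MLlib API):

```python
import math

def posterior(log_prior, log_cond, x):
    """Per-class posterior for multinomial Naive Bayes:
    softmax over (log pi_c + theta_c . x)."""
    scores = [lp + sum(t * xi for t, xi in zip(row, x))
              for lp, row in zip(log_prior, log_cond)]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 2-class model over 3 term features.
log_prior = [math.log(0.5), math.log(0.5)]
log_cond = [[math.log(0.7), math.log(0.2), math.log(0.1)],
            [math.log(0.1), math.log(0.2), math.log(0.7)]]
p = posterior(log_prior, log_cond, [3, 0, 0])
print(max(p))  # confidence of the argmax class, ~0.997 here
```

Dropping a document then just means comparing `max(p)` against a chosen threshold.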
Hi,
I am using Spark version 1.1 in standalone mode on the cluster. Sometimes,
during Naive Bayes training, I get an OptionalDataException at line,
map at NaiveBayes.scala:109
I am getting following exception on the console,
java.io.OptionalDataException:
java.io.ObjectInputStream.readO
Hi,
I wish to cluster a set of textual documents into an undefined number of
classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
me to give a pre-defined number of classes.
Is there any algorithm which is intelligent enough to identify how many
classes should be made based on
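MLlib's K-means indeed needs k up front; a common workaround is the elbow method: run the clustering for several candidate values of k and pick the point where the within-cluster cost stops dropping sharply. A toy 1-D sketch of the idea (illustrative only, not MLlib code):

```python
def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means; returns the total within-cluster
    squared distance for a given k."""
    # Spread the initial centers across the data order.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]  # three obvious groups
costs = {k: kmeans_1d(data, k) for k in (1, 2, 3, 4)}
# Cost drops sharply up to k=3, then flattens: the "elbow" suggests k=3.
```

Algorithms that infer the number of clusters themselves (e.g. DBSCAN or Dirichlet-process mixtures) were not in MLlib at the time, so this kind of sweep over k is the pragmatic route.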
where
> the dot product comes from.
>
> Your bug is probably something lower-level and simple. I'd debug the
> Spark example and print exactly its values for the log priors and
> conditional probabilities, and the matrix operations, and yours too,
> and see where the diffe
Hi,
I have been running into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category. But the predicted category is mostly different from
the prediction done by MLlib.
I am fetching iterators of the pi
Hi Sean,
The values brzPi and brzTheta are of the form
breeze.linalg.DenseVector. So would I have to convert them back to
simple vectors and use a library to perform addition/multiplication?
If yes, can you please point me to the conversion logic and vector operation
library for Java?
Thanks,
Ja
Hi,
I am trying to access the posterior probability of Naive Bayes prediction
with MLlib using Java. As the member variables brzPi and brzTheta are
private, I applied a hack to access the values through reflection.
I am using Java and couldn't find a way to use the Breeze library from it.
If I
Thanks Arush! Your example is nice and easy to understand. I am implementing
it through Java though.
Jatin
-
Novice Big Data Programmer
--
Thanks Sean, I was actually using instances created elsewhere inside my RDD
transformations, which as I understand is against the Spark programming model. I
was referred to a talk about UIMA and Spark integration from this year's
Spark summit, which had a workaround for this problem. I just had to make
Hi,
I am planning to use the UIMA library to process data in my RDDs. I have had
bad experiences using third-party libraries inside worker tasks: the system
gets plagued with serialization issues. And as UIMA classes are not
necessarily Serializable, I am not sure it will work.
Please expla
I believe assuming uniform priors is the way to go for my use case.
I am not sure about how to 'drop the prior term' with MLlib. I am just
providing the samples as they come after creating term vectors for each
sample. But I guess I can Google that information.
I appreciate all the help. Spark
Sean,
My last sentence didn't come out right. Let me try to explain my question
again.
For instance, I have two categories, C1 and C2. I have trained 100 samples
for C1 and 10 samples for C2.
Now, I predict two samples one each of C1 and C2, namely S1 and S2
respectively. I get the following pre
Thanks a lot Sean. You are correct in assuming that my examples fall under a
single category.
It is interesting to see that the posterior probability can actually be
treated as something that is stable enough to have a constant threshold
value on per class basis. It would, I assume, keep changing
I have been trying the Naive Bayes implementation of Spark's MLlib. During
the testing phase, I wish to eliminate data with low confidence of prediction.
My data set primarily consists of form based documents like reports and
application forms. They contain key-value pair type text and hence I assume
Thanks, I will try it out and raise a request for making the variables
accessible.
An unrelated question, do you think the probability value thus calculated
will be a good measure of confidence in prediction? I have been reading
mixed opinions about the same.
Jatin
-
Novice Big Data Progra
Thanks for the answer. The variables brzPi and brzTheta are declared private.
I am writing my code in Java; otherwise I could have replicated the Scala
class and performed the desired computation, which is, as I observed, a
multiplication of brzTheta with the test vector, adding this value to brzPi.
A
Hi,
Is there a way to get the confidence value of a prediction with MLlib's
implementation of Naive Bayes classification? I wish to eliminate the
samples that were classified with low confidence.
Thanks,
Jatin
-
Novice Big Data Programmer
--
Great! Thanks for the information. I will try it out.
-
Novice Big Data Programmer
--
--
Hi,
I am running a small 6-node Spark cluster for testing purposes. Recently,
one of the nodes' disks was filled up by temporary files and there was no
space left on the device. Due to this my Spark jobs started failing, even
though on the Spark Web UI the node was shown as 'Alive'. Once I logged o
Hi,
I am trying to persist the files generated as a result of Naive Bayes
training with MLlib. These comprise the model file, a label index (own
class), and a term dictionary (own class). I need to save them to an HDFS
location and then deserialize when needed for prediction.
How can I do the same wi
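Spark 1.1's NaiveBayesModel has no built-in save method (model save/load only arrived in later releases), so a common workaround is to serialize the model objects yourself to a file on HDFS and deserialize them before prediction. The sketch below illustrates the round trip in Python with pickle on a hypothetical stand-in bundle; with the JVM classes, standard Java serialization to an HDFS output stream plays the same role:

```python
import os
import pickle
import tempfile

# Stand-in for what the mail describes: model parameters plus the
# label index and term dictionary (user-defined classes there).
# All values are made up for illustration.
bundle = {
    "labels": [0.0, 1.0],
    "pi": [-0.69, -0.69],                   # log priors
    "theta": [[-0.5, -1.2], [-1.2, -0.5]],  # log conditional probs
    "label_index": {"sports": 0, "politics": 1},
    "term_dictionary": {"goal": 0, "vote": 1},
}

path = os.path.join(tempfile.mkdtemp(), "nb_model.bin")
with open(path, "wb") as f:
    pickle.dump(bundle, f)        # serialize once, after training

with open(path, "rb") as f:
    restored = pickle.load(f)     # deserialize when prediction is needed
print(restored["label_index"]["politics"])  # 1
```

The one requirement this imposes on the JVM side is that the label-index and dictionary classes implement Serializable.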
Hi,
I was able to get the training running in local mode with default settings;
there was a problem with the document labels, which were quite large (not 20
as suggested earlier).
I am currently training 175000 documents on a single node with 2GB of
executor memory and 5GB of driver memory successfull
Xiangrui, Thanks for replying.
I am using the subset of newsgroup20 data. I will send you the vectorized
data for analysis shortly.
I have tried running in local mode as well but I get the same OOM exception.
I started with 4GB of data but then moved to smaller set to verify that
everything was
I get the following stacktrace if it is of any help.
14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7:
List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7
(MapPartitionsRDD[24] at combineByK
Xiangrui,
Yes, the total number of terms is 43839. I have also tried running it using
different values of parallelism ranging from 1/core to 10/core. I also used
multiple configurations like setting spark.storage.memoryFraction and
spark.shuffle.memoryFraction to default values. The point to note
Hi,
I have been facing an unusual issue with Naive Bayes training. I run out of
heap space even with limited data during the training phase. I am trying to
run the same on a rudimentary cluster of two development machines in
standalone mode. I am reading data from an HBase table, converting them in
Thanks Xiangrui and RJ for the responses.
RJ, I have created a Jira for the same. It would be great if you could look
into this. Following is the link to the improvement task,
https://issues.apache.org/jira/browse/SPARK-3614
Let me know if I can be of any help and please keep me posted!
Thanks,
J
Hi,
I have been running into memory overflow issues while creating TFIDF vectors
to be used in document classification with MLlib's Naive Bayes
implementation.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
Memory overfl
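One way to keep TF-IDF vector construction from blowing up memory is to fix the dimensionality up front with feature hashing, which is the idea behind MLlib's HashingTF. A plain-Python sketch of the term-frequency half (illustrative, not MLlib code; the helper name and default size are assumptions):

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=1 << 12):
    """Term-frequency vector of fixed size: each term is hashed into one
    of `num_features` buckets, so memory stays bounded no matter how
    large the vocabulary grows (the idea behind MLlib's HashingTF)."""
    tf = Counter(zlib.crc32(t.encode()) % num_features for t in tokens)
    return dict(tf)  # sparse representation: bucket index -> count

doc = "spark mllib naive bayes spark".split()
vec = hashing_tf(doc)
print(sum(vec.values()))  # 5: total term count is preserved
```

The trade-off is hash collisions (two terms sharing a bucket), which in practice matters little for classification at reasonable `num_features`; the IDF weighting step is then computed over these fixed-size vectors.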
Hi,
I have been able to get the same accuracy with MLlib as Mahout's. The
pre-processing phase of Mahout was the reason behind the accuracy mismatch.
After studying and applying the same logic in my code, it worked like a
charm.
Thanks,
Jatin
-
Novice Big Data Programmer
--
I have also run some tests on the other algorithms available with MLlib but
got dismal accuracy. Is the method of creating LabeledPoint RDD different
for other algorithms such as, LinearRegressionWithSGD?
Any help is appreciated.
-
Novice Big Data Programmer
--
Thanks for the information Xiangrui. I am using the following example to
classify documents.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
I am not sure if this is the best way to convert textual data into vectors.
Can you please confirm
Hi,
I tried running the classification program on the famous newsgroup data.
This had an even more drastic effect on the accuracy, as it dropped from
~82% in Mahout to ~72% in Spark MLlib.
Please help me in this regard as I have to use Spark in a production system
very soon and this is a blocker
Hi,
I had been using Mahout's Naive Bayes algorithm to classify document data.
For a specific train and test set, I was getting accuracy in the range of
86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity
of 82%.
I am using same version of Lucene and logic to generate TFIDF
Hi,
I am contemplating the use of Hadoop with Java 8 in a production system. I
will be using Apache Spark for doing most of the computations on data stored
in HBase.
Although Hadoop seems to support JDK 8 with some tweaks, the official HBase
site states the following for version 0.98,
Running wi