Spark SQL : Join operation failure

2017-02-21 Thread jatinpreet
Hi,

I am having a hard time running an outer join operation on two Parquet
datasets. The datasets are large, ~500GB, with a lot of columns, on the
order of 1,000.

As per the YARN limits imposed by the administrator on the queue, I can have
a total of 20 vcores and 8GB of memory per executor.

I specified the memory overhead and increased the number of shuffle
partitions to no avail. This is how I submitted the job with PySpark:

spark-submit --master yarn-cluster --executor-memory 5500m --num-executors
19 --executor-cores 1 --conf spark.yarn.executor.memoryOverhead=2000 --conf
spark.sql.shuffle.partitions=2048 --driver-memory 7g --queue
./

The relevant code is, 

cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
joined_df.write.save("/user/data/output")
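
For what it's worth, the same join can also be expressed through the DataFrame
API; a minimal sketch (column names taken from the SQL above, with "outer"
meaning a full outer join, and the column pruning purely illustrative):

    # Hedged sketch: the same full outer join via the DataFrame API
    sqlCtx.setConf("spark.sql.shuffle.partitions", "2048")

    joined_df = cm_go.join(ko, cm_go["field1"] == ko["field2"], "outer")

    # Selecting only the columns needed downstream (names hypothetical) before
    # the join would shrink the shuffled rows and the executor memory pressure.
    joined_df.write.save("/user/data/output")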


I am getting errors like these:

ExecutorLostFailure (executor 5 exited caused by one of the running tasks)
Reason: Container marked as failed:
container_e36_1487531133522_0058_01_06 on host: dn2.bigdatalab.org. Exit
status: 52. Diagnostics: Exception from container-launch.
Container id: container_e36_1487531133522_0058_01_06
Exit code: 52
Stack trace: ExitCodeException exitCode=52: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 52

--

FetchFailed(null, shuffleId=0, mapId=-1, reduceId=508, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:695)
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:691)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:691)
at
org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:145)
at
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

)



I would appreciate it if someone could help me out with this.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-operation-failure-tp28414.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



CHAID Decision Trees

2015-08-25 Thread jatinpreet
Hi,

I wish to know if MLlib supports CHAID regression and classification trees.
If yes, how can I build them in Spark?

Thanks,
Jatin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CHAID Decision Trees

2015-08-25 Thread Jatinpreet Singh
Hi Feynman,

Thanks for the information. Is there a way to visualize a decision tree built
on a large amount of data using any other technique/library?

Thanks,
Jatin

On Tue, Aug 25, 2015 at 11:42 PM, Feynman Liang fli...@databricks.com
wrote:

 Nothing is in JIRA
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22
 so AFAIK no, only random forests and GBTs using entropy or Gini for
 information gain are supported.
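
For completeness, a minimal sketch of the supported alternative, using the
Python API's decision tree with entropy impurity (training is assumed to be
an RDD of LabeledPoint):

    from pyspark.mllib.tree import DecisionTree

    # training: assumed RDD[LabeledPoint]; "entropy" or "gini" are the impurity
    # measures MLlib's tree implementations support for information gain.
    model = DecisionTree.trainClassifier(
        training,
        numClasses=2,
        categoricalFeaturesInfo={},   # empty map: treat all features as continuous
        impurity="entropy",
        maxDepth=5,
        maxBins=32)

    print(model.toDebugString())      # textual rendering of the learned tree

The toDebugString() output at least gives a textual dump of the tree that
could be fed into an external visualization tool.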

 On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet jatinpr...@gmail.com wrote:

 Hi,

 I wish to know if MLlib supports CHAID regression and classification trees.
 If yes, how can I build them in Spark?

 Thanks,
 Jatin



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
Regards,
Jatinpreet Singh


High GC time

2015-03-17 Thread jatinpreet
Hi,

I am getting very high GC times in my jobs. For smaller/real-time loads, this
becomes a real problem.

Below are the details of some tasks from a stage I just ran. What could be
the cause of such skewed GC times?

36  2601  0  SUCCESS  PROCESS_LOCAL  2 / Slave1  2015/03/17 11:18:44  20 s  11 s   132.7 KB  135.8 KB
37  2602  0  SUCCESS  PROCESS_LOCAL  2 / Slave1  2015/03/17 11:18:44  15 s  11 s   79.4 KB   82.5 KB
38  2603  0  SUCCESS  PROCESS_LOCAL  1 / Slave2  2015/03/17 11:18:44  2 s   0.7 s  0.0 B     37.8 KB
39  2604  0  SUCCESS  PROCESS_LOCAL  0 / slave3  2015/03/17 11:18:45  21 s  18 s   77.9 KB   79.8 KB
40  2605  0  SUCCESS  PROCESS_LOCAL  2 / Slave1  2015/03/17 11:18:45  14 s  10 s   73.0 KB   74.9 KB
41  2606  0  SUCCESS  PROCESS_LOCAL  2 / Slave1  2015/03/17 11:18:45  14 s  10 s   74.4 KB   76.5 KB
42  2607  0  SUCCESS  PROCESS_LOCAL  0 / Slave3  2015/03/17 11:18:45  12 s  12 s   10.9 KB   12.8 KB

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/High-GC-time-tp22104.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Some tasks taking too much time to complete in a stage

2015-02-19 Thread Jatinpreet Singh
Hi Imran,

Thanks for pointing that out. My data comes from the HBase connector for
Spark, so I do not govern the distribution of the data myself; HBase decides
which region servers the data lands on. Is there a way to distribute the data
more evenly? I am especially interested in running even small loads very
quickly, apart from bulk loads.
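
For concreteness, a minimal sketch of the kind of explicit rebalancing I am
asking about (hbase_rdd and some_transformation are placeholder names; the
RDD actually comes from the HBase connector):

    # hbase_rdd: placeholder for whatever RDD the HBase connector returns.
    # repartition() forces a shuffle that spreads records evenly across
    # partitions, at the cost of that extra shuffle.
    balanced = hbase_rdd.repartition(48)          # e.g. 2-4 partitions per core

    # downstream stages then see roughly equal-sized tasks
    result = balanced.map(some_transformation)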

Thanks,
Jatin

On Thu, Feb 19, 2015 at 10:28 PM, Imran Rashid iras...@cloudera.com wrote:

 almost all your data is going to one task.  You can see that the shuffle
 read for task 0 is 153.3 KB, and for most other tasks its just 26B (which
 is probably just some header saying there are no actual records).  You need
 to ensure your data is more evenly distributed before this step.

 On Thu, Feb 19, 2015 at 10:53 AM, jatinpreet jatinpr...@gmail.com wrote:

 Hi,

 I am running Spark 1.2.1 for compute intensive jobs comprising of multiple
 tasks. I have observed that most tasks complete very quickly, but there
 are
 always one or two tasks that take a lot of time to complete thereby
 increasing the overall stage time. What could be the reason for this?

 Following are the statistics for one such stage. As you can see, the task
 with index 0 takes 1.1 minutes whereas others completed much more quickly.

 Aggregated Metrics by Executor
 Executor ID | Address          | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input | Output | Shuffle Read | Shuffle Write | Shuffle Spill (Memory) | Shuffle Spill (Disk)
 0           | slave1:56311     | 46 s      | 13          | 0            | 13              | 0.0 B | 0.0 B  | 0.0 B        | 0.0 B         | 0.0 B                  | 0.0 B
 1           | slave2:42648     | 2.1 min   | 13          | 0            | 13              | 0.0 B | 0.0 B  | 384.3 KB     | 0.0 B         | 0.0 B                  | 0.0 B
 2           | slave3:44322     | 23 s      | 12          | 0            | 12              | 0.0 B | 0.0 B  | 136.4 KB     | 0.0 B         | 0.0 B                  | 0.0 B
 3           | slave4:37987     | 44 s      | 12          | 0            | 12              | 0.0 B | 0.0 B  | 213.9 KB     | 0.0 B         | 0.0 B                  | 0.0 B
 Tasks
 Index | ID  | Attempt | Status  | Locality Level | Executor ID / Host | Launch Time         | Duration | GC Time | Shuffle Read | Errors
 0     | 213 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 1.1 min  | 1 s     | 153.3 KB     |
 5     | 218 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 23 ms    |         | 26.0 B       |
 1     | 214 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 2 s      | 0.9 s   | 13.8 KB      |
 4     | 217 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 26 ms    |         | 26.0 B       |
 3     | 216 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 11 ms    |         | 0.0 B        |
 2     | 215 | 0       | SUCCESS | PROCESS_LOCAL  | 2 / slave3         | 2015/02/19 11:40:05 | 27 ms    |         | 26.0 B       |
 7     | 220 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 11 ms    |         | 0.0 B        |
 10    | 223 | 0       | SUCCESS | PROCESS_LOCAL  | 2 / slave3         | 2015/02/19 11:40:05 | 23 ms    |         | 26.0 B       |
 6     | 219 | 0       | SUCCESS | PROCESS_LOCAL  | 2 / slave3         | 2015/02/19 11:40:05 | 23 ms    |         | 26.0 B       |
 9     | 222 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 23 ms    |         | 26.0 B       |
 8     | 221 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 23 ms    |         | 26.0 B       |
 11    | 224 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 10 ms    |         | 0.0 B        |
 14    | 227 | 0       | SUCCESS | PROCESS_LOCAL  | 2 / slave3         | 2015/02/19 11:40:05 | 24 ms    |         | 26.0 B       |
 13    | 226 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 23 ms    |         | 26.0 B       |
 16    | 229 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 22 ms    |         | 26.0 B       |
 12    | 225 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 22 ms    |         | 26.0 B       |
 15    | 228 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 10 ms    |         | 0.0 B        |
 17    | 230 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 22 ms    |         | 26.0 B       |
 23    | 236 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 10 ms    |         | 0.0 B        |
 22    | 235 | 0       | SUCCESS | PROCESS_LOCAL  | 2 / slave3         | 2015/02/19 11:40:05 | 21 ms    |         | 26.0 B       |
 19    | 232 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 10 ms    |         | 0.0 B        |
 21    | 234 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 25 ms    |         | 26.0 B       |
 18    | 231 | 0       | SUCCESS | PROCESS_LOCAL  | 2 / slave3         | 2015/02/19 11:40:05 | 24 ms    |         | 26.0 B       |
 20    | 233 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 28 ms    |         | 26.0 B       |
 25    | 238 | 0       | SUCCESS | PROCESS_LOCAL  | 3 / slave4         | 2015/02/19 11:40:05 | 20 ms    |         | 26.0 B       |
 28    | 241 | 0       | SUCCESS | PROCESS_LOCAL  | 1 / slave2         | 2015/02/19 11:40:05 | 27 ms    |         | 26.0 B       |
 27    | 240 | 0       | SUCCESS | PROCESS_LOCAL  | 0 / slave1         | 2015/02/19 11:40:05 | 10 ms    |         | 0.0 B        |


 Thanks




OptionalDataException during Naive Bayes Training

2015-01-09 Thread jatinpreet
Hi,

I am using Spark version 1.1 in standalone mode on the cluster. Sometimes,
during Naive Bayes training, I get an OptionalDataException at the line

map at NaiveBayes.scala:109

I am getting the following exception on the console:

java.io.OptionalDataException:
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1371)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        java.util.HashMap.readObject(HashMap.java:1394)
        sun.reflect.GeneratedMethodAccessor626.invoke(Unknown Source)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:483)
        java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

What could be the reason behind this?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OptionalDataException-during-Naive-Bayes-Training-tp21059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Clustering text data with MLlib

2014-12-29 Thread jatinpreet
Hi,

I wish to cluster a set of textual documents into an undefined number of
classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
me to give a pre-defined number of clusters.

Is there any algorithm intelligent enough to identify how many clusters
should be formed based on the input documents? I want to utilize the speed
and agility of Spark in the process.
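
For concreteness, a minimal sketch of the usual workaround with MLlib's
K-means: sweep over k and look for an elbow in the within-set sum of squared
errors (tfidf_vectors is a hypothetical RDD of document feature vectors):

    from math import sqrt
    from pyspark.mllib.clustering import KMeans

    # tfidf_vectors: hypothetical RDD of document feature vectors; densified
    # to numpy arrays here only to keep the distance arithmetic simple.
    points = tfidf_vectors.map(lambda v: v.toArray()).cache()

    def wssse(model, data):
        # within-set sum of squared errors, as in the MLlib clustering example
        def error(point):
            center = model.centers[model.predict(point)]
            return sqrt(sum([x ** 2 for x in (point - center)]))
        return data.map(error).reduce(lambda a, b: a + b)

    for k in range(2, 21):
        model = KMeans.train(points, k, maxIterations=20)
        print(k, wssse(model, points))   # pick k where the curve flattens out

Densifying high-dimensional TF-IDF vectors is expensive, so this is only a
sketch of the idea, not something to run as-is on a large corpus.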

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Clustering-text-data-with-MLlib-tp20883.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accessing posterior probability of Naive Baye's prediction

2014-11-28 Thread jatinpreet
Thanks Sean, it did turn out to be a simple mistake after all. I appreciate
your help.

Jatin

On Thu, Nov 27, 2014 at 7:52 PM, sowen [via Apache Spark User List] 
ml-node+s1001560n19975...@n3.nabble.com wrote:

 No, the feature vector is not converted. It contains count n_i of how
 often each term t_i occurs (or a TF-IDF transformation of those). You
 are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
 maximized.

 In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...

 So your n_1 counts (or TF-IDF values) are used as-is and this is where
 the dot product comes from.

 Your bug is probably something lower-level and simple. I'd debug the
 Spark example and print exactly its values for the log priors and
 conditional probabilities, and the matrix operations, and yours too,
 and see where the difference is.
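
 For example, with two classes and five terms the whole scoring step fits in
 a few lines of numpy (toy numbers only, just to make the log-space
 arithmetic concrete):

     import numpy as np

     # toy numbers: 2 classes, 5 terms
     log_pi = np.log(np.array([0.5, 0.5]))                     # log(P(c))
     log_theta = np.log(np.array([[0.1, 0.2, 0.3, 0.2, 0.2],   # log(P(t_i|c)) per class
                                  [0.3, 0.1, 0.1, 0.4, 0.1]]))
     x = np.array([1.0, 2.0, 0.0, 1.0, 3.0])   # counts or TF-IDF values, used as-is

     scores = log_pi + log_theta.dot(x)        # log(P(c)) + sum_i n_i * log(P(t_i|c))
     predicted = int(np.argmax(scores))        # the same argmax MLlib takes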

 On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19975i=0 wrote:

  Hi,
 
  I have been running through some troubles while converting the code to
 Java.
  I have done the matrix operations as directed and tried to find the
 maximum
  score for each category. But the predicted category is mostly different
 from
  the prediction done by MLlib.
 
  I am fetching iterators of the pi, theta and testData to do my
 calculations.
  pi and theta are in  log space while my testData vector is not, could
 that
  be a problem because I didn't see explicit conversion in Mllib also?
 
  For example, for two categories and 5 features, I am doing the following
  operation,
 
  [1,2] + [1 2 3 4 5  ] * [1,2,3,4,5]
 [6 7 8 9 10]
  These are simple element-wise matrix multiplication and addition
 operators.

 -
 To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19975i=1
 For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19975i=2







-- 
Regards,
Jatinpreet Singh




-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p20011.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-27 Thread jatinpreet
Hi,

I have been running into some trouble while converting the code to Java. I
have done the matrix operations as directed and tried to find the maximum
score for each category, but the predicted category is mostly different from
the prediction done by MLlib.

I am fetching iterators over pi, theta and testData to do my calculations. pi
and theta are in log space while my testData vector is not; could that be a
problem? I didn't see an explicit conversion in MLlib either.

For example, for two categories and 5 features, I am doing the following
operation,

[1,2] + [1 2 3 4 5  ] * [1,2,3,4,5]
   [6 7 8 9 10]
These are simple element-wise matrix multiplication and addition operators.

Following is the code,

        Iterator<Tuple2<Object, Object>> piIterator = piValue.iterator();
        Iterator<Tuple2<Tuple2<Object, Object>, Object>> thetaIterator = thetaValue.iterator();
        Iterator<Tuple2<Object, Object>> testDataIterator = null;

        double[] scores = new double[piValue.size()];
        while (piIterator.hasNext()) {
            double score = 0.0;
            // reset to index 0
            testDataIterator = testData.toBreeze().iterator();

            while (testDataIterator.hasNext()) {
                Tuple2<Object, Object> testTuple = testDataIterator.next();
                Tuple2<Tuple2<Object, Object>, Object> thetaTuple = thetaIterator.next();

                score += ((double) testTuple._2() * (double) thetaTuple._2());
            }

            Tuple2<Object, Object> piTuple = piIterator.next();
            score += (double) piTuple._2();
            scores[(int) piTuple._1()] = score;
            if (maxScore < score) {
                predictedCategory = (int) piTuple._1();
                maxScore = score;
            }
        }


Where am I going wrong?

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p19968.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accessing posterior probability of Naive Baye's prediction

2014-11-26 Thread jatinpreet
Hi Sean,

The values brzPi and brzTheta are of the form breeze.linalg.DenseVector<Double>.
So would I have to convert them back to simple vectors and use a library to
perform the addition/multiplication?

If yes, can you please point me to the conversion logic and a vector-operation
library for Java?

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p19858.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Accessing posterior probability of Naive Baye's prediction

2014-11-25 Thread jatinpreet
Hi,

I am trying to access the posterior probability of Naive Bayes predictions
with MLlib using Java. As the member variables brzPi and brzTheta are
private, I applied a hack to access the values through reflection.

I am using Java and couldn't find a way to use the breeze library from Java.
If I am correct, the relevant calculation is on line 66 of the NaiveBayesModel
class:

labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))

Here the element-wise addition and multiplication of DenseVectors are written
as operators, which are not directly accessible from Java. Also, the use of
brzArgmax is not very clear to me from Java.

Can anyone please help me convert the above calculation from Scala to Java?
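
For comparison, in the Python API the trained model exposes these values
directly, so the same calculation can be sketched without any reflection (a
minimal sketch, assuming the 1.x Python API where labels, pi and theta are
numpy attributes; training and test_vector are hypothetical):

    import numpy as np
    from pyspark.mllib.classification import NaiveBayes

    # training: hypothetical RDD[LabeledPoint]
    # test_vector: hypothetical 1-D numpy array of term counts / TF-IDF values
    model = NaiveBayes.train(training, 1.0)

    # model.pi holds the log class priors, model.theta the log conditional
    # probabilities (one row per class), model.labels the class labels.
    scores = model.pi + model.theta.dot(test_vector)   # log-space class scores
    predicted_label = model.labels[np.argmax(scores)]  # i.e. labels(brzArgmax(...))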

PS: I have raised an improvement request on JIRA for making these variables
directly accessible from outside.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark serialization issues with third-party libraries

2014-11-24 Thread jatinpreet
Thanks Arush! Your example is nice and easy to understand. I am implementing
it in Java though.

Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454p19624.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark serialization issues with third-party libraries

2014-11-23 Thread jatinpreet
Thanks Sean. I was actually using instances created elsewhere inside my RDD
transformations, which, as I understand it, is against the Spark programming
model. I was referred to a talk about UIMA and Spark integration from this
year's Spark Summit, which had a workaround for this problem: I just had to
make some class members transient.

http://spark-summit.org/2014/talk/leveraging-uima-in-spark
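
The general pattern, sketched here in PySpark for brevity
(build_analysis_engine and annotate are placeholders for the actual UIMA setup
and processing), is to construct the non-serializable object inside the task
rather than capturing it in the closure:

    def process_partition(records):
        # Build the heavy, non-serializable object once per partition, inside
        # the task, instead of shipping it from the driver.
        engine = build_analysis_engine()      # placeholder for the UIMA setup
        for record in records:
            yield annotate(engine, record)    # placeholder for the per-record work

    results = documents.mapPartitions(process_partition)   # documents: input RDD

On the Java/Scala side, marking such members transient (as in the talk) keeps
them out of the serialized closure so they get re-created lazily on the worker.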

Thanks



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454p19589.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Naive Baye's classification confidence

2014-11-20 Thread jatinpreet
Thanks a lot Sean. You are correct in assuming that my examples fall under a
single category.

It is interesting that the posterior probability can be treated as something
stable enough to have a constant threshold per class. I assume it would keep
changing for a sample as I add or remove documents in the training set, and
thus warrant a corresponding change in the threshold.

Also, I have seen the class prediction probabilities range from 0.003 to 0.8
for correct classifications in my sample data. This is a wide spectrum, so is
there a way to change that? Maybe by replicating the samples for the classes
where I get low-confidence but accurate classifications.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Baye-s-classification-confidence-tp19341p19358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Naive Baye's classification confidence

2014-11-20 Thread jatinpreet
Sean,

My last sentence didn't come out right. Let me try to explain my question
again.

For instance, I have two categories, C1 and C2. I have trained 100 samples
for C1 and 10 samples for C2.

Now I predict two samples, one each of C1 and C2, namely S1 and S2
respectively. I get the following prediction results:

S1 = Category: C1, Probability: 0.7
S2 = Category: C2, Probability: 0.04

Both predictions are correct, but their probabilities are far apart. Can I
improve the prediction probability by taking the 10 samples I have of C2 and
replicating each of them 10 times, making the total count 100, the same as C1?

Can I expect this to increase the probability of sample S2 after training on
the new set? Is this a viable approach?

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Baye-s-classification-confidence-tp19341p19366.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Naive Baye's classification confidence

2014-11-20 Thread jatinpreet
I believe assuming uniform priors is the way to go for my use case.

I am not sure how to 'drop the prior term' with MLlib. I am just providing
the samples as they come, after creating term vectors for each sample. But I
guess I can Google that information.
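
For what it's worth, a minimal sketch of what 'dropping the prior term'
amounts to, assuming (as in the earlier threads) that the model's log priors
and log likelihoods are accessible as numpy arrays on the Python side, with x
a hypothetical term-count/TF-IDF vector:

    import numpy as np

    # model: trained NaiveBayesModel (Python API); x: 1-D numpy feature vector
    with_prior = model.pi + model.theta.dot(x)   # standard score: log prior + log likelihood
    without_prior = model.theta.dot(x)           # uniform priors: simply drop log(P(c))

    predicted = model.labels[np.argmax(without_prior)]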

I appreciate all the help. Spark community is amazing! 



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Baye-s-classification-confidence-tp19341p19370.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark serialization issues with third-party libraries

2014-11-20 Thread jatinpreet
Hi,

I am planning to use the UIMA library to process data in my RDDs. I have had
bad experiences while using third-party libraries inside worker tasks; the
system gets plagued with serialization issues. As UIMA classes are not
necessarily Serializable, I am not sure whether it will work.

Please explain which classes need to be Serializable and which of them can be
left as they are. A clear understanding would help me a lot.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Naive Baye's classification confidence

2014-11-19 Thread jatinpreet
I have been trying the Naive Bayes implementation in Spark's MLlib. During
the testing phase, I wish to eliminate data with a low confidence of
prediction.

My data set primarily consists of form-based documents like reports and
application forms. They contain key-value pair type text, and hence I assume
the independence assumption holds better than with natural language.

About the quality of priors, I am not doing anything special. I am training
more or less equal numbers of samples for each class and have left the heavy
lifting to MLlib.

Given these facts, does it make sense to have confidence thresholds defined
for each category, above which I will get correct results consistently?

Thanks
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Naive-Baye-s-classification-confidence-tp19341.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks for the answer. The variables brzPi and brzTheta are declared private.
I am writing my code in Java; otherwise I could have replicated the Scala
class and performed the desired computation, which is, as I observed, a
multiplication of brzTheta with the test vector, with brzPi added to the
result.

Any suggestions for a way out, other than replicating the whole functionality
of the Naive Bayes model in Java? That would be a time-consuming process.



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Naive-Bayes-classifier-confidence-tp18456p18472.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks, I will try it out and raise a request for making the variables
accessible.

An unrelated question: do you think the probability value calculated this way
will be a good measure of confidence in the prediction? I have been reading
mixed opinions about this.

Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Naive-Bayes-classifier-confidence-tp18456p18497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark cluster stability

2014-11-03 Thread jatinpreet
Great! Thanks for the information. I will try it out.



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-cluster-stability-tp17929p17956.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark cluster stability

2014-11-02 Thread jatinpreet
Hi,

I am running a small 6-node Spark cluster for testing purposes. Recently, the
disk of one of the nodes was filled up by temporary files and there was no
space left on it. Due to this, my Spark jobs started failing even though the
node was shown as 'Alive' on the Spark Web UI. Once I logged on to the
machine and cleaned up some of the files, I was able to run the jobs again.

My question is, how reliable can my Spark cluster be if issues like these can
bring down my jobs? I would have expected Spark to not use this node, or at
least to distribute the work to other nodes. But as the node was still alive,
it tried to run tasks on it regardless.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-cluster-stability-tp17929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Serialize/deserialize Naive Bayes model and index files

2014-10-15 Thread jatinpreet
Hi,

I am trying to persist the files generated as a result of Naive Bayes
training with MLlib. These comprise the model file, a label index (own class)
and a term dictionary (own class). I need to save them to an HDFS location
and then deserialize them when needed for prediction.

How can I do this with Spark? Also, I have the option of saving these
instances in HBase in binary form. Which approach makes more sense?
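
For concreteness, a minimal sketch of one option, written against the Python
API purely for brevity (it assumes the model's labels/pi/theta are accessible
as numpy arrays, and that label_index and term_dictionary are picklable
instances of my own classes):

    from pyspark.mllib.classification import NaiveBayesModel

    # save: bundle everything into one picklable object and write it to HDFS
    bundle = {
        "labels": model.labels,
        "pi": model.pi,
        "theta": model.theta,
        "label_index": label_index,
        "term_dictionary": term_dictionary,
    }
    sc.parallelize([bundle], 1).saveAsPickleFile("hdfs:///models/naive_bayes")

    # load: read the bundle back and rebuild the model for prediction
    restored = sc.pickleFile("hdfs:///models/naive_bayes").first()
    model2 = NaiveBayesModel(restored["labels"], restored["pi"], restored["theta"])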

Thanks,
Jatin





-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Serialize-deserialize-Naive-Bayes-model-and-index-files-tp16513.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-24 Thread jatinpreet
Hi,

I was able to get the training running in local mode with default settings;
there was a problem with the document labels, which were quite large (not 20
as suggested earlier).

I am currently training 175,000 documents on a single node with 2GB of
executor memory and 5GB of driver memory successfully. If I increase the
number of documents, I get the OOM error. I wish to understand where the
bottlenecks generally are for Naive Bayes: the executor or the driver memory?
Also, what are the things to keep in mind while training huge sets of data,
so that I can have a bulletproof classification system? Slowing down in case
of low memory is fine, but not exceptions.

As a side note, is there any classification algorithm in MLlib which can just
append new training data to an existing model? With Naive Bayes, I need to
have all the data available at once for training.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-exception-in-MLlib-s-naive-baye-s-classification-training-tp14809p15052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, 

Yes, the total number of terms is 43,839. I have also tried running it with
different values of parallelism, ranging from 1 per core to 10 per core. I
also tried multiple configurations, such as setting spark.storage.memoryFraction
and spark.shuffle.memoryFraction to their default values. The point to note
here is that I am not using caching or persisting the RDDs, and therefore I
set the storage fraction to 0.
The driver data available under the Executors tab is as follows, for 3GB of
allocated memory:

Memory: 0.0 B Used (1781.8 MB Total)
Disk: 0.0 B Used

Executor ID | Address          | RDD Blocks | Memory Used       | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Shuffle Read | Shuffle Write
driver      | ephesoft29:59494 | 0          | 0.0 B / 1781.8 MB | 0.0 B     | 1            | 0            | 4              | 5           | 19.3 s    | 0.0 B        | 27.5 MB


The memory used value is always 0 for the driver. Is there something fishy here?

The out of memory exception occurs in NaiveBayes.scala at combineByKey (line
91) or collect (line 96), depending on the heap size allocated. In the memory
profiler, the program runs fine until the TF-IDF creation, but when training
starts, the memory usage goes up until the point of failure.

I want to understand whether the OOM exception is occurring on the driver or
on a worker node. It should not be a worker node because, as I understand it,
Spark automatically spills data from memory to disk if the available memory
is not adequate. Then why do I get these errors at all? If it is the driver,
how do I calculate the total memory requirements? Needing 3-4 GB of RAM for
training approximately 13 MB of training data with 43,839 terms is
preposterous.

My expectation was that with Spark, if the memory is available, training
would be much faster than Mahout, and if enough memory is not available, it
would only be slower, not throw exceptions. Mahout ran fine with much larger
data, and it too had to collect a lot of data on a single node during
training.

Maybe I am not getting the point here due to my limited knowledge of Spark.
Please help me out with this and point me in the right direction.

Thanks,
Jatin




-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-exception-in-MLlib-s-naive-baye-s-classification-training-tp14809p14879.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
I am getting the following stack trace, in case it is of any help.

14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7:
List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7
(MapPartitionsRDD[24] at combineByKey at NaiveBayes.scala:91), which is now
runnable
14/09/23 15:46:02 INFO executor.Executor: Finished task ID 7
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks
from Stage 7 (MapPartitionsRDD[24] at combineByKey at NaiveBayes.scala:91)
14/09/23 15:46:02 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with
1 tasks
14/09/23 15:46:02 INFO scheduler.TaskSetManager: Starting task 7.0:0 as TID
8 on executor localhost: localhost (PROCESS_LOCAL)
14/09/23 15:46:02 INFO scheduler.TaskSetManager: Serialized task 7.0:0 as
535061 bytes in 1 ms
14/09/23 15:46:02 INFO executor.Executor: Running task ID 8
14/09/23 15:46:02 INFO storage.BlockManager: Found block broadcast_0 locally
14/09/23 15:46:03 INFO
storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight:
50331648, targetRequestSize: 10066329
14/09/23 15:46:03 INFO
storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty
blocks out of 1 blocks
14/09/23 15:46:03 INFO
storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote
fetches in 1 ms
14/09/23 15:46:04 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
map of 452 MB to disk (1 time so far)
14/09/23 15:46:07 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
map of 452 MB to disk (2 times so far)
14/09/23 15:46:09 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
map of 438 MB to disk (3 times so far)
14/09/23 15:46:12 WARN collection.ExternalAppendOnlyMap: Spilling in-memory
map of 479 MB to disk (4 times so far)
14/09/23 15:46:22 ERROR executor.Executor: Exception in task ID 8
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14/09/23 15:46:22 WARN scheduler.TaskSetManager: Lost TID 8 (task 7.0:0)
14/09/23 15:46:22 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught
exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14/09/23 15:46:22 WARN scheduler.TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-23 Thread jatinpreet
Xiangrui, thanks for replying.

I am using a subset of the newsgroup20 data. I will send you the vectorized
data for analysis shortly.

I have tried running in local mode as well, but I get the same OOM exception.
I started with 4GB of data and then moved to a smaller set to verify that
everything was fine, but I get the error on this small data too. I ultimately
want the system to handle any amount of data we throw at it without OOM
exceptions.

My concern is how Spark will behave with a lot of data during the training
and prediction phases. I need to know exactly what the memory requirements
are for a given set of data and where the memory is needed (driver or
executor). If there are any guidelines for this, that would be great.

Thanks, 
Jatin 



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-exception-in-MLlib-s-naive-baye-s-classification-training-tp14809p14969.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: New API for TFIDF generation in Spark 1.1.0

2014-09-20 Thread jatinpreet
Thanks Xiangrui and RJ for the responses.

RJ, I have created a JIRA issue for this. It would be great if you could look
into it. Following is the link to the improvement task:
https://issues.apache.org/jira/browse/SPARK-3614

Let me know if I can be of any help, and please keep me posted!

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543p14737.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread jatinpreet
Hi,

I have been running into memory overflow issues while creating the TF-IDF
vectors to be used for document classification with MLlib's Naive Bayes
implementation, following this example:

http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

Memory overflow and GC issues occur while collecting the IDFs for all the
terms. To give an idea of scale, I am reading around 615,000 small documents
(around 4GB of text data) from HBase and running the Spark program with 8
cores and 6GB of executor memory. I have tried increasing the parallelism
level and the shuffle memory fraction, but to no avail.

The new TF-IDF generation APIs caught my eye in the latest Spark version,
1.1.0. The example given in the official documentation shows the creation of
TF-IDF vectors based on the hashing trick. I want to know whether it will
solve the mentioned problem by benefiting from reduced memory consumption.

Also, the example does not state how to create labeled points for a corpus
of pre-classified document data. For example, my training input looks
something like this,

DocumentType  |  Content
-
D1   |  This is Doc1 sample.
D1   |  This also belongs to Doc1.
D1   |  Yet another Doc1 sample.
D2   |  Doc2 sample.
D2   |  Sample content for Doc2.
D3   |  The only sample for Doc3.
D4   |  Doc4 sample looks like this.
D4   |  This is Doc4 sample content.

I want to create labeled points from this sample data for training. Once the
Naive Bayes model is created, I will generate TF-IDF vectors for the test
documents and predict the document type.

If the new API can solve my issue, how can I generate labeled points using
the new APIs? An example would be great.

Also, I have a special requirement of ignoring terms that occur in fewer than
two documents. This has important implications for the accuracy of my use
case and needs to be accommodated while generating the TF-IDF vectors.
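
For concreteness, a minimal sketch of the kind of pipeline I am imagining
(rows is a hypothetical RDD of (documentType, text) pairs, label_index a
hypothetical dict mapping each type to a numeric label, and the minDocFreq
parameter assumes a release whose IDF supports it):

    from pyspark.mllib.feature import HashingTF, IDF
    from pyspark.mllib.regression import LabeledPoint

    # rows: hypothetical RDD of (documentType, text) pairs, e.g. ("D1", "This is Doc1 sample.")
    # label_index: hypothetical dict such as {"D1": 0.0, "D2": 1.0, "D3": 2.0, "D4": 3.0}
    labels = rows.map(lambda r: label_index[r[0]])
    tokens = rows.map(lambda r: r[1].lower().split())

    tf = HashingTF().transform(tokens)      # hashing trick: no in-memory term dictionary
    tf.cache()
    idf = IDF(minDocFreq=2).fit(tf)         # ignore terms seen in fewer than 2 documents
    tfidf = idf.transform(tf)

    # zip lines up because labels and tfidf are both derived from rows by maps
    training = labels.zip(tfidf).map(lambda pair: LabeledPoint(pair[0], pair[1]))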

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accuracy hit in classification with Spark

2014-09-15 Thread jatinpreet
Hi,

I have been able to get the same accuracy with MLlib as with Mahout. The
pre-processing phase of Mahout was the reason behind the accuracy mismatch.
After studying and applying the same logic in my code, it worked like a
charm.

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p14221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi,

I had been using Mahout's Naive Bayes algorithm to classify document data.
For a specific train and test set, I was getting accuracy in the range of
86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity of
82%.

I am using the same version of Lucene and the same logic to generate the
TF-IDF vectors. I tried fiddling with the smoothing parameter, but to no
avail.

My question is: if the underlying algorithm is the same in both Mahout and
MLlib, why is this accuracy dip being observed?



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Hi,

I tried running the classification program on the famous newsgroup data. This
had an even more drastic effect on the accuracy, which dropped from ~82% in
Mahout to ~72% in Spark MLlib.

Please help me in this regard, as I have to use Spark in a production system
very soon and this is a blocker for me.

Thanks,
Jatin 



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p13792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
Thanks for the information, Xiangrui. I am using the following example to
classify documents:

http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

I am not sure whether this is the best way to convert textual data into
vectors. Can you please confirm whether this is the ideal solution, as I
could not identify any shortcomings?

Also, I am splitting the data into 70/30 sets, which is the same as for
Mahout, so it should not have an impact on accuracy.
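
For reference, the split itself is just a randomSplit call; a minimal sketch
(data is a hypothetical RDD of LabeledPoint built from the TF-IDF vectors):

    from pyspark.mllib.classification import NaiveBayes

    train, test = data.randomSplit([0.7, 0.3], seed=42)   # fixed seed for repeatability

    model = NaiveBayes.train(train, 1.0)
    accuracy = (test.map(lambda p: (model.predict(p.features), p.label))
                    .filter(lambda pair: pair[0] == pair[1])
                    .count() / float(test.count()))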

Thanks,
Jatin




-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p13811.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accuracy hit in classification with Spark

2014-09-09 Thread jatinpreet
I have also run some tests on the other algorithms available in MLlib but got
dismal accuracy. Is the method of creating the LabeledPoint RDD different for
other algorithms, such as LinearRegressionWithSGD?

Any help is appreciated.



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p13812.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark on Hadoop with Java 8

2014-08-27 Thread jatinpreet
Hi,

I am contemplating the use of Hadoop with Java 8 in a production system. I
will be using Apache Spark for most of the computations on data stored in
HBase.

Although Hadoop seems to support JDK 8 with some tweaks, the official HBase
site states the following for version 0.98:

"Running with JDK 8 works but is not well tested. Building with JDK 8 would
require removal of the deprecated remove() method of the PoolMap class and is
under consideration. See HBASE-7608 for more information about JDK 8
support."

I am inclined towards using JDK 8 specifically for its support of lambda
expressions, which will take a lot of verbosity out of my Spark programs (the
Scala learning curve is a deterrent for me, as a possible bottleneck for
future talent acquisition).

Is it a good idea to use the Spark/Hadoop/HBase combo with Java 8 at the
moment?

Thanks,
Jatin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Hadoop-with-Java-8-tp12883.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org