Spark SQL: Join operation failure
Hi, I am having a hard time running an outer join on two Parquet datasets. The datasets are large, ~500 GB, with a lot of columns, on the order of 1000. Per the limits the YARN administrator has imposed on the queue, I can have a total of 20 vcores and 8 GB of memory per executor. I specified memory overhead and increased the number of shuffle partitions, to no avail. This is how I submitted the job with PySpark:

spark-submit --master yarn-cluster --executor-memory 5500m --num-executors 19 --executor-cores 1 --conf spark.yarn.executor.memoryOverhead=2000 --conf spark.sql.shuffle.partitions=2048 --driver-memory 7g --queue ./

The relevant code is:

cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
joined_df.write.save("/user/data/output")

I am getting errors like these:

ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_e36_1487531133522_0058_01_06 on host: dn2.bigdatalab.org. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e36_1487531133522_0058_01_06
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
    at org.apache.hadoop.util.Shell.run(Shell.java:844)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 52

and:

FetchFailed(null, shuffleId=0, mapId=-1, reduceId=508, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:695)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:691)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:691)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:145)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
)

I would appreciate it if someone could help me out with this.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-operation-failure-tp28414.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
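Executor exit code 52 in Spark typically corresponds to an OutOfMemoryError on the executor, and the MetadataFetchFailedException is the usual knock-on effect of losing that executor's shuffle output. Raising spark.sql.shuffle.partitions helps because each join task then holds fewer rows in memory. The snippet below illustrates that effect with plain-Python hash partitioning; it is an illustration only, not Spark code:

```python
# Illustration (plain Python, no Spark needed) of why raising
# spark.sql.shuffle.partitions shrinks each join task's working set:
# rows with the same key still meet in one partition, but rows of
# *different* keys spread over more, smaller partitions.

def partition_of(key, num_partitions):
    # Spark's HashPartitioner does essentially this with the key's hashCode.
    return hash(key) % num_partitions

def partition_sizes(keys, num_partitions):
    sizes = [0] * num_partitions
    for k in keys:
        sizes[partition_of(k, num_partitions)] += 1
    return sizes

keys = [f"k{i}" for i in range(100_000)]
small = partition_sizes(keys, 200)    # Spark's default shuffle partitions
large = partition_sizes(keys, 2048)   # the setting used in this post

# More partitions -> fewer rows (and less memory) per task.
print(max(small), max(large))
```

Note this only helps when the keys themselves are well distributed; a single hot join key still lands entirely in one task no matter how many partitions there are.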
CHAID Decision Trees
Hi, I wish to know whether MLlib supports CHAID regression and classification trees. If yes, how can I build them in Spark? Thanks, Jatin
Re: CHAID Decision Trees
Hi Feynman, Thanks for the information. Is there a way to render a decision tree as a visualization for large amounts of data, using any other technique/library? Thanks, Jatin

On Tue, Aug 25, 2015 at 11:42 PM, Feynman Liang fli...@databricks.com wrote: Nothing is in JIRA (https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22), so AFAIK no; only random forests and GBTs using entropy or Gini for information gain are supported.

-- Regards, Jatinpreet Singh
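Since CHAID (chi-squared based splitting) is not implemented, the two impurity measures MLlib's trees do support, entropy and Gini, are worth stating precisely. A quick plain-Python reference:

```python
import math

# The two split-quality measures available in MLlib's trees/forests.

def gini(class_probs):
    # Gini impurity: 1 - sum(p_i^2)
    return 1.0 - sum(p * p for p in class_probs)

def entropy(class_probs):
    # Shannon entropy in bits: -sum(p_i * log2(p_i))
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

print(gini([0.5, 0.5]))     # 0.5 -- maximal for two classes
print(entropy([0.5, 0.5]))  # 1.0 bit
print(gini([1.0, 0.0]))     # 0.0 -- a pure node
```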
High GC time
Hi, I am getting very high GC time in my jobs. For smaller/real-time loads this becomes a real problem. Below are the details of a task set I just ran. What could be the cause of such skewed GC times?

Index  ID    Attempt  Status   Locality       Executor / Host  Launch Time          Duration  GC Time  Shuffle Read  Shuffle Write
36     2601  0        SUCCESS  PROCESS_LOCAL  2 / Slave1       2015/03/17 11:18:44  20 s      11 s     132.7 KB      135.8 KB
37     2602  0        SUCCESS  PROCESS_LOCAL  2 / Slave1       2015/03/17 11:18:44  15 s      11 s     79.4 KB       82.5 KB
38     2603  0        SUCCESS  PROCESS_LOCAL  1 / Slave2       2015/03/17 11:18:44  2 s       0.7 s    0.0 B         37.8 KB
39     2604  0        SUCCESS  PROCESS_LOCAL  0 / slave3       2015/03/17 11:18:45  21 s      18 s     77.9 KB       79.8 KB
40     2605  0        SUCCESS  PROCESS_LOCAL  2 / Slave1       2015/03/17 11:18:45  14 s      10 s     73.0 KB       74.9 KB
41     2606  0        SUCCESS  PROCESS_LOCAL  2 / Slave1       2015/03/17 11:18:45  14 s      10 s     74.4 KB       76.5 KB
42     2607  0        SUCCESS  PROCESS_LOCAL  0 / Slave3       2015/03/17 11:18:45  12 s      12 s     10.9 KB       12.8 KB

Thanks
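GC pauses like these are usually investigated by enabling GC logging on the executors, so the pauses can be attributed to allocation pressure or an undersized heap. A hedged starting point: the JVM flags below are standard HotSpot options, but the memory value and script name are placeholders to tune for your own cluster.

```shell
# Turn on GC logging in each executor JVM so pauses show up in the
# executor logs (flag names are standard; values/names are placeholders).
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --executor-memory 4g \
  my_job.py
```

If the logs show frequent full GCs, the usual levers are more executor memory, fewer cached objects per task, or more (smaller) tasks.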
Re: Some tasks taking too much time to complete in a stage
Hi Imran, Thanks for pointing that out. My data comes from the HBase connector for Spark, so I do not govern the distribution of the data myself; HBase decides to put the data on any of the region servers. Is there a way to distribute the data evenly? I am especially interested in running even small loads very quickly, apart from bulk loads. Thanks, Jatin

On Thu, Feb 19, 2015 at 10:28 PM, Imran Rashid iras...@cloudera.com wrote: Almost all your data is going to one task. You can see that the shuffle read for task 0 is 153.3 KB, while for most other tasks it is just 26 B (which is probably just some header saying there are no actual records). You need to ensure your data is more evenly distributed before this step.

On Thu, Feb 19, 2015 at 10:53 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I am running Spark 1.2.1 for compute-intensive jobs comprising multiple tasks. I have observed that most tasks complete very quickly, but there are always one or two tasks that take a long time to complete, thereby increasing the overall stage time. What could be the reason for this? Following are the statistics for one such stage. As you can see, the task with index 0 takes 1.1 minutes whereas the others complete much more quickly.
Aggregated Metrics by Executor

Executor ID  Address       Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Input  Output  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            slave1:56311  46 s       13           0             13               0.0 B  0.0 B   0.0 B         0.0 B          0.0 B                   0.0 B
1            slave2:42648  2.1 min    13           0             13               0.0 B  0.0 B   384.3 KB      0.0 B          0.0 B                   0.0 B
2            slave3:44322  23 s       12           0             12               0.0 B  0.0 B   136.4 KB      0.0 B          0.0 B                   0.0 B
3            slave4:37987  44 s       12           0             12               0.0 B  0.0 B   213.9 KB      0.0 B          0.0 B                   0.0 B

Tasks (all SUCCESS, attempt 0, PROCESS_LOCAL, launched 2015/02/19 11:40:05)

Index  ID   Executor / Host  Duration  GC Time  Shuffle Read
0      213  1 / slave2       1.1 min   1 s      153.3 KB
1      214  3 / slave4       2 s       0.9 s    13.8 KB
2      215  2 / slave3       27 ms              26.0 B
3      216  0 / slave1       11 ms              0.0 B
4      217  1 / slave2       26 ms              26.0 B
5      218  3 / slave4       23 ms              26.0 B
6      219  2 / slave3       23 ms              26.0 B
7      220  0 / slave1       11 ms              0.0 B
8      221  1 / slave2       23 ms              26.0 B
9      222  3 / slave4       23 ms              26.0 B
10     223  2 / slave3       23 ms              26.0 B
11     224  0 / slave1       10 ms              0.0 B
12     225  1 / slave2       22 ms              26.0 B
13     226  3 / slave4       23 ms              26.0 B
14     227  2 / slave3       24 ms              26.0 B
15     228  0 / slave1       10 ms              0.0 B
16     229  1 / slave2       22 ms              26.0 B
17     230  3 / slave4       22 ms              26.0 B
18     231  2 / slave3       24 ms              26.0 B
19     232  0 / slave1       10 ms              0.0 B
20     233  1 / slave2       28 ms              26.0 B
21     234  3 / slave4       25 ms              26.0 B
22     235  2 / slave3       21 ms              26.0 B
23     236  0 / slave1       10 ms              0.0 B
25     238  3 / slave4       20 ms              26.0 B
27     240  0 / slave1       10 ms              0.0 B
28     241  1 / slave2       27 ms              26.0 B

Thanks
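When one key carries most of the rows (as task 0's 153.3 KB versus 26.0 B everywhere else suggests), a common fix is "salting" the hot key: appending a random suffix so its records spread over several partitions. A plain-Python illustration of the idea; in Spark this would be a map over the keys before the shuffle, with the other side of any join replicated across the same suffixes:

```python
import random
from collections import Counter

# Salting a hot key spreads its records across SALT_BUCKETS sub-keys,
# so no single task has to process all of them.
SALT_BUCKETS = 4

def salt(key):
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

rows = ["hot_key"] * 1000 + ["rare_key"] * 10   # ~99% of rows share one key
plain = Counter(rows)
salted = Counter(salt(k) for k in rows)

print(plain.most_common(1))    # all 1000 hot rows grouped under one key
print(salted.most_common(1))   # the hot key is now split roughly 4 ways
```

The cost is an extra aggregation (or join-side replication) step to merge the salted sub-results back together.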
OptionalDataException during Naive Bayes Training
Hi, I am using Spark version 1.1 in standalone mode on the cluster. Sometimes, during Naive Bayes training, I get an OptionalDataException at the line "map at NaiveBayes.scala:109". I am getting the following exception on the console:

java.io.OptionalDataException:
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1371)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    java.util.HashMap.readObject(HashMap.java:1394)
    sun.reflect.GeneratedMethodAccessor626.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:483)
    java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

What could be the reason behind this? Thanks
Clustering text data with MLlib
Hi, I wish to cluster a set of textual documents into an undefined number of classes. The clustering algorithm provided in MLlib, K-means, requires me to give a predefined number of classes. Is there any algorithm intelligent enough to identify how many classes should be made, based on the input documents? I want to utilize the speed and agility of Spark in the process. Thanks, Jatin - Novice Big Data Programmer
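For clustering without a preset k, one family of algorithms discovers the cluster count from a distance threshold instead. Below is a tiny plain-Python "leader clustering" sketch over 2-D points, purely illustrative; for real document vectors you would use cosine distance over TF-IDF features, or algorithms such as DBSCAN or hierarchical clustering (not part of MLlib at the time):

```python
import math

def leader_cluster(points, radius):
    # Assign each point to the first existing cluster whose leader is
    # within `radius`; otherwise start a new cluster. The number of
    # clusters is discovered, not supplied.
    leaders = []       # one representative point per discovered cluster
    assignments = []
    for p in points:
        for i, q in enumerate(leaders):
            if math.dist(p, q) <= radius:
                assignments.append(i)
                break
        else:
            leaders.append(p)
            assignments.append(len(leaders) - 1)
    return leaders, assignments

pts = [(0, 0), (0.1, 0.1), (5, 5), (5.1, 4.9), (10, 0)]
leaders, labels = leader_cluster(pts, radius=1.0)
print(len(leaders))   # 3 clusters discovered, no k supplied
```

The result depends on input order and on the radius, which trades one tuning knob (k) for another, but the threshold is often easier to reason about for documents (a minimum similarity).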
Re: Accessing posterior probability of Naive Baye's prediction
Thanks Sean, it did turn out to be a simple mistake after all. I appreciate your help. Jatin

On Thu, Nov 27, 2014 at 7:52 PM, Sean Owen wrote: No, the feature vector is not converted. It contains the count n_i of how often each term t_i occurs (or a TF-IDF transformation of those). You are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is maximized. In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ... So your n_1 counts (or TF-IDF values) are used as-is, and this is where the dot product comes from. Your bug is probably something lower-level and simple. I'd debug the Spark example and print exactly its values for the log priors and conditional probabilities, and the matrix operations, and yours too, and see where the difference is.

On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet wrote: Hi, I have been running into some trouble while converting the code to Java. I have done the matrix operations as directed and tried to find the maximum score for each category, but the predicted category is mostly different from the prediction made by MLlib. I am fetching iterators over pi, theta and testData to do my calculations. pi and theta are in log space while my testData vector is not; could that be a problem? I didn't see an explicit conversion in MLlib either. For example, for two categories and 5 features, I am doing the following operation:

[1, 2] + [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]] * [1, 2, 3, 4, 5]

These are simple element-wise matrix multiplication and addition operations.
-- Regards, Jatinpreet Singh
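Sean's log-space scoring rule is easy to sanity-check outside MLlib. A plain-Python sketch with made-up numbers (two classes, two terms); every value here is illustrative:

```python
import math

# Score each class c as log(P(c)) + sum_i n_i * log(P(t_i|c)),
# with the raw term counts n_i used as-is (NOT logged).
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_cond = {                      # log P(term_i | class), per class
    "spam": [math.log(0.7), math.log(0.3)],
    "ham":  [math.log(0.2), math.log(0.8)],
}
counts = [3, 1]                   # term counts (or TF-IDF values) of the test doc

def score(c):
    return log_prior[c] + sum(n * lp for n, lp in zip(counts, log_cond[c]))

best = max(log_prior, key=score)
print(best)   # "spam": three occurrences of the spam-leaning term dominate
```

This makes the common mistake visible: the counts stay linear while the model parameters are logs, so nothing on the test-vector side should be exponentiated or logged.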
Re: Accessing posterior probability of Naive Baye's prediction
Hi, I have been running into some trouble while converting the code to Java. I have done the matrix operations as directed and tried to find the maximum score for each category, but the predicted category is mostly different from the prediction made by MLlib. I am fetching iterators over pi, theta and testData to do my calculations. pi and theta are in log space while my testData vector is not; could that be a problem? I didn't see an explicit conversion in MLlib either. For example, for two categories and 5 features, I am doing the following operation:

[1, 2] + [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]] * [1, 2, 3, 4, 5]

These are simple element-wise matrix multiplication and addition operations. Following is the code:

Iterator<Tuple2<Object, Object>> piIterator = piValue.iterator();
Iterator<Tuple2<Tuple2<Object, Object>, Object>> thetaIterator = thetaValue.iterator();
Iterator<Tuple2<Object, Object>> testDataIterator = null;
double[] scores = new double[piValue.size()];
while (piIterator.hasNext()) {
    double score = 0.0;
    // reset to index 0
    testDataIterator = testData.toBreeze().iterator();
    while (testDataIterator.hasNext()) {
        Tuple2<Object, Object> testTuple = testDataIterator.next();
        Tuple2<Tuple2<Object, Object>, Object> thetaTuple = thetaIterator.next();
        score += ((double) testTuple._2 * (double) thetaTuple._2);
    }
    Tuple2<Object, Object> piTuple = piIterator.next();
    score += (double) piTuple._2;
    scores[(int) piTuple._1] = score;
    if (maxScore < score) {
        predictedCategory = (int) piTuple._1;
        maxScore = score;
    }
}

Where am I going wrong? Thanks, Jatin - Novice Big Data Programmer
Re: Accessing posterior probability of Naive Baye's prediction
Hi Sean, The values brzPi and brzTheta are of the form breeze.linalg.DenseVector[Double]. Would I have to convert them back to simple vectors and use a library to perform the addition/multiplication? If yes, can you please point me to the conversion logic and a vector-operation library for Java? Thanks, Jatin - Novice Big Data Programmer
Accessing posterior probability of Naive Baye's prediction
Hi, I am trying to access the posterior probability of a Naive Bayes prediction in MLlib using Java. As the member variables brzPi and brzTheta are private, I applied a hack to access the values through reflection. I am using Java and couldn't find a way to use the breeze library from Java. If I am correct, the relevant calculation is on line 66 of the NaiveBayesModel class:

labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))

Here the element-wise addition and multiplication of DenseVectors are expressed as operators, which are not directly accessible from Java. Also, the use of brzArgmax is not very clear to me from Java. Can anyone please help me convert the above calculation from Scala to Java? PS: I have raised an improvement request on JIRA for making these variables directly accessible from outside. Thanks, Jatin - Novice Big Data Programmer
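For reference, the Scala one-liner above unpacked with plain Python lists (values are illustrative; plain lists stand in for the breeze vectors):

```python
# brzTheta * testData is a matrix-vector product, brzPi is added
# element-wise, and brzArgmax picks the index of the largest score,
# which then indexes into labels.
pi = [0.2, 0.5]                      # log priors, one per class
theta = [[0.1, 0.2, 0.3],            # log conditionals, classes x features
         [0.3, 0.1, 0.1]]
test_data = [1.0, 2.0, 0.0]          # term counts / TF-IDF of the test doc
class_labels = [10.0, 20.0]

scores = [p + sum(t * x for t, x in zip(row, test_data))
          for p, row in zip(pi, theta)]
predicted = class_labels[max(range(len(scores)), key=scores.__getitem__)]
print(predicted)
```

The un-normalized `scores` are exactly the log posteriors up to a constant, which is why exposing them (rather than only the argmax) gives the confidence value asked about elsewhere in this thread.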
Re: Spark serialization issues with third-party libraries
Thanks Arush! Your example is nice and easy to understand. I am implementing it in Java, though. Jatin - Novice Big Data Programmer
Re: Spark serialization issues with third-party libraries
Thanks Sean, I was actually using instances created elsewhere inside my RDD transformations, which, as I understand it, is against the Spark programming model. I was referred to a talk on UIMA and Spark integration from this year's Spark Summit, which had a workaround for this problem: I just had to make some class members transient. http://spark-summit.org/2014/talk/leveraging-uima-in-spark Thanks - Novice Big Data Programmer
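The "transient member" workaround has a direct analogue in Python's pickle protocol (which PySpark uses for closures): exclude the non-serializable member from the serialized state and rebuild it lazily on the worker. A sketch, with a lock standing in for a UIMA analysis engine; all names here are made up:

```python
import pickle
import threading

class Annotator:
    def __init__(self):
        self._engine = threading.Lock()   # stands in for a non-picklable engine

    def __getstate__(self):
        # Like marking the field transient: drop it from the serialized state.
        state = self.__dict__.copy()
        state["_engine"] = None
        return state

    @property
    def engine(self):
        # Rebuilt lazily after deserialization, once per worker.
        if self._engine is None:
            self._engine = threading.Lock()
        return self._engine

a = pickle.loads(pickle.dumps(Annotator()))
print(a.engine is not None)   # True: round-trips, then rebuilds the engine
```

In Java the equivalent is a `transient` field plus a null-check (or a `readObject` hook) that re-creates the engine on first use.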
Re: Naive Baye's classification confidence
Thanks a lot, Sean. You are correct in assuming that my examples fall under a single category. It is interesting to see that the posterior probability can actually be treated as something stable enough to put a constant threshold on, per class. It would, I assume, keep changing for a sample as I add/remove documents in the training set, and thus warrant a corresponding change in the threshold. Also, I have seen the class prediction probabilities range from 0.003 to 0.8 for correct classifications in my sample data. This is a wide spectrum, so is there a way to change that? Maybe by replicating the samples for the classes where I get low-confidence but accurate classifications. Thanks, Jatin - Novice Big Data Programmer
Re: Naive Baye's classification confidence
Sean, my last sentence didn't come out right; let me try to explain my question again. For instance, I have two categories, C1 and C2. I have trained on 100 samples of C1 and 10 samples of C2. Now I predict two samples, one each of C1 and C2, namely S1 and S2 respectively. I get the following prediction results:

S1 = Category: C1, Probability: 0.7
S2 = Category: C2, Probability: 0.04

Now, both predictions are correct, but their probabilities are far apart. Can I improve the prediction probability by taking the 10 samples I have of C2 and replicating each of them 10 times, making the total count equal to 100, the same as C1? Can I expect this to increase the probability for sample S2 after training on the new set? Is this a viable approach? Thanks, Jatin - Novice Big Data Programmer
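Replicating exact copies changes only the class prior P(C2), not the per-term conditionals, so the reported posterior does rise, without any new evidence. A toy plain-Python check (all likelihood numbers are made up):

```python
def posterior(prior_c, lik_c, prior_other, lik_other):
    # Bayes: P(c|x) = P(c)P(x|c) / (P(c)P(x|c) + P(other)P(x|other))
    num = prior_c * lik_c
    return num / (num + prior_other * lik_other)

lik_c2, lik_c1 = 0.02, 0.001   # P(x|C2), P(x|C1) for sample S2, illustrative

before = posterior(10/110, lik_c2, 100/110, lik_c1)   # 100 C1 + 10 C2 samples
after  = posterior(100/200, lik_c2, 100/200, lik_c1)  # C2 replicated 10x

print(before, after)
```

So replication "works" in the sense that the number goes up, but only because the prior was inflated; any per-class threshold would need retuning, which is the caveat behind treating these values as calibrated confidence.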
Re: Naive Baye's classification confidence
I believe assuming uniform priors is the way to go for my use case. I am not sure how to 'drop the prior term' with MLlib; I am just providing the samples as they come, after creating term vectors for each sample. But I guess I can Google that information. I appreciate all the help. The Spark community is amazing! - Novice Big Data Programmer
Spark serialization issues with third-party libraries
Hi, I am planning to use the UIMA library to process data in my RDDs. I have had bad experiences using third-party libraries inside worker tasks: the system gets plagued with serialization issues. As UIMA classes are not necessarily Serializable, I am not sure whether it will work. Could someone explain which classes need to be Serializable and which can be left as they are? A clear understanding would help me a lot. Thanks, Jatin - Novice Big Data Programmer
Naive Baye's classification confidence
I have been trying the Naive Bayes implementation in Spark's MLlib. During the testing phase, I wish to eliminate data with a low prediction confidence. My data set primarily consists of form-based documents like reports and application forms. They contain key-value-pair-style text, and hence I assume the independence assumption holds better than with natural language. As for the quality of the priors, I am not doing anything special: I am training on more or less equal numbers of samples for each class and have left the heavy lifting to MLlib. Given these facts, does it make sense to define confidence thresholds for each category, above which I will get correct results consistently? Thanks, Jatin - Novice Big Data Programmer
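The per-category threshold idea itself is a few lines once predictions carry a probability. A plain-Python sketch; the class names, thresholds and probabilities are all illustrative, and in practice the thresholds would be tuned on held-out data per class:

```python
# Keep a prediction only if its posterior probability clears the
# threshold for the predicted class; otherwise route it to review.
THRESHOLDS = {"invoice": 0.6, "application": 0.4}   # tuned per class (made up)

def accept(predicted_class, probability):
    return probability >= THRESHOLDS[predicted_class]

predictions = [("invoice", 0.91), ("invoice", 0.35), ("application", 0.45)]
kept = [(c, p) for c, p in predictions if accept(c, p)]
print(kept)   # the low-confidence invoice prediction is dropped
```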
Re: MLlib Naive Bayes classifier confidence
Thanks for the answer. The variables brzPi and brzTheta are declared private. I am writing my code in Java; otherwise I could have replicated the Scala class and performed the desired computation, which is, as I observed, a multiplication of brzTheta with the test vector, with the result added to brzPi. Any suggestions for a way out, other than replicating the whole functionality of the Naive Bayes model in Java? That would be a time-consuming process. - Novice Big Data Programmer
Re: MLlib Naive Bayes classifier confidence
Thanks, I will try it out and raise a request for making the variables accessible. An unrelated question: do you think the probability value thus calculated is a good measure of confidence in the prediction? I have been reading mixed opinions about this. Jatin - Novice Big Data Programmer
Re: Spark cluster stability
Great! Thanks for the information. I will try it out. - Novice Big Data Programmer
Spark cluster stability
Hi, I am running a small 6-node Spark cluster for testing purposes. Recently, one of the nodes' disks was filled up by temporary files and there was no space left on it. Because of this, my Spark jobs started failing, even though the node was still shown as 'Alive' on the Spark web UI. Once I logged on to the machine and cleaned up some trash, I was able to run the jobs again. My question is: how reliable can my Spark cluster be if issues like these can bring down my jobs? I would have expected Spark to stop using this node, or at least to distribute its work to other nodes. But as the node was still alive, Spark tried to run tasks on it regardless. Thanks, Jatin - Novice Big Data Programmer
Serialize/deserialize Naive Bayes model and index files
Hi, I am trying to persist the files generated as a result of Naive Bayes training with MLlib. These comprise the model file, the label index (my own class) and the term dictionary (my own class). I need to save them to an HDFS location and then deserialize them when needed for prediction. How can I do this with Spark? Also, I have the option of saving these instances in HBase in binary form. Which approach makes more sense? Thanks, Jatin - Novice Big Data Programmer
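The serialize/deserialize round trip itself is straightforward. A plain-Python sketch using pickle, with a local temp directory standing in for the HDFS path and dicts standing in for the model, label-index and term-dictionary classes (in the Java stack used in this thread, plain java.io serialization written through Hadoop's FileSystem API would be the analogue):

```python
import os
import pickle
import tempfile

# Placeholders for the real model / label index / term dictionary objects.
model = {"pi": [0.3, 0.7]}
label_index = {0: "invoice", 1: "report"}
term_dictionary = {"total": 0, "amount": 1}

out_dir = tempfile.mkdtemp()   # stands in for the HDFS output location
for name, obj in [("model", model), ("labels", label_index), ("terms", term_dictionary)]:
    with open(os.path.join(out_dir, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

# Later, at prediction time, read one of them back.
with open(os.path.join(out_dir, "labels.pkl"), "rb") as f:
    restored = pickle.load(f)
print(restored == label_index)   # True
```

Between HDFS files and HBase binary cells, the usual trade-off is that files are simpler for a handful of infrequently-read model artifacts, while HBase pays off when many models must be looked up by key at serving time.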
Re: Out of memory exception in MLlib's naive baye's classification training
Hi, I was able to get the training running in local mode with default settings; there was a problem with the document labels, which were quite large in number (not 20 as suggested earlier). I am currently training on 175,000 documents on a single node with 2 GB of executor memory and 5 GB of driver memory successfully. If I increase the number of documents, I get the OOM error. I wish to understand where the bottlenecks generally are for Naive Bayes: the executor or the driver memory? Also, what are the things to keep in mind while training on huge data sets, so that I can have a bulletproof classification system? Slowing down under low memory is fine, but not exceptions. As a side note, is there any classification algorithm in MLlib that can just append new training data to an existing model? With Naive Bayes, I need to have all the data available at once for training. Thanks, Jatin - Novice Big Data Programmer
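On the side note: multinomial Naive Bayes is in principle amenable to incremental updates, because the model is just per-class and per-term counts, so new batches can be folded in without retraining from scratch. MLlib 1.x did not expose such an update API; the sketch below is plain Python, not Spark:

```python
from collections import defaultdict

class IncrementalNB:
    # Minimal count-based multinomial NB "model" that accepts new batches.
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.term_counts = defaultdict(lambda: defaultdict(int))

    def update(self, docs):
        # docs: iterable of (label, {term: count}); counts simply accumulate.
        for label, terms in docs:
            self.class_counts[label] += 1
            for t, n in terms.items():
                self.term_counts[label][t] += n

nb = IncrementalNB()
nb.update([("spam", {"win": 2})])
nb.update([("spam", {"win": 1}), ("ham", {"hello": 3})])   # a later batch
print(nb.class_counts["spam"], nb.term_counts["spam"]["win"])
```

Priors and conditionals are derived from the accumulated counts at prediction time, so nothing is lost by training in batches.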
Re: Out of memory exception in MLlib's naive baye's classification training
Xiangrui, Yes, the total number of terms is 43839. I have also tried running it using different values of parallelism ranging from 1/core to 10/core. I also used multiple configurations like setting spark.storage.memoryFaction and spark.shuffle.memoryFraction to default values. The point to note here is that I am not using caching or persisting the RDDs and therefore I set the storage fraction to 0. The driver data available under executors tab is as follows for 3GB of allocated memory: Memory: 0.0 B Used (1781.8 MB Total) Disk: 0.0 B Used Executor ID Address RDD Blocks Memory Used Disk Used Active TasksFailed Tasks Complete Tasks Total Tasks Task Time Shuffle ReadShuffle Write driverephesoft29:594940 0.0 B / 1781.8 MB 0.0 B 1 0 4 5 19.3 s 0.0 B 27.5 MB Memory used value is always 0 for the driver. Is there something fishy here? The out of memory exception occurs in NaiveBayes.scala at combineByKey (line 91) or collect (line 96) based on the heap size allocated. In the memory profiler, the program runs fine until TFIDF creation, but when training starts, the memory usage goes up until the point of failure. I want to understand if the OOM exception is occurring on driver or the worker node.It should not be worker node, because as I understand, spark automatically spills the data from memory to disk if available memory is not adequate. Then why do I get these errors at all? If it is the driver, then how do I calculate the total memory requirements as 3-4 GB ram for training approximately 13 MB of training data with 43839 terms is preposterous. My expectation was that with spark was that if the memory is available it would be much faster than Mahout, but if enough memory is not there, then it would only be slower and not throw exceptions. Mahout ran fine with much larger data, and it too had to collect a lot of data on a single node during training. May be I am not getting the point here due to my limited knowledge of Spark. 
Please help me out with this and point me in the right direction. Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-exception-in-MLlib-s-naive-baye-s-classification-training-tp14809p14879.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Out of memory exception in MLlib's Naive Bayes classification training
I get the following stacktrace if it is of any help.

14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7 (MapPartitionsRDD[24] at combineByKey at NaiveBayes.scala:91), which is now runnable
14/09/23 15:46:02 INFO executor.Executor: Finished task ID 7
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 7 (MapPartitionsRDD[24] at combineByKey at NaiveBayes.scala:91)
14/09/23 15:46:02 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
14/09/23 15:46:02 INFO scheduler.TaskSetManager: Starting task 7.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL)
14/09/23 15:46:02 INFO scheduler.TaskSetManager: Serialized task 7.0:0 as 535061 bytes in 1 ms
14/09/23 15:46:02 INFO executor.Executor: Running task ID 8
14/09/23 15:46:02 INFO storage.BlockManager: Found block broadcast_0 locally
14/09/23 15:46:03 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/23 15:46:03 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/09/23 15:46:03 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 1 ms
14/09/23 15:46:04 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 452 MB to disk (1 time so far)
14/09/23 15:46:07 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 452 MB to disk (2 times so far)
14/09/23 15:46:09 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 438 MB to disk (3 times so far)
14/09/23 15:46:12 WARN collection.ExternalAppendOnlyMap: Spilling in-memory map of 479 MB to disk (4 times so far)
14/09/23 15:46:22 ERROR executor.Executor: Exception in task ID 8
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
14/09/23 15:46:22 WARN scheduler.TaskSetManager: Lost TID 8 (task 7.0:0)
14/09/23 15:46:22 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
14/09/23 15:46:22 WARN scheduler.TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236) at
Re: Out of memory exception in MLlib's Naive Bayes classification training
Xiangrui, thanks for replying. I am using a subset of the newsgroup20 data. I will send you the vectorized data for analysis shortly. I have tried running in local mode as well, but I get the same OOM exception. I started with 4GB of data but then moved to a smaller set to verify that everything was fine, yet I get the error on this small data too. I ultimately want the system to handle any amount of data we throw at it without OOM exceptions. My concern is how Spark will behave with a lot of data during the training and prediction phases. I need to know exactly what the memory requirements are for a given set of data, and where the memory is needed (driver or executor). If there are any guidelines for this, that would be great. Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-exception-in-MLlib-s-naive-baye-s-classification-training-tp14809p14969.html
Re: New API for TFIDF generation in Spark 1.1.0
Thanks Xiangrui and RJ for the responses. RJ, I have created a JIRA for this. It would be great if you could look into it. Here is the link to the improvement task: https://issues.apache.org/jira/browse/SPARK-3614 Let me know if I can be of any help, and please keep me posted! Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543p14737.html
New API for TFIDF generation in Spark 1.1.0
Hi, I have been running into memory overflow issues while creating TFIDF vectors to be used in document classification with MLlib's Naive Bayes implementation: http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

Memory overflow and GC issues occur while collecting IDFs for all the terms. To give an idea of scale, I am reading around 615,000 small documents (around 4GB of text data) from HBase and running the Spark program with 8 cores and 6GB of executor memory. I have tried increasing the parallelism level and the shuffle memory fraction, but to no avail.

The new TFIDF generation APIs caught my eye in the latest Spark version, 1.1.0. The example in the official documentation shows creation of TFIDF vectors based on the hashing trick. I want to know whether it will solve the mentioned problem by benefiting from reduced memory consumption. Also, the example does not state how to create labeled points for a corpus of pre-classified document data. For example, my training input looks something like this:

DocumentType | Content
-------------+------------------------------
D1           | This is Doc1 sample.
D1           | This also belongs to Doc1.
D1           | Yet another Doc1 sample.
D2           | Doc2 sample.
D2           | Sample content for Doc2.
D3           | The only sample for Doc3.
D4           | Doc4 sample looks like this.
D4           | This is Doc4 sample content.

I want to create labeled points from this sample data for training. Once the Naive Bayes model is created, I generate TFIDFs for the test documents and predict the document type. If the new API can solve my issue, how can I generate labeled points using the new APIs? An example would be great. Also, I have a special requirement of ignoring terms that occur in fewer than two documents. This has important implications for the accuracy of my use case and needs to be accommodated while generating TFIDFs.
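To illustrate why the hashing trick reduces memory pressure, here is a plain-Python sketch of the idea (this is my own illustration of the concept, not the actual HashingTF API; the vector width, corpus, and min-document-frequency helper are all illustrative choices):

```python
from collections import Counter

NUM_FEATURES = 1 << 10  # fixed vector width; no global term dictionary needed

def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Term frequencies bucketed by hash(term) % num_features."""
    vec = Counter()
    for t in tokens:
        vec[hash(t) % num_features] += 1
    return vec

def min_df_filter(docs, min_df=2):
    """Drop terms that appear in fewer than min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    keep = {t for t, n in df.items() if n >= min_df}
    return [[t for t in tokens if t in keep] for tokens in docs]

# Toy corpus modeled on the sample above; the class label rides along as a
# (label, vector) pair, which is the general shape of a labeled point.
corpus = [("D1", ["this", "is", "doc1", "sample"]),
          ("D1", ["this", "also", "belongs", "to", "doc1"]),
          ("D2", ["doc2", "sample"])]
filtered = min_df_filter([toks for _, toks in corpus])
labeled = [(label, hashed_tf(toks))
           for (label, _), toks in zip(corpus, filtered)]
```

Because only the fixed-width hashed buckets are kept (no per-term dictionary), memory use is bounded by the chosen vector width rather than by vocabulary size; the trade-off is possible hash collisions between terms.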
Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543.html
Re: Accuracy hit in classification with Spark
Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the same logic in my code, it worked like a charm. Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p14221.html
Accuracy hit in classification with Spark
Hi, I had been using Mahout's Naive Bayes algorithm to classify document data. For a specific train and test set, I was getting accuracy in the range of 86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity of 82%. I am using the same version of Lucene and the same logic to generate TFIDF vectors. I tried fiddling with the smoothing parameter, but to no avail. My question is: if the underlying algorithm is the same in both Mahout and MLlib, why is this accuracy dip observed? - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773.html
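For reference on the smoothing parameter mentioned above: in multinomial Naive Bayes it is the additive (Laplace) smoothing term in the per-class term probability. A minimal plain-Python sketch of that formula (my own simplification, not the MLlib or Mahout code; the counts below are illustrative):

```python
import math

def log_theta(term_count, class_total, vocab_size, lam=1.0):
    """log P(term | class) under additive smoothing with parameter lambda."""
    return math.log((term_count + lam) / (class_total + lam * vocab_size))

# With lam = 1.0 (Laplace), a term never seen in a class still gets a small
# non-zero probability instead of a -infinity log-likelihood.
unseen = log_theta(0, class_total=1000, vocab_size=43839)
seen = log_theta(25, class_total=1000, vocab_size=43839)
```

If both implementations use this form of smoothing, identical smoothing plus identical input vectors should give identical class scores, which would point the accuracy gap at the vectorization/pre-processing step rather than the classifier itself.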
Re: Accuracy hit in classification with Spark
Hi, I tried running the classification program on the famous newsgroup data. This had an even more drastic effect on the accuracy, as it dropped from ~82% in Mahout to ~72% in Spark MLlib. Please help me in this regard as I have to use Spark in a production system very soon and this is a blocker for me. Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p13792.html
Re: Accuracy hit in classification with Spark
Thanks for the information, Xiangrui. I am using the following example to classify documents: http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ I am not sure if this is the best way to convert textual data into vectors. Can you please confirm whether this is the ideal solution, as I could not identify any shortcomings? Also, I am splitting the data into 70/30 sets, which is the same split used for Mahout, so it should not have an impact on accuracy. Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p13811.html
Re: Accuracy hit in classification with Spark
I have also run some tests on the other algorithms available in MLlib but got dismal accuracy. Is the method of creating the LabeledPoint RDD different for other algorithms, such as LinearRegressionWithSGD? Any help is appreciated. - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accuracy-hit-in-classification-with-Spark-tp13773p13812.html
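On the LabeledPoint question: the (label, features) shape is the same for both families of algorithms; what changes is the meaning of the label (a class index for classifiers, a continuous target for regression). A plain-Python sketch of the distinction, plus a standardization helper, since SGD-based linear methods are known to be sensitive to unscaled features, a common cause of poor results (the helper and the sample numbers are my own illustration, not MLlib code):

```python
import statistics

# Classification labels are class indices encoded as floats; regression
# labels are continuous targets. The pair shape is identical.
classification_points = [(0.0, [1.0, 0.0, 200.0]),
                         (1.0, [0.0, 3.0, 100.0])]
regression_points = [(13.7, [1.0, 0.0, 200.0]),
                     (-2.5, [0.0, 3.0, 100.0])]

def standardize(points):
    """Scale each feature column to zero mean / unit variance."""
    labels = [lbl for lbl, _ in points]
    cols = list(zip(*(feats for _, feats in points)))
    scaled = []
    for col in cols:
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col) or 1.0  # avoid dividing by zero
        scaled.append([(x - mu) / sd for x in col])
    return [(lbl, [c[i] for c in scaled]) for i, lbl in enumerate(labels)]

scaled_points = standardize(regression_points)
```

Standardizing (or otherwise scaling) the feature columns before training an SGD-based model usually matters far more for those algorithms than it does for Naive Bayes.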
Spark on Hadoop with Java 8
Hi, I am contemplating the use of Hadoop with Java 8 in a production system. I will be using Apache Spark to do most of the computations on data stored in HBase. Although Hadoop seems to support JDK 8 with some tweaks, the official HBase site states the following for version 0.98: "Running with JDK 8 works but is not well tested. Building with JDK 8 would require removal of the deprecated remove() method of the PoolMap class and is under consideration. See HBASE-7608 for more information about JDK 8 support." I am inclined towards using JDK 8 specifically for the support of lambda expressions, which will take a lot of verbosity out of my Spark programs (the Scala learning curve is a deterrent for me and a possible bottleneck for future talent acquisition). Is it a good idea to use the Spark/Hadoop/HBase combo with Java 8 at the moment? Thanks, Jatin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Hadoop-with-Java-8-tp12883.html