RE: Spark RDD Disk Persistence

2014-07-08 Thread Lizhengbing (bing, BIPA)
You might store your data in Tachyon. From: Jahagirdar, Madhu [mailto:madhu.jahagir...@philips.com] Sent: July 8, 2014 10:16 To: user@spark.apache.org Subject: Spark RDD Disk Persistence Should I use disk-based persistence for RDDs, and if the machine goes down during the program execution, next

Error and doubts in using MLlib Naive Bayes for text classification

2014-07-08 Thread Rahul Bhojwani
Hello, I am a novice. I want to classify text into two classes. For this purpose I want to use a Naive Bayes model. I am using Python for it. Here are the problems I am facing: *Problem 1:* I wanted to use all words as features for the bag-of-words model, which means my features will be count

Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Ionized
The Java API requires a Java Class to register as a table. // Apply a schema to an RDD of JavaBeans and register it as a table. JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class); schemaPeople.registerAsTable("people"); If instead of JavaRDD<Person> I had JavaRDD<List> (along with the
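
For reference, the Scala-side equivalent in Spark 1.0 is sketched below; it still requires a concrete class (a case class) to carry the schema, which is exactly the restriction SPARK-2179 (referenced in the reply further down) aims to lift. The Person fields and the query are illustrative.

```scala
// Minimal Scala sketch (Spark 1.0): registering a table from a case class.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object RegisterTableExample {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("register-table"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit conversion RDD[Person] -> SchemaRDD

    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
    people.registerAsTable("people")   // Spark 1.0 API (later renamed registerTempTable)

    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)
    sc.stop()
  }
}
```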

Spark: All masters are unresponsive!

2014-07-08 Thread Sameer Tilak
Hi All, I am having a few issues with stability and scheduling. When I use spark shell to submit my application. I get the following error message and spark shell crashes. I have a small 4-node cluster for PoC. I tried both manual and scripts-based cluster set up. I tried using FQDN as well for

Re: Spark: All masters are unresponsive!

2014-07-08 Thread Akhil Das
Are you sure this is your master URL spark://pzxnvm2018:7077 ? You can look it up in the WebUI (mostly http://pzxnvm2018:8080) top left corner. Also make sure you are able to telnet pzxnvm2018 7077 from the machines where you are running the spark shell. Thanks Best Regards On Tue, Jul 8, 2014

Re: Spark Installation

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 4:07 AM, Srikrishna S srikrishna...@gmail.com wrote: Hi All, Does anyone know what the command line arguments to mvn are to generate the pre-built binary for Spark on Hadoop 2 / CDH5? I would like to pull in a recent bug fix in spark-master and rebuild the binaries in

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 2:01 AM, DB Tsai dbt...@dbtsai.com wrote: Actually, the one needed to install the jar to each individual node is standalone mode which works for both MR1 and MR2. Cloudera and Hortonworks currently support spark in this way as far as I know. (CDH5 uses Spark on YARN.)

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
This is on the roadmap for the next release (1.1) JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Mon, Jul 7, 2014 at 11:48 PM, Ionized ioni...@gmail.com wrote: The Java API requires a Java Class to register as table. // Apply a schema to an RDD of JavaBeans and

error when Spark accesses HDFS with Kerberos enabled

2014-07-08 Thread 许晓炜
Hi all, I encounter a strange issue when using Spark 1.0 to access HDFS with Kerberos. I have just one Spark test node, and HADOOP_CONF_DIR is set to the location containing the HDFS configuration files (hdfs-site.xml and core-site.xml). When I use spark-shell in local mode, the access

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
Hi Tobias, thanks for your help. I understand that with that code we obtain a database connection per partition, but I also suspect that with that code a new database connection is created per each execution of the function used as argument for mapPartitions(). That would be very inefficient

RE: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Shao, Saisai
I think you can maintain a connection pool or keep the connection as a long-lived object on the executor side (like lazily creating a singleton object inside object { } in Scala), so your task can get this connection each time it executes instead of creating a new one; that would be good for your
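
A minimal sketch of the pattern described above: a lazily initialized, per-executor-JVM connection holder that tasks reuse. The ConnectionPool object name, JDBC URL, and the commented usage are hypothetical placeholders, not a specific library API.

```scala
import java.sql.{Connection, DriverManager}

object ConnectionPool {
  // Initialized once per executor JVM, the first time a task touches it.
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "password")
}

// Inside a DStream/RDD operation, each task reuses the executor-local connection:
// dstream.foreachRDD { rdd =>
//   rdd.foreachPartition { records =>
//     val conn = ConnectionPool.connection   // no new connection per partition
//     records.foreach { r => /* write r using conn */ }
//   }
// }
```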

Re: Reading text file vs streaming text files

2014-07-08 Thread M Singh
Hi Akhil: Thanks for your response. Mans On Thursday, July 3, 2014 9:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Singh! For this use-case its better to have a Streaming context listening to that directory in hdfs where the files are being dropped and you can set the Streaming

Re: Java sample for using cassandra-driver-spark

2014-07-08 Thread M Singh
Hi Piotr: It would be great if we can have an api to support batch updates (counter + non-counter). Thanks Mans On Monday, July 7, 2014 11:36 AM, Piotr Kołaczkowski pkola...@datastax.com wrote: Hi, we're planning to add a basic Java-API very soon, possibly this week. There's a ticket

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and I only have rudimentary knowledge about Scala, how could I recreate in Java the lazy creation of a singleton object that you propose for Scala? Maybe a static class member in Java for the connection would be the solution?

Task's Scheduler Delay in web ui

2014-07-08 Thread haopu
What's the meaning of a Task's Scheduler Delay in the web ui? And what could cause that delay? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-s-Scheduler-Delay-in-web-ui-tp9019.html Sent from the Apache Spark User List mailing list archive at

Re: Graphx traversal and merge interesting edges

2014-07-08 Thread HHB
Hi Ankur, I was trying out the PatternMatcher. It works for smaller paths, but I see that for the longer ones it continues to run forever... Here's what I am trying: https://gist.github.com/hihellobolke/dd2dc0fcebba485975d1 (The example of 3 share traders transacting in appl shares) The first

Spark MapReduce job to work with Hive

2014-07-08 Thread Darq Moth
Please let me know if the following can be done in Spark. In terms of MapReduce I need: 1) Map function: 1.1) Get a Hive record. 1.2) Create a key from some fields of the record. Register my own key comparison function with the framework. This function will make the decision about key equality by

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Martin Gammelsæter
Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the BroadcastManager has one. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it is only Scala (it doesn't wrap a Java framework). All three have fairly similar APIs and aren't too different from Spark. For example, instead of RDD you have DList (distributed list) or PCollection (parallel collection) -

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I don't have those numbers off-hand. Though the shuffle spill to disk was coming to several gigabytes per node, if I recall correctly. The MapReduce pipeline takes about 2-3 hours I think for the full 60 day data set. Spark chugs along fine for awhile and then hangs. We restructured the flow a

Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Rahul Bhojwani
Hi, I wanted to use Naive Bayes for a text classification problem. I am using Spark 0.9.1. I was just curious to ask: is the Naive Bayes implementation in Spark 0.9.1 correct? Or are there any bugs in the Spark 0.9.1 implementation which are taken care of in Spark 1.0? My question is specific

How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Rahul Bhojwani
Hi, I am using the MLlib Naive Bayes for a text classification problem. I have a very small amount of training data, and then the data will be coming in continuously and I need to classify it as either A or B. I am training the MLlib Naive Bayes model using the training data, but next time when data

got java.lang.AssertionError when run sbt/sbt compile

2014-07-08 Thread bai阿蒙
Hi guys, when I try to compile the latest source with sbt/sbt compile, I get an error. Can anyone help me? The following is the detail: it may be caused by TestSQLContext.scala [error] [error] while compiling: /disk3/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Koert Kuipers
do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the

Re: Comparative study

2014-07-08 Thread Kevin Markey
When you say "large data sets", how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's

Scheduling in spark

2014-07-08 Thread rapelly kartheek
Hi, I am a postgraduate student, new to Spark. I want to understand how the Spark scheduler works. I just have a theoretical understanding of the DAG scheduler and the underlying task scheduler. I want to know, given a job submitted to the framework, how the scheduling happens after the DAG scheduler phase?

java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Konstantin Kudryavtsev
Hi all, I faced with the next exception during map step: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded) java.lang.reflect.Array.newInstance(Array.java:70)

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
We're doing a similar thing to launch Spark jobs in Tomcat, and I opened a JIRA for this. There are a couple of technical discussions there. https://issues.apache.org/jira/browse/SPARK-2100 In the end, we realized that Spark uses Jetty not only for the Spark WebUI, but also for distributing the jars and

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
I'll respond for Dan. Our test dataset was a total of 10 GB of input data (full production dataset for this particular dataflow would be 60 GB roughly). I'm not sure what the size of the final output data was but I think it was on the order of 20 GBs for the given 10 GB of input data. Also, I

Please add Talend to Powered By Spark page

2014-07-08 Thread Daniel Kulp
We are looking to add a note about Talend Open Studio's support for Spark components to the page at: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark Name: Talend Open Studio URL: http://www.talendforge.org/exchange/ Description: Talend Labs are building open source tooling

Re: NoSuchMethodError in KafkaReciever

2014-07-08 Thread Michael Chang
To be honest I'm a scala newbie too. I just copied it from createStream. I assume it's the canonical way to convert a java map (JMap) to a scala map (Map) On Mon, Jul 7, 2014 at 1:40 PM, mcampbell michael.campb...@gmail.com wrote: xtrahotsauce wrote I had this same problem as well. I

how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Konstantin Kudryavtsev
Hi all, sorry for the silly question, but how can I get a PairRDDFunctions RDD? I'm doing it to perform a leftOuterJoin afterwards. Currently I do it this way (it seems incorrect): val parRDD = new PairRDDFunctions( oldRdd.map(i => (i.key, i)) ) I guess this constructor is definitely wrong... Thank you,

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Sean Owen
If your RDD contains pairs, like an RDD[(String,Integer)] or something, then you get to use the functions in PairRDDFunctions as if they were declared on RDD. On Tue, Jul 8, 2014 at 6:25 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi all, sorry for fooly question, but

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Mark Hamstra
See Working with Key-Value Pairs http://spark.apache.org/docs/latest/programming-guide.html. In particular: In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import
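
A minimal sketch of the point made in both replies: with the SparkContext._ import in scope, no explicit PairRDDFunctions construction is needed. The Item case class and the joined RDD are illustrative stand-ins for the original poster's types.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings the implicit RDD -> PairRDDFunctions conversion into scope
import org.apache.spark.rdd.RDD

case class Item(key: String, value: Int) // hypothetical record type

def joinExample(sc: SparkContext, oldRdd: RDD[Item]) = {
  val pairs = oldRdd.map(i => (i.key, i))          // RDD[(String, Item)]
  val other = sc.parallelize(Seq(("a", 1), ("b", 2)))
  pairs.leftOuterJoin(other)                       // PairRDDFunctions method, available implicitly
}
```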

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I believe our full 60 days of data contains over ten million unique entities. Across 10 days I'm not sure, but it should be in the millions. I haven't verified that myself though. So that's the scale of the RDD we're writing to disk (each entry is entityId -> profile). I think it's hard to know

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
There is a difference between actual GC overhead, which can be reduced by reusing objects, and this error, which actually means you ran out of memory. This error can probably be relieved by increasing your executor heap size, unless your data is corrupt and it is allocating huge arrays, or you are

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
Hi Konstantin, I just ran into the same problem. I mitigated the issue by reducing the number of cores when I executed the job, which otherwise wouldn't be able to finish. Contrary to what many people believe, it might not mean that you were running out of memory. A better answer can be found here:

Re: Comparative study

2014-07-08 Thread Kevin Markey
It seems to me that you're not taking full advantage of the lazy evaluation, especially persisting to disk only.  While it might be true that the cumulative size of the RDDs looks like it's 300GB, only a small portion of that should be resident at any one time.  We've

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
This seems almost equivalent to a heap size error -- since GCs are stop-the-world events, the fact that we were unable to release more than 2% of the heap suggests that almost all the memory is *currently in use* (i.e., live). Decreasing the number of cores is another solution which decreases
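
A minimal sketch of the two knobs discussed above, set via SparkConf; the values are illustrative and spark.cores.max applies to standalone mode (appropriate sizes depend on the cluster).

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-tuning-example")
  .set("spark.executor.memory", "8g")  // larger executor heap
  .set("spark.cores.max", "8")         // fewer concurrent tasks, so less memory pressure per executor
val sc = new SparkContext(conf)
```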

Join two Spark Streaming

2014-07-08 Thread Bill Jay
Hi all, I am working on a pipeline that needs to join two Spark streams. The input is a stream of integers, and the output is the number of times each integer appears divided by the total number of unique integers. Suppose the input is: 1 2 3 1 2 2 There are 3 unique integers and 1 appears twice.
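
A batch-RDD sketch of the computation being described (not the streaming join itself), just to pin down the arithmetic; the sample input mirrors the message.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD operations (reduceByKey, mapValues)

val sc = new SparkContext(new SparkConf().setAppName("ratio-example"))
val nums = sc.parallelize(Seq(1, 2, 3, 1, 2, 2))

val counts = nums.map(n => (n, 1L)).reduceByKey(_ + _)   // (1,2), (2,3), (3,1)
val uniqueCount = counts.count()                          // 3 unique integers
val ratios = counts.mapValues(_.toDouble / uniqueCount)   // e.g. 1 -> 2/3, 2 -> 1.0, 3 -> 1/3

ratios.collect().foreach(println)
```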

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time. We agree

Further details on spark cluster set up

2014-07-08 Thread Sameer Tilak
Hi All, I used IP addresses in my scripts; spark-env.sh and slaves contain the IP addresses of the master and slave nodes respectively. However, I still have no luck. Here is the relevant log file snippet: Master node log: 14/07/08 10:56:19 ERROR EndpointWriter: AssociationError

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias

Re: Comparative study

2014-07-08 Thread Kevin Markey
Nothing particularly custom.  We've tested with small (4 node) development clusters, single-node pseudoclusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, Spark Yarn (client and cluster) modes, with total

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow doesn't complete. We banged our heads against it for a few weeks. -Suren On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com wrote: Nothing

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Also, our exact same flow but with 1 GB of input data completed fine. -Suren On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Dear All, When I look inside the following directory on my worker node: $SPARK_HOME/work/app-20140708110707-0001/3 I see the following error message: log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration). log4j:WARN Please initialize the log4j system

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
Hmm, looks like the Executor is trying to connect to the driver on localhost, from this line: 14/07/08 11:07:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@localhost:39701/user/CoarseGrainedScheduler What is your setup? Standalone mode with 4 separate machines? Are

Re: Spark Installation

2014-07-08 Thread Srikrishna S
Hi All, I tried the make distribution script and it worked well. I was able to compile the Spark binary on our CDH5 cluster. Once I compiled Spark, I copied over the binaries in the dist folder to all the other machines in the cluster. However, I ran into an issue while submitting a job in

Re: error when Spark accesses HDFS with Kerberos enabled

2014-07-08 Thread Marcelo Vanzin
Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote: Hi all, I encounter a strange issue when using spark 1.0 to access hdfs with Kerberos I

Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Hi, I am getting this error. Can anyone explain why this error is occurring? Exception in thread delete Spark temp dir C:\Users\shawn\AppData\Local\Temp\spark-27f60467-36d4-4081-aaf5-d0ad42dda560 java.io.IOException: Failed to delete:

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
This is generally a side effect of your executor being killed. For example, Yarn will do that if you're going over the requested memory limits. On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: HI, I am getting this error. Can anyone help out to explain why is

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Hi Marcelo. Thanks for the quick reply. Can you suggest how to increase the memory limits or how to tackle this problem? I am a novice. If you want, I can post my code here. Thanks On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin van...@cloudera.com wrote: This is generally a side effect of

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread sstilak
Hi Aaron, I have 4 nodes - 1 master and 3 workers. I am not setting up the driver public DNS name anywhere. I didn't see that step in the documentation -- maybe I missed it. Can you please point me in the right direction? Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Note I didn't say that was your problem - it would be if (i) you're running your job on Yarn and (ii) you look at the Yarn NodeManager logs and see that it's actually killing your process. I just said that the exception shows up in those kinds of situations. You haven't provided enough

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
We kind of hijacked Santos' original thread, so apologies for that and let me try to get back to Santos' original question on Map/Reduce versus Spark. I would say it's worth migrating from M/R, with the following thoughts. Just my opinion but I would summarize the latest emails in this thread as

Re: Comparative study

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Libraries like Scoobi, Scrunch and Scalding (and their associated Java versions) provide a Spark-like wrapper around Map/Reduce but my guess is that, since they are limited to Map/Reduce under the covers, they

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Here I am adding my code. If you can have a look to help me out. Thanks ### import tokenizer import gettingWordLists as gl from pyspark.mllib.classification import NaiveBayes from numpy import array from pyspark import SparkContext, SparkConf conf =

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
I have pasted the logs below: PS F:\spark-0.9.1\codes\sentiment analysis pyspark .\naive_bayes_analyser.py Running python with PYTHONPATH=F:\spark-0.9.1\spark-0.9.1\bin\..\python; SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in

Re: Scheduling in spark

2014-07-08 Thread Sujeet Varakhedi
This is a good start: http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I am a post graduate student, new to spark. I want to understand how Spark scheduler works. I just have theoretical

[Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Hi there! 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet file) back to an RDD of a given case class? 2/ Even better, is there a way to get the schema information from a SchemaRDD? I am trying to figure out how to properly get the various fields of the Rows of a

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5TB/node and stuffed

Re: Scheduling in spark

2014-07-08 Thread Andrew Or
Here's the most updated version of the same page: http://spark.apache.org/docs/latest/job-scheduling 2014-07-08 12:44 GMT-07:00 Sujeet Varakhedi svarakh...@gopivotal.com: This is a good start: http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html On Tue, Jul 8, 2014 at 9:11

Re: Powered By Spark: Can you please add our org?

2014-07-08 Thread Reynold Xin
I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio adgau...@gmail.com wrote: Hi, Sailthru is also using Spark. Could you please add us to the Powered By Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page when you have a chance?

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Xiangrui Meng
Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open a JIRA for Naive Bayes. For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Ionized
Thanks for the heads-up. In the meantime, we'd like to test this out ASAP - are there any open PR's we could take to try it out? (or do you have an estimate on when some will be available?) On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust mich...@databricks.com wrote: This is on the roadmap

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Xiangrui Meng
Well, I believe this is a correct implementation but please let us know if you run into problems. The NaiveBayes implementation in MLlib v1.0 supports sparse data, which is usually the case for text classification. I would recommend upgrading to v1.0. -Xiangrui On Tue, Jul 8, 2014 at 7:20 AM,
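
A minimal sketch of training NaiveBayes on sparse feature vectors with MLlib 1.0; the feature size, indices, and lambda value are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(new SparkConf().setAppName("nb-sparse-example"))
val numFeatures = 1000  // fixed vocabulary / hashing-space size

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(1, 42), Array(2.0, 1.0))),
  LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(7, 99), Array(1.0, 3.0)))
))

val model = NaiveBayes.train(training, lambda = 1.0)
val prediction = model.predict(Vectors.sparse(numFeatures, Array(42), Array(1.0)))
```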

Re: got java.lang.AssertionError when run sbt/sbt compile

2014-07-08 Thread Xiangrui Meng
try sbt/sbt clean first On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 smallmonkey...@hotmail.com wrote: Hi guys, when i try to compile the latest source by sbt/sbt compile, I got an error. Can any one help me? The following is the detail: it may cause by TestSQLContext.scala [error] [error]

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
Yin (cc-ed) is working on it as we speak. We'll post to the JIRA as soon as a PR is up. On Tue, Jul 8, 2014 at 1:04 PM, Ionized ioni...@gmail.com wrote: Thanks for the heads-up. In the meantime, we'd like to test this out ASAP - are there any open PR's we could take to try it out? (or do

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I think we're missing the point a bit. Everything was actually flowing through smoothly and in a reasonable time. Until it reached the last two tasks (out of over a thousand in the final stage alone), at which point it just fell into a coma. Not so much as a cranky message in the logs. I don't

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Michael Armbrust
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet file) back to a RDD of a given case class? There may be someday, but doing so will either require a lot of reflection or a
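
Until such support exists, a common workaround is to treat the SchemaRDD as an RDD of Rows and repack the fields by position. A hedged sketch under that assumption; the Person case class, the parquet path, and the field order/types are illustrative.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

def toPeople(sqlContext: SQLContext, parquetPath: String) = {
  val schemaRdd = sqlContext.parquetFile(parquetPath)   // SchemaRDD is an RDD[Row]
  schemaRdd.map(row => Person(row.getString(0), row.getInt(1)))
}
```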

Re: Comparative study

2014-07-08 Thread Aaron Davidson
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. +1 @Reynold Spark can handle big big data. There are known issues with informing the user about what went wrong

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Xiangrui Meng
1) The feature dimension should be a fixed number before you run NaiveBayes. If you use bag of words, you need to handle the word-to-index dictionary by yourself. You can either ignore the words that never appear in training (because they have no effect in prediction), or use hashing to randomly
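
A minimal sketch of the hashing idea mentioned above: map each word to a fixed-size feature index so the dimension stays constant even for words never seen in training. The numFeatures value and the modulo scheme are illustrative choices.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.mutable

def hashedBagOfWords(words: Seq[String], numFeatures: Int = 1000): Vector = {
  val counts = mutable.Map[Int, Double]().withDefaultValue(0.0)
  words.foreach { w =>
    // hash each word into [0, numFeatures), handling negative hashCodes
    val idx = ((w.hashCode % numFeatures) + numFeatures) % numFeatures
    counts(idx) += 1.0
  }
  val (indices, values) = counts.toSeq.sortBy(_._1).unzip
  Vectors.sparse(numFeatures, indices.toArray, values.toArray)
}
```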

Re: Help for the large number of the input data files

2014-07-08 Thread Xiangrui Meng
You can either use sc.wholeTextFiles and then a flatMap to reduce the number of partitions, or give more memory to the driver process by using --driver-memory 20g and then call RDD.repartition(small number) after you load the data in. -Xiangrui On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun
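
A sketch of the two options described above; the paths, partition count, and line-splitting logic are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("many-small-files"))

// Option 1: read whole files as (path, content) pairs, then flatMap into records.
val records = sc.wholeTextFiles("hdfs:///data/small-files/")
  .flatMap { case (_, content) => content.split("\n") }

// Option 2: load normally (with a larger driver heap, e.g. --driver-memory 20g)
// and then collapse to a small number of partitions.
val collapsed = sc.textFile("hdfs:///data/small-files/*").repartition(100)
```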

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Cool, thanks Michael! Message sent from a mobile device - excuse typos and abbreviations. On July 8, 2014, at 22:17, Michael Armbrust [via Apache Spark User List] ml-node+s1001560n9084...@n3.nabble.com wrote: On Tue, Jul 8, 2014 at 12:43 PM, Pierre B [hidden email] wrote: 1/ Is there a way

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Rahul Bhojwani
Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com wrote: Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open a JIRA for Naive Bayes. For Naive Bayes, we need to

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Hi Rahul, Can you try calling sc.close() at the end of your program, so Spark can clean up after itself? On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Here I am adding my code. If you can have a look to help me out. Thanks ### import

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Rahul Bhojwani
Thanks a lot Xiangrui for the help. On Wed, Jul 9, 2014 at 1:39 AM, Xiangrui Meng men...@gmail.com wrote: Well, I believe this is a correct implementation but please let us know if you run into problems. The NaiveBayes implementation in MLlib v1.0 supports sparse data, which is usually the

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Rahul Bhojwani
Thanks Xiangrui. You have solved almost all my problems :) On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng men...@gmail.com wrote: 1) The feature dimension should be a fixed number before you run NaiveBayes. If you use bag of words, you need to handle the word-to-index dictionary by yourself.

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Sorry, that would be sc.stop() (not close). On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Rahul, Can you try calling sc.close() at the end of your program, so Spark can clean up after itself? On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be what did I do

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Thanks Marcelo. I was having another problem. My code was running properly and then it suddenly stopped with the error: java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.<init>(Unknown Source) at

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Have you tried the obvious (increase the heap size of your JVM)? On Tue, Jul 8, 2014 at 2:02 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Thanks Marcelo. I was having another problem. My code was running properly and then it suddenly stopped with the error:

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Hi Aaron, Would really appreciate your help if you can point me to the documentation. Is this something that I need to do with /etc/hosts on each of the worker machines? Or do I set SPARK_PUBLIC_DNS (if yes, what is the format?) or something else? I have the following set up: master node:

Re: Comparative study

2014-07-08 Thread Robert James
As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala is unarguable. But the experience of deploying Spark has been quite painful, mainly about gaps between compile time and

spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-08 Thread Andrew Lee
Build: Spark 1.0.0 rc11 (git commit tag: 2f1dc868e5714882cf40d2633fb66772baf34789) Hi All, When I enabled spark-defaults.conf for the eventLog, spark-shell broke while spark-submit works. I'm trying to create a separate directory per user to keep track of their own Spark job event

Re: Spark: All masters are unresponsive!

2014-07-08 Thread Andrew Or
It seems that your driver (which I'm assuming you launched on the master node) can now connect to the Master, but your executors cannot. Did you make sure that all nodes have the same conf/spark-defaults.conf, conf/spark-env.sh, and conf/slaves? It would be good if you can post the stderr of the

Re: Spark job tracker.

2014-07-08 Thread abhiguruvayya
Hello Mayur, How can I implement the methods mentioned below? If you have any clue on this, please let me know. public void onJobStart(SparkListenerJobStart arg0) { } @Override public void onStageCompleted(SparkListenerStageCompleted arg0) { }
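
The stubs above are the Java callbacks; a minimal Scala sketch of the same pattern is shown below, assuming an existing SparkContext named sc. The println bodies and the listener name are illustrative.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

class JobTrackingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart) {
    println("Job started: " + jobStart.jobId)
  }
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    println("Stage completed: " + stageCompleted.stageInfo.name)
  }
}

// Registration (sc is an existing SparkContext):
// sc.addSparkListener(new JobTrackingListener)
```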

Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi all, I used sbt to package a code that uses spark-streaming-kafka. The packaging succeeded. However, when I submitted to yarn, the job ran for 10 seconds and there was an error in the log file as follows: Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$

issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Hi! I've been using Spark compiled from 1.0 branch at some point (~2 month ago). The setup is a standalone cluster with 4 worker machines and 1 master machine. I used to run spark shell like this: ./bin/spark-shell -c 30 -em 20g -dm 10g Today I've finally updated to Spark 1.0 release. Now I

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
Hi Mikhail, It looks like the documentation is a little out-dated. Neither is true anymore. In general, we try to shift away from short options (-em, -dm etc.) in favor of more explicit ones (--executor-memory, --driver-memory). These options, and --cores, refer to the arguments passed in to

Re: Cannot create dir in Tachyon when running Spark with OFF_HEAP caching (FileDoesNotExistException)

2014-07-08 Thread Teng Long
More updates: Seems in TachyonBlockManager.scala (line 118) of Spark 1.1.0, the TachyonFS.mkdir() method is called, which creates a directory in Tachyon. Right after that, TachyonFS.getFile() method is called. In all the versions of Tachyon I tried (0.4.1, 0.4.0), the second method will return a

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Thanks Andrew, ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g works well, just wanted to make sure that I'm not missing anything -- View this message in context:

Re: Spark-streaming-kafka error

2014-07-08 Thread Tobias Pfeiffer
Bill, have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0" into your application jar? If I remember correctly, it's not bundled with the downloadable compiled version of Spark. Tobias On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I
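
A build.sbt sketch of bundling the connector, assuming a fat-jar plugin such as sbt-assembly builds the application jar; the versions mirror the thread (Spark 1.0.0, Scala 2.10) and are illustrative.

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10"            % "1.0.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10"       % "1.0.0" % "provided",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0"   // must go into the application (fat) jar
)
```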

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Tobias Pfeiffer
Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data.

Re: Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi Tobias, Currently, I do not bundle any dependencies into my application jar. I will try that. Thanks a lot! Bill On Tue, Jul 8, 2014 at 5:22 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, have you packaged org.apache.spark % spark-streaming-kafka_2.10 % 1.0.0 into your

Spark Streaming using File Stream in Java

2014-07-08 Thread Aravind
Hi all, I am trying to run the NetworkWordCount.java file in the streaming examples. The example shows how to read from a network socket. But my use case is that I have a local log file which is a stream and continuously updated (say /Users/.../Desktop/mylog.log). I would like to write the same
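
One caveat worth noting: Spark Streaming's file stream watches a directory for new files rather than tailing a single appended file, so the usual pattern is to drop (or atomically move) complete log files into a monitored directory. A minimal Scala sketch under that assumption; the paths, batch interval, and word-count logic are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (reduceByKey)

object FileStreamWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("file-stream-word-count")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.textFileStream("hdfs:///logs/incoming/")  // a monitored directory, not a single file
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```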

Re: Comparative study

2014-07-08 Thread Keith Simmons
Santosh, To add a bit more to what Nabeel said, Spark and Impala are very different tools. Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop

RE: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Shao, Saisai
Yes, that would be the Java equivalent of using a static class member, but you should program carefully to prevent resource leakage. A good choice is to use a third-party DB connection library which supports connection pooling; that will alleviate your programming efforts. Thanks Jerry From: Juan

Spark Streaming and Storm

2014-07-08 Thread xichen_tju@126
Hi all, I am a newbie to Spark Streaming, and used Storm before. Have you tested the performance of both of them, and which one is better? xichen_tju@126
