Re: A question about streaming throughput

2014-10-15 Thread Sean Owen
Hm, is this not just showing that you're rate-limited by how fast you can get events to the cluster? You have more of a network bottleneck between the data source and the cluster in the cloud than with your local cluster. On Tue, Oct 14, 2014 at 9:44 PM, danilopds danilob...@gmail.com wrote: Hi, I'm learning

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-15 Thread Sean Owen
The problem is not ReduceWords, since it is already Serializable by implementing Function2. Indeed the error tells you just what is unserializable: KafkaStreamingWordCount, your driver class. Something is causing a reference to the containing class to be serialized in the closure. The best fix is
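A minimal sketch of the kind of fix Sean describes (the names below are illustrative, not taken from the original code): keep the functions in a top-level object so the closure does not capture the enclosing, non-serializable driver class.

    // Top-level object: referencing this from a closure captures nothing
    // from the driver class, so nothing unserializable is pulled in.
    object WordCountFunctions extends Serializable {
      val reduceWords: (Int, Int) => Int = _ + _
    }

    // In the streaming job, use the object's function instead of a method
    // defined on the driver class:
    // val counts = pairs.reduceByKey(WordCountFunctions.reduceWords)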

adding element into MutableList throws an error type mismatch

2014-10-15 Thread Henry Hung
Hi All, Could someone shed light on why adding an element to a MutableList can result in a type mismatch, even though I'm sure that the class type is right? Below is the sample code I ran in the Spark 1.0.2 console; at the last line there is a type mismatch error: Welcome to

Re: adding element into MutableList throws an error type mismatch

2014-10-15 Thread Sean Owen
Another instance of https://issues.apache.org/jira/browse/SPARK-1199 , fixed in subsequent versions. On Wed, Oct 15, 2014 at 7:40 AM, Henry Hung ythu...@winbond.com wrote: Hi All, Could someone shed light on why adding an element to a MutableList can result in a type mismatch, even if

Re: Initial job has not accepted any resources when launching SparkPi example on a worker.

2014-10-15 Thread Theodore Si
Can anyone help me, please? On 10/14/2014 9:58 PM, Theodore Si wrote: Hi all, I have two nodes, one as master(*host1*) and the other as worker(*host2*). I am using the standalone mode. After starting the master on host1, I run $ export MASTER=spark://host1:7077 $ bin/run-example SparkPi 10 on

Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Prashant Sharma
[Removing dev lists] You are absolutely correct about that. Prashant Sharma On Tue, Oct 14, 2014 at 5:03 PM, Priya Ch learnings.chitt...@gmail.com wrote: Hi Spark users/experts, In the Spark source code (Master.scala, Worker.scala), when registering the worker with the master, I see the usage

Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread Sean Owen
You say you use reduceByKey, but are you really collecting all the tuples for a vehicle into a collection, which is what groupByKey already does? Yes, if one vehicle has a huge amount of data that could fail. Otherwise perhaps you are simply not increasing memory from the default. Maybe you can consider

Re: submitted uber-jar not seeing spark-assembly.jar at worker

2014-10-15 Thread Sean Owen
How did you recompile and deploy Spark to your cluster? It sounds like a problem with not getting the assembly deployed correctly, rather than with your app. On Tue, Oct 14, 2014 at 10:35 PM, Tamas Sandor tsan...@gmail.com wrote: Hi, I'm rookie in spark, but hope someone can help me out. I'm

Re: system.out.println with --master yarn-cluster

2014-10-15 Thread vishnu86
Examine the output (replace $YARN_APP_ID in the following with the application identifier output by the previous command) (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.) $ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_01/stdout.

Re: Spark can't find jars

2014-10-15 Thread Christophe Préaud
Hi Jimmy, Did you try my patch? The problem on my side was that hadoop.tmp.dir (in hadoop core-site.xml) was not handled properly by Spark when it is set on multiple partitions/disks, i.e.: <property><name>hadoop.tmp.dir</name>

Unit testing jar request

2014-10-15 Thread Jean Charles Jabouille
Hi, we are Spark users and we use some of Spark's test classes for our own application unit tests, namely LocalSparkContext and SharedSparkContext. But these classes are not included in the spark-core library. That is a reasonable choice, as it's not a good idea to include test classes in the runtime
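Since the test helpers are not published, a common workaround is to copy a small equivalent into your own test sources. A minimal sketch, assuming ScalaTest (the trait and names below are illustrative, not the Spark originals):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, Suite}

    // Mixes a shared local SparkContext into a test suite.
    trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
      @transient var sc: SparkContext = _

      override def beforeAll() {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
        super.beforeAll()
      }

      override def afterAll() {
        if (sc != null) sc.stop()
        super.afterAll()
      }
    }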

Spark on secure HDFS

2014-10-15 Thread Erik van oosten
Hi, We really would like to use Spark but we can’t because we have a secure HDFS environment (Cloudera). I understood https://issues.apache.org/jira/browse/SPARK-2541 contains a patch. Can one of the committers please take a look? Thanks! Erik. — Erik van Oosten

Spark Concepts

2014-10-15 Thread nsareen
Hi, I'm pretty new to both Big Data and Spark. I've just started POC work on Spark, and my team and I are evaluating it against other in-memory computing tools such as GridGain, BigMemory, Aerospike and some others, specifically to solve two sets of problems. 1) Data Storage: Our current application runs

Re: Spark output to s3 extremely slow

2014-10-15 Thread Rafal Kwasny
Hi, How large is the dataset you're saving into S3? Saving to S3 is actually done in two steps: 1) writing temporary files 2) committing them to the proper directory. Step 2 can be slow because S3 does not have a quick atomic move operation; you have to copy (server-side, but it still takes time) and then

[SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-15 Thread Earthson
I don't know why JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found that DataTypeConversions is protected[sql]. Finally I found this solution: jrdd.registerTempTable("transform_tmp"); jrdd.sqlContext.sql("select * from transform_tmp"). Could anyone tell me

Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Chitturi Padma
Which means the details are not persisted, and hence after any failures in the workers or the master the daemons wouldn't restart normally... right? On Wed, Oct 15, 2014 at 12:17 PM, Prashant Sharma [via Apache Spark User List] ml-node+s1001560n16468...@n3.nabble.com wrote: [Removing dev lists] You are

Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Prashant Sharma
So if you need those features, you can go ahead and set up either the filesystem or the ZooKeeper option. Please take a look at: http://spark.apache.org/docs/latest/spark-standalone.html. Prashant Sharma On Wed, Oct 15, 2014 at 3:25 PM, Chitturi Padma learnings.chitt...@gmail.com wrote: which means
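For reference, that page describes setting the recovery options through SPARK_DAEMON_JAVA_OPTS; a sketch of the FILESYSTEM variant in spark-env.sh (the recovery directory path is just an example):

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/var/spark/recovery"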

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread Akhil Das
I just ran the same code and it is running perfectly fine on my machine. These are the things on my end: - Spark version: 1.1.0 - Gave full path to the negative and positive files - Set twitter auth credentials in the environment. And here's the code: import org.apache.spark.SparkContext

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-15 Thread Gen
What results do you want? If your pair is like (a, b), where a is the key and b is the value, you can try rdd1 = rdd1.flatMap(lambda l: l) and then use cogroup. Best Gen -- View this message in context:

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread S Krishna
Hi, I am using 1.1.0. I did set my twitter credentials and I am using the full path. I did not paste this in the public post. I am running on a cluster and getting the exception. Are you running in local or standalone mode? Thanks On Oct 15, 2014 3:20 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread Akhil Das
I ran it in both local and standalone mode, and it worked for me. It does throw a bind exception, which is normal since we are using both a SparkContext and a StreamingContext. Thanks Best Regards On Wed, Oct 15, 2014 at 5:25 PM, S Krishna skrishna...@gmail.com wrote: Hi, I am using 1.1.0. I did set my

Re: jsonRDD: NoSuchMethodError

2014-10-15 Thread Michael Campbell
How did you resolve it? On Tue, Jul 15, 2014 at 3:50 AM, SK skrishna...@gmail.com wrote: The problem is resolved. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRDD-NoSuchMethodError-tp9688p9742.html Sent from the Apache Spark User List

SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-15 Thread Hao Ren
Hi, The following query in the SparkSQL 1.1.0 CLI doesn't work: SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse ; create table test as select v1.*, v2.card_type, v2.card_upgrade_time_black, v2.card_upgrade_time_gold from customer v1 left join customer_loyalty v2 on v1.account_id =

Problem executing Spark via JBoss application

2014-10-15 Thread Mehdi Singer
Hi, I have a Spark standalone example application which is working fine. I'm now trying to integrate this application into a J2EE application, deployed on JBoss 7.1.1 and accessed via a web service. The JBoss server is installed on my local machine (Windows 7) and the Spark master is remote

How to close resources shared in executor?

2014-10-15 Thread Fengyun RAO
In order to share an HBase connection pool, we create an object: object Util { val HBaseConf = HBaseConfiguration.create; val Connection = HConnectionManager.createConnection(HBaseConf) } which would be shared among tasks on the same executor, e.g. val result = rdd.map(line => { val table

Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread manasdebashiskar
It is wonderful to see some ideas. Now the questions: 1) What is a track segment? Ans) It is the line that contains two adjacent points when all points are arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3). Then the segments are (p1, p2), (p2, p3) when the time ordering is (t1
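A minimal sketch of turning such time-ordered points into segments per vehicle, assuming records keyed as (vehicleId, (time, point)); the names and types are illustrative, not from the thread:

    case class Point(lat: Double, lon: Double)

    // records: RDD[(String, (Long, Point))] keyed by vehicle id
    val segments = records
      .groupByKey()                                    // all points for one vehicle
      .mapValues { pts =>
        pts.toSeq.sortBy(_._1)                         // order by time
           .map(_._2)
           .sliding(2)                                 // adjacent pairs
           .collect { case Seq(p1, p2) => (p1, p2) }   // segments (p1, p2), (p2, p3), ...
           .toSeq
      }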

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Fengyun RAO
+user@hbase 2014-10-15 20:48 GMT+08:00 Fengyun RAO raofeng...@gmail.com: We use Spark 1.1 and HBase 0.98.1-cdh5.1.0, and need to read and write an HBase table in a Spark program. I notice there are spark.driver.extraClassPath and spark.executor.extraClassPath properties to manage extra

Re: A question about streaming throughput

2014-10-15 Thread danilopds
Ok, I understand. But in both cases the data are on the same processing node. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/A-question-about-streaming-throughput-tp16416p16501.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to close resources shared in executor?

2014-10-15 Thread Ted Yu
Pardon me - there was a typo in the previous email. Calling table.close() is the recommended approach. HConnectionManager does reference counting. When all references to the underlying connection are gone, the connection will be released. Cheers On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu
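A minimal sketch of the pattern being described, building on the shared Util object from the earlier message (the table name and the read are illustrative):

    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes

    val result = rdd.map { line =>
      val table = Util.Connection.getTable("user")     // cheap: backed by the shared connection
      try {
        table.get(new Get(Bytes.toBytes(line)))        // example read
      } finally {
        table.close()                                  // releases the table; the ref-counted
      }                                                // connection itself stays open
    }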

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Soumitra Kumar
I am writing to HBase, following are my options: export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar spark-submit \ --jars

Re: Spark Worker crashing and Master not seeing recovered worker

2014-10-15 Thread Malte
This is still happening to me on mesos. Any workarounds? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-crashing-and-Master-not-seeing-recovered-worker-tp2312p16506.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread Sean Owen
It looks like you're creating the StreamingContext and SparkContext separately from the same conf. Instead, how about passing the SparkContext to the StreamingContext constructor? It seems like better practice and is my guess at the cause of the problem. On Tue, Oct 14, 2014 at 9:13 PM, SK
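A minimal sketch of that suggestion (app name and batch interval are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TwitterSentiment")
    val sc   = new SparkContext(conf)
    // Reuse the existing SparkContext instead of new StreamingContext(conf, ...),
    // so only one SparkContext is ever created.
    val ssc  = new StreamingContext(sc, Seconds(10))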

matrix operations?

2014-10-15 Thread ll
hi there... are there any other matrix operations in addition to multiply()? Like addition or a dot product? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/matrix-operations-tp16508.html Sent from the Apache Spark User List mailing list archive at

RowMatrix.multiply() ?

2014-10-15 Thread ll
hi.. it looks like RowMatrix.multiply() takes a local Matrix as a parameter and returns the result as a distributed RowMatrix. How do you perform this series of multiplications if A, B, C, and D are all RowMatrix? (((A x B) x C) x D) thanks! -- View this message in context:

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-15 Thread Terry Siu
Hi Yin, pqt_rdt_snappy has 76 columns. These two parquet tables were created via Hive 0.12 from existing Avro data using CREATE TABLE following by an INSERT OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition while pqt_segcust_snappy has two partitions. For

Re: SPARK_SUBMIT_CLASSPATH question

2014-10-15 Thread Greg Hill
I guess I was a little light on the details in my haste. I'm using Spark on YARN, and this is in the driver process in yarn-client mode (most notably spark-shell). I've had to manually add a bunch of JARs that I had thought it would just pick up like everything else does: export

Re: Problem executing Spark via JBoss application

2014-10-15 Thread Yana Kadiyska
From this line: "Removing executor app-20141015142644-0125/0 because it is EXITED", I would guess that you need to examine the executor log to see why the executor actually exited. My guess would be that the executor cannot connect back to your driver. But check the log from the executor. It should

Serialize/deserialize Naive Bayes model and index files

2014-10-15 Thread jatinpreet
Hi, I am trying to persist the files generated as a result of Naive Bayes training with MLlib. These comprise the model file, a label index (own class) and a term dictionary (own class). I need to save them to an HDFS location and then deserialize them when needed for prediction. How can I do the same
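One common workaround in Spark 1.1 (before dedicated model save/load APIs) is to write the serializable objects to HDFS as object files; a sketch, assuming the model and your index/dictionary classes are Serializable, with illustrative paths:

    import org.apache.spark.mllib.classification.NaiveBayesModel

    // save
    sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///models/naive-bayes")

    // load later for prediction
    val restored = sc.objectFile[NaiveBayesModel]("hdfs:///models/naive-bayes").first()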

spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Anurag Tangri
Hi, I compiled Spark 1.1.0 with CDH 4.6, but when I try to bring the spark-sql CLI up, it gives an error: == [atangri@pit-uat-hdputil1 bin]$ ./spark-sql Spark assembly has been built with Hive, including Datanucleus jars on classpath Java HotSpot(TM) 64-Bit Server VM warning: ignoring option

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Anurag Tangri
I see that the Hive 0.10.0 metastore SQL does not have a VERSION table, but Spark is looking for it. Has anyone else faced this issue, or any ideas on how to fix it? Thanks, Anurag Tangri On Wed, Oct 15, 2014 at 10:51 AM, Anurag Tangri atan...@groupon.com wrote: Hi, I compiled spark 1.1.0 with CDH 4.6

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Marcelo Vanzin
Hi Anurag, Spark SQL (from the Spark standard distribution / sources) currently requires Hive 0.12; as you mention, CDH4 has Hive 0.10, so that's not gonna work. CDH 5.2 ships with Spark 1.1.0 and is modified so that Spark SQL can talk to the Hive 0.13.1 that is also bundled with CDH, so if

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Anurag Tangri
Hi Marcelo, Exactly. Found it a few minutes ago. I ran the MySQL Hive 12 SQL on my Hive 10 metastore, which created the missing tables, and it seems to be working now. Not sure whether everything else in CDH 4.6/Hive 10 would still work, though. Looks like we cannot use Spark SQL in a clean

Exception while reading SendingConnection to ConnectionManagerId

2014-10-15 Thread Jimmy Li
Hi there, I'm running spark on ec2, and am running into an error there that I don't get locally. Here's the error: 11335 [handle-read-write-executor-3] ERROR org.apache.spark.network.SendingConnection - Exception while reading SendingConnection to ConnectionManagerId([IP HERE])

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread SK
You are right. Creating the StreamingContext from the SparkContext instead of SparkConf helped. Thanks for the help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Sentiment-Analysis-of-Twitter-streams-tp16410p16520.html Sent from the

Spark tasks still scheduled after Spark goes down

2014-10-15 Thread pkl
Hi, My setup: Tomcat (running a web app which initializes a SparkContext) and a dedicated Spark cluster (1 master, 2 workers, 1 VM each). I am able to properly start this setup, where the SparkContext properly initializes a connection with the master. I am able to execute tasks and perform required

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-15 Thread Abraham Jacob
Hi All, I figured out what the problem was. Thank you Sean for pointing me in the right direction. All the jibber jabber about an empty DStream / RDD was just pure nonsense. I guess the sequence of events (the fact that spark streaming started crashing just after I implemented the

Getting the value from DStream[Int]

2014-10-15 Thread SK
Hi, As a result of a reduction operation, the resultant value score is a DStream[Int]. How can I get the simple Int value? I tried score[0] and score._1, but neither worked, and I can't find a getValue() in the DStream API. thanks -- View this message in context:
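There is no single Int to extract, since a DStream is a sequence of RDDs (one per batch); a minimal sketch of getting at the value per batch, assuming score is the DStream[Int] from the question:

    score.foreachRDD { rdd =>
      rdd.take(1).foreach { value =>        // at most one element per batch after the reduction
        println("score for this batch: " + value)
      }
    }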

Spark Streaming is slower than Spark

2014-10-15 Thread Tarun Garg
Hi, I am evaluating Spark Streaming with Kafka and I found that Spark Streaming is slower than Spark. It takes more time to process the same amount of data; as per the Spark console it can process 2300 records per second. Is my assumption correct? Spark Streaming has to do a lot of this

Re: SPARK_SUBMIT_CLASSPATH question

2014-10-15 Thread Marcelo Vanzin
Hi Greg, I'm not sure exactly what it is that you're trying to achieve, but I'm pretty sure those variables are not supposed to be set by users. You should take a look at the documentation for spark.driver.extraClassPath and spark.driver.extraLibraryPath, and the equivalent options for executors.
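For reference, a sketch of those properties set in conf/spark-defaults.conf (the jar paths are placeholders):

    spark.driver.extraClassPath    /opt/extra/lib/some-dep.jar
    spark.executor.extraClassPath  /opt/extra/lib/some-dep.jar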

how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-15 Thread eric wong
Hi, I want to check the DEBUG log of the Spark executor on YARN (using yarn-cluster mode), but 1. yarn daemonlog setlevel DEBUG YarnChild.class 2. setting log4j.properties in the spark/conf folder on the client node -- neither of the approaches above works. So how can I set the log level of the Spark executor on a YARN container

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-15 Thread Marcelo Vanzin
Hi Eric, Check the Debugging Your Application section at: http://spark.apache.org/docs/latest/running-on-yarn.html Long story short: upload your log4j.properties using the --files argument of spark-submit. (Mental note: we could make the log level configurable via a system property...) On
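A sketch of the --files approach referenced above (the application details are placeholders); the uploaded log4j.properties can set, e.g., log4j.rootCategory=DEBUG, console:

    spark-submit --master yarn-cluster \
      --files /path/to/log4j.properties \
      --class com.example.MyApp myapp.jar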

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-15 Thread Ray
Hi Xiangrui, I am using yarn-cluster mode. The current Hadoop cluster is configured to only accept yarn-cluster mode and not allow yarn-client mode. I have no privilege to change that. Without initializing with k-means||, the job finished in 10 minutes. With k-means||, it just hangs there for
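If skipping the k-means|| initialization is acceptable, a sketch of requesting random initialization explicitly with MLlib's KMeans (the k and iteration values are illustrative):

    import org.apache.spark.mllib.clustering.KMeans

    val model = new KMeans()
      .setK(50)
      .setMaxIterations(20)
      .setInitializationMode(KMeans.RANDOM)   // random init instead of the default k-means||
      .run(vectors)                           // vectors: RDD[org.apache.spark.mllib.linalg.Vector]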

Play framework

2014-10-15 Thread Mohammed Guller
Hi - Has anybody figured out how to integrate a Play application with Spark and run it on a Spark cluster using the spark-submit script? I have seen some blogs about creating a simple Play app and running it locally on a dev machine with the sbt run command. However, those steps don't work for

Sample codes for Spark streaming + Kafka + Scala + sbt?

2014-10-15 Thread Gary Zhao
Hi, can anyone share a project as a sample? I tried a couple of days ago but couldn't make it work. It looks like it's due to some Kafka dependency issue. I'm using sbt-assembly. Thanks Gary
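A minimal sketch of such a project, not a tested build: add "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0" to build.sbt (plus sbt-assembly merge-strategy settings for the duplicate files Kafka pulls in), then something like the following, where the ZooKeeper quorum, group and topic are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc   = new StreamingContext(new SparkConf().setAppName("KafkaSample"), Seconds(5))
    val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "sample-group", Map("mytopic" -> 1))
                          .map(_._2)                  // drop the message key, keep the payload
    lines.print()
    ssc.start()
    ssc.awaitTermination()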

Spark's shuffle file size keep increasing

2014-10-15 Thread Haopu Wang
I have a Spark application which is running Spark Streaming and Spark SQL. I observed that the size of the shuffle files under the spark.local.dir folder keeps increasing and never decreases. Eventually it runs into an out-of-disk-space error. The question is: when will Spark delete these shuffle files? In the

Re: Spark Concepts

2014-10-15 Thread nsareen
Anybody with good hands-on experience with Spark, please do reply. It would help us a lot!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477p16536.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-15 Thread neeraj
I would like to reiterate that I don't have Hive installed on the Hadoop cluster. I have some queries on the following comment from Cheng Lian-2: The Thrift server is used to interact with existing Hive data, and thus needs the Hive Metastore to access the Hive catalog. In your case, you need to build Spark

Re: How to write data into Hive partitioned Parquet table?

2014-10-15 Thread Banias H
I was tipped off by an expert that the "Unsupported language features in query" error I had was due to the fact that SparkSQL does not support dynamic partitions, and that I can do saveAsParquetFile() for each partition. My inefficient implementation is to: //1. run the query without DISTRIBUTE BY
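A sketch of that per-partition workaround (the column, table and path names are illustrative, not from the thread):

    // 1. run the query once, without DISTRIBUTE BY, and keep it around
    val data = sqlContext.sql("SELECT * FROM src")
    data.registerTempTable("result_tmp")
    data.cache()

    // 2. write one Parquet directory per partition value
    val parts = sqlContext.sql("SELECT DISTINCT part_col FROM result_tmp").map(_.getString(0)).collect()
    parts.foreach { v =>
      sqlContext.sql(s"SELECT * FROM result_tmp WHERE part_col = '$v'")
                .saveAsParquetFile(s"/warehouse/mytable/part_col=$v")
    }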