Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
I'm not caching the data. With each iteration I mean each 128 MB that an executor has to process. The code is pretty simple. final Conversor c = new Conversor(null, null, null, longFields, typeFields); SparkConf conf = new SparkConf().setAppName("Simple Application"); JavaSparkContext sc = new

Re: Tableau beta connector

2015-02-05 Thread Ashutosh Trivedi (MT2013030)
Okay. So the queries Tableau runs on the persisted data will go through Spark SQL to improve performance and to take advantage of Spark SQL. Thanks again, Denny. From: Denny Lee denny.g@gmail.com Sent: Thursday, February 5, 2015 1:27 PM To: Ashutosh

Re: pyspark - gzip output compression

2015-02-05 Thread Kane Kim
I'm getting SequenceFile doesn't work with GzipCodec without native-hadoop code! Where can I get those libs and where do I put them in Spark? Also, can I save a plain text file (like saveAsTextFile) as gzip? Thanks. On Wed, Feb 4, 2015 at 11:10 PM, Kane Kim kane.ist...@gmail.com wrote: How to save

Re: spark on ec2

2015-02-05 Thread Charles Feduke
I don't see anything that says you must explicitly restart them to load the new settings, but usually there is some sort of signal trapped [or brute force full restart] to get a configuration reload for most daemons. I'd take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on

Re: spark driver behind firewall

2015-02-05 Thread Kostas Sakellis
Yes, the driver has to be able to accept incoming connections. All the executors connect back to the driver sending heartbeats, map status, metrics. It is critical and I don't know of a way around it. You could look into using something like the https://github.com/spark-jobserver/spark-jobserver

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com wrote: Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job

Re: Can't access remote Hive table from spark

2015-02-05 Thread Zhan Zhang
Not sure about Spark standalone mode, but on Spark-on-YARN it should work. You can check the following link: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/ Thanks. Zhan Zhang On Feb 5, 2015, at 5:02 PM, Cheng Lian lian.cs@gmail.com wrote: Please note

spark streaming from kafka real time + batch processing in java

2015-02-05 Thread Mohit Durgapal
I want to write a spark streaming consumer for kafka in java. I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format. I am not sure how to achieve this. Should I write separate kafka consumers, one for writing data to HDFS and one for spark

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job Server be used for profiling Spark jobs?

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
When you say profiling, what are you trying to figure out? Why your Spark job is slow? Gatling seems to be a load-generating framework, so I'm not sure how you'd use it (I've never used it before). Spark runs on the JVM, so you can use any JVM profiling tool like YourKit. Kostas On Thu, Feb 5,

Re: Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Xuefeng Wu
What's the dump info from jstack? Yours, Xuefeng Wu 吴雪峰 On Feb 6, 2015, at 10:20 AM, Michael Albert m_albert...@yahoo.com.INVALID wrote: My apologies for following up my own post, but I thought this might be of interest. I terminated the java process corresponding to the executor which had

Re: how to debug this kind of error, e.g. lost executor?

2015-02-05 Thread Xuefeng Wu
Could you find the shuffle files? Or were the files deleted by other processes? Yours, Xuefeng Wu 吴雪峰 On Feb 5, 2015, at 11:14 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am running a heavy memory/cpu overhead graphx application, I think the memory is sufficient and set RDDs’

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
Yes, I want to know the reason why the job is slow. I will look at YourKit. Can you point me to a tutorial on how to use it? Thank You On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com wrote: When you say profiling, what are you trying to figure out? Why your

Re: RE: Can't access remote Hive table from spark

2015-02-05 Thread Skanda
Hi, My spark-env.sh has the following entries with respect to classpath: export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hive/lib/*:/etc/hive/conf/ -Skanda On Sun, Feb 1, 2015 at 11:45 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi Skanda, How do set up your SPARK_CLASSPATH? I add the

Re: Get filename in Spark Streaming

2015-02-05 Thread Emre Sevinc
Hello, Did you check the following? http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/ http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-td6551.html -- Emre Sevinç On Fri, Feb 6, 2015 at 2:16 AM, Subacini B

Can we execute create table and load data commands against Hive inside HiveContext?

2015-02-05 Thread guxiaobo1982
Hi, I am playing with the following example code: public class SparkTest { public static void main(String[] args){ String appName = "This is a test application"; String master = "spark://lix1.bh.com:7077"; SparkConf

Re: Shuffle read/write issue in spark 1.2

2015-02-05 Thread Raghavendra Pandey
Even I observed the same issue. On Fri, Feb 6, 2015 at 12:19 AM, Praveen Garg praveen.g...@guavus.com wrote: Hi, While moving from spark 1.1 to spark 1.2, we are facing an issue where Shuffle read/write has been increased significantly. We also tried running the job by rolling back to

Re: How many stages in my application?

2015-02-05 Thread Kostas Sakellis
Yes, there is no way right now to know how many stages a job will generate automatically. Like Mark said, RDD#toDebugString will give you some info about the RDD DAG and from that you can determine based on the dependency types (Wide vs. narrow) if there is a stage boundary. On Thu, Feb 5, 2015
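
For reference, a minimal Scala sketch of reading a stage boundary off RDD#toDebugString (assumes a local run; the sample data is made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD functions (needed before Spark 1.3)

    val sc = new SparkContext(new SparkConf().setAppName("toDebugString demo").setMaster("local[*]"))

    // map is a narrow dependency and stays in the same stage;
    // reduceByKey is a wide dependency and forces a shuffle, i.e. a stage boundary.
    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The indented lineage shows the ShuffledRDD on its own level,
    // which is where the new stage begins.
    println(counts.toDebugString)

    sc.stop()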

Re: Reg Job Server

2015-02-05 Thread Mark Hamstra
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I want to know the reason why the job is slow. I will look at YourKit. Can you point me to some

Re: Whether standalone spark support kerberos?

2015-02-05 Thread Kostas Sakellis
Standalone mode does not support talking to a kerberized HDFS. If you want to talk to a kerberized (secure) HDFS cluster i suggest you use Spark on Yarn. On Wed, Feb 4, 2015 at 2:29 AM, Jander g jande...@gmail.com wrote: Hope someone helps me. Thanks. On Wed, Feb 4, 2015 at 6:14 PM, Jander g

Re: spark on ec2

2015-02-05 Thread Kane Kim
Oh yeah, they picked up changes after restart, thanks! On Thu, Feb 5, 2015 at 8:13 PM, Charles Feduke charles.fed...@gmail.com wrote: I don't see anything that says you must explicitly restart them to load the new settings, but usually there is some sort of signal trapped [or brute force full

RE: Error KafkaStream

2015-02-05 Thread jishnu.prathap
Hi, If your message is a string you will have to change the Encoder and Decoder to StringEncoder, StringDecoder. If your message is byte[] you can use DefaultEncoder/DefaultDecoder. Also, don't forget to add import statements depending on your encoder and decoder. import

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I have a single node Spark standalone cluster. Will this also work for my cluster? Thank You On Fri, Feb 6, 2015 at 11:02 AM, Mark Hamstra m...@clearstorydata.com wrote: https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Thu, Feb 5, 2015 at 9:18

Reg Job Server

2015-02-05 Thread Deep Pradhan
Hi, Can Spark Job Server be used for profiling Spark jobs?

Spark Metrics Servlet for driver and executor

2015-02-05 Thread Judy Nash
Hi all, Looking at spark metricsServlet. What is the URL exposing the driver and executor JSON response? Found master and worker successfully, but can't find the URL that returns JSON for the other 2 sources. Thanks! Judy

Error KafkaStream

2015-02-05 Thread Eduardo Costa Alfaia
Hi Guys, I’m getting this error in KafkaWordCount; TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234): java.lang.ClassCastException: [B cannot be cast to java.lang.String

Re: Shuffle write increases in spark 1.2

2015-02-05 Thread Anubhav Srivastav
Hi Kevin, We seem to be facing the same problem as well. Were you able to find anything after that? The ticket does not seem to have progressed anywhere. Regards, Anubhav On 5 January 2015 at 10:37, 정재부 itsjb.j...@samsung.com wrote: Sure, here is a ticket.

Re: Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Jerry Lam
Hi Florin, I might be wrong but timestamp looks like a keyword in SQL that the engine gets confused with. If it is a column name of your table, you might want to change it. ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) I'm constantly working with CSV files with spark.

Re: Error KafkaStream

2015-02-05 Thread Sean Owen
Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same issue? On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi Guys, I’m getting this error in KafkaWordCount; TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234):

Re: Error KafkaStream

2015-02-05 Thread Eduardo Costa Alfaia
I don’t think so Sean. On Feb 5, 2015, at 16:57, Sean Owen so...@cloudera.com wrote: Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same issue? On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi Guys, I’m getting this error in

Re: pyspark - gzip output compression

2015-02-05 Thread Sean Owen
No, you can compress SequenceFile with gzip. If you are reading outside Hadoop then SequenceFile may not be a great choice. You can use the gzip codec with TextOutputFormat if you need to. On Feb 5, 2015 8:28 AM, Kane Kim kane.ist...@gmail.com wrote: I'm getting SequenceFile doesn't work with
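
For reference, a minimal Scala sketch of the same idea (the output path is a placeholder; a local master is assumed):

    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("gzip output").setMaster("local[*]"))
    val lines = sc.parallelize(Seq("first line", "second line"))

    // saveAsTextFile accepts a compression codec class, so each part file is gzipped text.
    lines.saveAsTextFile("/tmp/gzip-output", classOf[GzipCodec])

    sc.stop()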

Re: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Sean Owen
Do you mean disable the web UI? spark.ui.enabled=false Sure, it's useful with master = local[*] too. On Thu, Feb 5, 2015 at 9:30 AM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, It might sounds weird, but I think spark is perfect to be used as a multi-threading library in some cases.
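
A minimal Scala sketch of this setup, assuming Spark is embedded purely as a local computation engine:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("embedded-spark")
      .setMaster("local[*]")             // use all local cores
      .set("spark.ui.enabled", "false")  // do not start the web UI

    val sc = new SparkContext(conf)
    val sum = sc.parallelize(1 to 1000000).map(_.toLong).reduce(_ + _)
    println(sum)
    sc.stop()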

ZeroMQ and pyspark.streaming

2015-02-05 Thread Sasha Kacanski
Does pyspark support ZeroMQ? I see that Java does, but I am not sure about Python? regards -- Aleksandar Kacanski

Re: MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]

2015-02-05 Thread danilopds
I solved the question with this code: import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt").map { line => Vectors.dense(line.split(' ').map(_.toDouble)) }.cache() val cluster =

Re: Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
I'm submitting this on a cluster, with my usual setting of, export YARN_CONF_DIR=/etc/hadoop/conf It is working again after a small change to the code so I will see if I can reproduce the error (later today). On Thu, Feb 5, 2015 at 9:17 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Are

My first experience with Spark

2015-02-05 Thread java8964
I am evaluating Spark for our production usage. Our production cluster is Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment running with Hadoop. What I have in mind is to test a very complex Hive query, which joins between 6 tables, lots of nested structure with

word2vec more distributed

2015-02-05 Thread Alex Minnaar
I was wondering if there was any chance of getting a more distributed word2vec implementation. I seem to be running out of memory from big local data structures such as val syn1Global = new Array[Float](vocabSize * vectorSize) Is there any chance of getting a version where these are all

Errors in the workers machines

2015-02-05 Thread Spico Florin
Hello! I received the following errors in the workerLog.log files: ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660] - [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with [akka.tcp://sparkExecutor@stream4:47929]] [

Re: Zipping RDDs of equal size not possible

2015-02-05 Thread Niklas Wilcke
Hi Xiangrui, I'm sorry. I didn't recognize your mail. What I did is a workaround only working for my special case. It does not scale and only works for small data sets but that is fine for me so far. Kind Regards, Niklas def securlyZipRdds[A, B: ClassTag](rdd1: RDD[A], rdd2: RDD[B]): RDD[(A,

How to design a long live spark application

2015-02-05 Thread Shuai Zheng
Hi All, I want to develop a server-side application: user submits request → server runs Spark application and returns (this might take a few seconds). So I want to host the server to keep the long-lived context, I don't know whether this is reasonable or not. Basically I try to have a

Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Spico Florin
Hello! I'm using spark-csv 2.10 with Java from the maven repository: <groupId>com.databricks</groupId> <artifactId>spark-csv_2.10</artifactId> <version>0.1.1</version> I would like to use Spark-SQL to filter out my data. I'm using the following code: JavaSchemaRDD cars = new

Resources not uploaded when submitting job in yarn-client mode

2015-02-05 Thread weberste
Hi, I am trying to submit a job from a Windows system to a YARN cluster running on Linux (the HDP2.2 sandbox). I have copied the relevant Hadoop directories as well as the yarn-site.xml and mapred-site.xml to the Windows file system. Further, I have added winutils.exe to $HADOOP_HOME/bin. I can

how to debug this kind of error, e.g. lost executor?

2015-02-05 Thread Yifan LI
Hi, I am running a heavy memory/cpu overhead graphx application, I think the memory is sufficient and set RDDs’ StorageLevel using MEMORY_AND_DISK. But I found there were some tasks failed due to following errors: java.io.FileNotFoundException:

Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Shuai Zheng
Hi All, It might sound weird, but I think Spark is perfect to be used as a multi-threading library in some cases. The local mode will naturally spawn multiple threads when required. Because it is more strict and there is less chance of potential bugs in the code (because it is more data oriented,

streaming joining multiple streams

2015-02-05 Thread Zilvinas Saltys
The challenge I have is this. There's two streams of data where an event might look like this in stream1: (time, hashkey, foo1) and in stream2: (time, hashkey, foo2) The result after joining should be (time, hashkey, foo1, foo2) .. The join happens on hashkey and the time difference can be ~30

K-Means final cluster centers

2015-02-05 Thread SK
Hi, I am trying to get the final cluster centers after running the KMeans algorithm in MLlib in order to characterize the clusters. But the KMeansModel does not have any public method to retrieve this info. There appears to be only a private method called clusterCentersWithNorm. I guess I could

Re: K-Means final cluster centers

2015-02-05 Thread Frank Austin Nothaft
Unless I misunderstood your question, you’re looking for the val clusterCenters in http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel, no? Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Feb 5, 2015, at
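
For reference, a minimal sketch of training and reading the centers back (assumes an existing SparkContext sc, as in the spark-shell; the sample points are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(65.0, 220.0),
      Vectors.dense(73.0, 160.0),
      Vectors.dense(59.0, 110.0),
      Vectors.dense(61.0, 120.0)))

    val model = KMeans.train(points, 2, 20)  // k = 2, maxIterations = 20

    // clusterCenters is an Array[Vector], one entry per cluster.
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"cluster $i center: $center")
    }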

Re: How to design a long live spark application

2015-02-05 Thread Eugen Cepoi
Yes you can submit multiple actions from different threads to the same SparkContext. It is safe. Indeed what you want to achieve is quite common. Expose some operations over a SparkContext through HTTP. I have used spray for this and it just worked fine. At bootstrap of your web app, start a
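
A minimal Scala sketch of that pattern, with the HTTP layer left out: one long-lived SparkContext shared by several threads, each submitting its own action.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.{SparkConf, SparkContext}

    // Created once at application startup and reused for every request.
    val sc = new SparkContext(new SparkConf().setAppName("shared-context").setMaster("local[*]"))

    // Each call runs its own job on the shared context; submissions are thread-safe.
    def handleRequest(n: Int): Future[Long] = Future {
      sc.parallelize(1 to n).map(_.toLong).reduce(_ + _)
    }

    val results = Future.sequence(Seq(handleRequest(1000), handleRequest(2000)))
    println(Await.result(results, 1.minute))
    sc.stop()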

RE: How to design a long live spark application

2015-02-05 Thread Shuai Zheng
This example helps a lot :) But I am thinking of the case below: Assume I have a SparkContext as a global variable. Then if I use multiple threads to access/use it, will it mess up? For example: My code: public static List<Tuple2<Integer, Double>> run(JavaSparkContext sparkContext,

RE: My first experience with Spark

2015-02-05 Thread java8964
Finally I gave up after too many failed retries. From the log on the worker side, it looks like it failed with a JVM OOM, as below: 15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Driver Heartbeater,5,main] java.lang.OutOfMemoryError: Java heap

Re: K-Means final cluster centers

2015-02-05 Thread Suneel Marthi
There's a kMeansModel.clusterCenters() available if you are looking to get the centers from KMeansModel. From: SK skrishna...@gmail.com To: user@spark.apache.org Sent: Thursday, February 5, 2015 5:35 PM Subject: K-Means final cluster centers Hi, I am trying to get the final cluster

How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread YaoPau
I have a file badFullIPs.csv of bad IP addresses used for filtering. In yarn-client mode, I simply read it off the edge node, transform it, and then broadcast it: val badIPs = fromFile(edgeDir + "badfullIPs.csv") val badIPsLines = badIPs.getLines val badIpSet = badIPsLines.toSet

StreamingContext getOrCreate with queueStream

2015-02-05 Thread pnpritchard
I am trying to use the StreamingContext getOrCreate method in my app. I started by running the example ( RecoverableNetworkWordCount https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala ), which worked as

Best tools for visualizing Spark Streaming data?

2015-02-05 Thread Su She
Hello Everyone, I wanted to hear the community's thoughts on what (open-source) tools have been used to visualize data from Spark/Spark Streaming? I've taken a look at Zeppelin, but had some trouble working with it. Couple questions: 1) I've looked at a couple blog posts and it seems like

Re: maven doesn't build dependencies with Scala 2.11

2015-02-05 Thread Ted Yu
Now that Kafka 0.8.2.0 has been released, adding external/kafka module works. FYI On Sun, Jan 18, 2015 at 7:36 PM, Ted Yu yuzhih...@gmail.com wrote: bq. there was no 2.11 Kafka available That's right. Adding external/kafka module resulted in: [ERROR] Failed to execute goal on project

NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-05 Thread aanilpala
I have the following code: SparkConf conf = new SparkConf().setAppName("streamer").setMaster("local[2]"); conf.set("spark.driver.allowMultipleContexts", "true"); JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(batch_interval));

no option to add intercepts for StreamingLinearAlgorithm

2015-02-05 Thread jamborta
hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand that run method in the underlying GeneralizedLinearModel does not take intercept as a parameter either. Any reason for that? thanks, -- View this message in

Re: Is there a way to access Hive UDFs in a HiveContext?

2015-02-05 Thread jamborta
Hi, My guess is that Spark is not picking up the jar where the function is stored. You might have to add it to the SparkContext or the classpath manually. You can also register the function with hc.registerFunction("myfunct", myfunct) and then use it in the query. -- View this message in context:
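
If the UDF is a Hive UDF packaged in a jar, a hedged sketch of the HiveContext route looks like the following (jar path, class name, and table are placeholders; assumes an existing SparkContext sc):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // Ship the jar to the session and register the UDF before querying with it.
    hc.sql("ADD JAR /path/to/my-udfs.jar")
    hc.sql("CREATE TEMPORARY FUNCTION myfunct AS 'com.example.hive.MyFunct'")
    val result = hc.sql("SELECT myfunct(col) FROM my_table")
    result.collect().foreach(println)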

Re: Tableau beta connector

2015-02-05 Thread Denny Lee
Could you clarify what you mean by "build another Spark and work through spark-submit"? If you are referring to utilizing Spark and Thrift, you could start the Spark service and then have your spark-shell, spark-submit, and/or thrift service aim at the master you have started. On Thu Feb 05

Re: word2vec: how to save an mllib model and reload it?

2015-02-05 Thread Carsten Schnober
As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in our Hadoop cluster and here's my issue with classic Java serialization of the model: I don't have SSH access to the cluster master node. Here's my code for computing the model: val input =
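
One workaround when the master's local disk is unreachable is to serialize the model straight to HDFS; a hedged sketch, assuming an existing SparkContext sc and that the model is Java-serializable (Word2VecModel is in this release):

    import java.io.ObjectOutputStream
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.mllib.feature.Word2VecModel

    // Hypothetical helper: write a trained model to HDFS instead of the master's local disk.
    def saveModelToHdfs(model: Word2VecModel, uri: String): Unit = {
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val out = new ObjectOutputStream(fs.create(new Path(uri)))
      try out.writeObject(model) finally out.close()
    }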

Re: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Arush Kharbanda
You can use Akka, which is the underlying multithreading library Spark uses. On Thu, Feb 5, 2015 at 9:56 PM, Shuai Zheng szheng.c...@gmail.com wrote: Nice. I just tried it and it works. Thanks very much! And I noticed the following in the log: 15/02/05 11:19:09 INFO Remoting: Remoting started;

Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
While a spark-submit job is setting up, the yarnAppState goes into Running mode, then I get a flurry of typical looking INFO-level messages such as INFO BlockManagerMasterActor: ... INFO YarnClientSchedulerBackend: Registered executor: ... Then, spark-submit quits without any error message and

RE: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Shuai Zheng
Nice. I just tried it and it works. Thanks very much! And I noticed the following in the log: 15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@NY02913D.global.local:8162] 15/02/05 11:19:10 INFO AkkaUtils: Connecting to HeartbeatReceiver:

Re: how to debug this kind of error, e.g. lost executor?

2015-02-05 Thread Yifan LI
Anyone has idea on where I can find the detailed log of that lost executor(why it was lost)? Thanks in advance! On 05 Feb 2015, at 16:14, Yifan LI iamyifa...@gmail.com wrote: Hi, I am running a heavy memory/cpu overhead graphx application, I think the memory is sufficient and set

Re: How to design a long live spark application

2015-02-05 Thread Charles Feduke
If you want to design something like the Spark shell, have a look at: http://zeppelin-project.org/ It's open source and may already do what you need. If not, its source code will be helpful in answering the questions about how to integrate with long running jobs that you have. On Thu Feb 05 2015 at

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke charles.fed...@gmail.com wrote: If you want to design something like Spark shell have a look at:

Re: how to debug this kind of error, e.g. lost executor?

2015-02-05 Thread Ankur Srivastava
Li, I cannot tell you the reason for this exception but have seen these kind of errors when using HASH based shuffle manager (which is default) until v 1.2. Try the SORT shuffle manager. Hopefully that will help Thanks Ankur Anyone has idea on where I can find the detailed log of that lost

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-05 Thread Cheng Lian
Hi Jenny, You may try to use --files $SPARK_HOME/conf/hive-site.xml --driver-class-path hive-site.xml when submitting your application. The problem is that when running in cluster mode, the driver is actually running in a random container directory on a random executor node. By using

Re: Can't access remote Hive table from spark

2015-02-05 Thread Cheng Lian
Please note that Spark 1.2.0 only supports Hive 0.13.1 or 0.12.0; no other versions are supported. Best, Cheng On 1/25/15 12:18 AM, guxiaobo1982 wrote: Hi, I built and started a single node standalone Spark 1.2.0 cluster along with a single node Hive 0.14.0 instance installed by

spark driver behind firewall

2015-02-05 Thread Kane Kim
I submit spark job from machine behind firewall, I can't open any incoming connections to that box, does driver absolutely need to accept incoming connections? Is there any workaround for that case? Thanks.

RE: My first experience with Spark

2015-02-05 Thread java8964
Hi Deb, From what I found searching online, changing parallelism is one option. But the failed stage already had 200 tasks, which is quite large for a single 24-core box. I know querying that amount of data on one box is kind of over the top, but I do want to know how to configure it to use less memory, even if it could mean

Get filename in Spark Streaming

2015-02-05 Thread Subacini B
Hi All, We have filenames with a timestamp, say ABC_1421893256000.txt, and the timestamp needs to be extracted from the file name for further processing. Is there a way to get the input file name picked up by a Spark Streaming job? Thanks in advance Subacini

Re: Error KafkaStream

2015-02-05 Thread Eduardo Costa Alfaia
Hi Shao, When I changed to StringDecoder I got this compile error: [error] /sata_disk/workspace/spark-1.2.0/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:78: not found: type StringDecoder [error] KafkaUtils.createStream[String, String,

RE: Error KafkaStream

2015-02-05 Thread Shao, Saisai
Did you include Kafka jars? This StringDecoder is under kafka/serializer, You can refer to the unit test KafkaStreamSuite in Spark to see how to use this API. Thanks Jerry From: Eduardo Costa Alfaia [mailto:e.costaalf...@unibs.it] Sent: Friday, February 6, 2015 9:44 AM To: Shao, Saisai Cc: Sean

Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Michael Albert
Greetings! Again, thanks to all who have given suggestions. I am still trying to diagnose a problem where I have processes that run for one or several hours but intermittently stall or hang. By stall I mean that there is no CPU usage on the workers or the driver, nor network activity, nor do I

Re: StreamingContext getOrCreate with queueStream

2015-02-05 Thread Tathagata Das
I don't think your screenshots came through in the email. Nonetheless, queueStream will not work with getOrCreate. It's mainly for testing (by generating your own RDDs) and not really useful for production usage (where you really need checkpoint-based recovery). TD On Thu, Feb 5, 2015 at 4:12

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread Sandy Ryza
Hi Jon, You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there. -Sandy On Thu, Feb 5, 2015 at 3:18 PM, YaoPau jonrgr...@gmail.com wrote: I have a file badFullIPs.csv of bad IP addresses used for filtering. In yarn-client mode, I
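
A minimal sketch of that approach (HDFS path and input records are placeholders; the master is taken from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-filter"))

    // Read the filter file from HDFS on the driver and collect it into a set.
    val badIpSet = sc.textFile("hdfs:///data/badFullIPs.csv").map(_.trim).collect().toSet
    val badIps = sc.broadcast(badIpSet)

    // Executors read the broadcast value instead of re-reading the file.
    val records = sc.parallelize(Seq("1.2.3.4\tGET /", "5.6.7.8\tPOST /login"))
    val clean = records.filter(line => !badIps.value.contains(line.split("\t")(0)))
    println(clean.count())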

Re: Parquet compression codecs not applied

2015-02-05 Thread Cheng Lian
Hi Ayoub, The doc page isn't wrong, but it's indeed confusing. spark.sql.parquet.compression.codec is used when you're writing a Parquet file with something like data.saveAsParquetFile(...). However, you are using Hive DDL in the example code. All Hive DDLs and commands like SET are

RE: Error KafkaStream

2015-02-05 Thread Shao, Saisai
Hi, I think you should change the `DefaultDecoder` of your type parameter into `StringDecoder`, seems you want to decode the message into String. `DefaultDecoder` is to return Array[Byte], not String, so here class casting will meet error. Thanks Jerry -Original Message- From:
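
For reference, a hedged sketch of the corrected stream creation (ZooKeeper address, group id, and topic are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaWordCount"), Seconds(2))
    val kafkaParams = Map("zookeeper.connect" -> "localhost:2181", "group.id" -> "kafka-wordcount")
    val topics = Map("wordcount" -> 1)

    // Both key and value are decoded as String instead of the default Array[Byte].
    val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()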

Requested array size exceeds VM limit Error

2015-02-05 Thread Muttineni, Vinay
Hi, I have a 170 GB tab-delimited data set which I am converting into the RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be used for training a GBT model. I got the "Size exceeds Integer.MAX_VALUE" error, which I fixed by repartitioning the data set to 1000

Re: How many stages in my application?

2015-02-05 Thread Joe Wass
Thanks Akhil and Mark. I can of course count events (assuming I can deduce the shuffle boundaries), but like I said the program isn't simple and I'd have to do this manually every time I change the code. So I'd rather find a way of doing this automatically if possible. On 4 February 2015 at 19:41,

Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
RDD#toDebugString will help. On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass jw...@crossref.org wrote: Thanks Akhil and Mark. I can of course count events (assuming I can deduce the shuffle boundaries), but like I said the program isn't simple and I'd have to do this manually every time I change the

Re: how to specify hive connection options for HiveContext

2015-02-05 Thread Arush Kharbanda
Hi, Are you trying to run a Spark job from inside Eclipse, and want the job to access Hive configuration options in order to access Hive tables? Thanks Arush On Tue, Feb 3, 2015 at 7:24 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, I know two options, one for spark_submit, the other one for

Spark SQL - Not able to create schema RDD for nested Directory for specific directory names

2015-02-05 Thread Nishant Patel
Hi, I am seeing strange behavior. When I create a schema RDD for a nested directory, sometimes it works and sometimes it does not. My question is whether nested directories are supported or not? My code is as below. val fileLocation = "hdfs://localhost:9000/apps/hive/warehouse/hl7" val parquetRDD =

Re: Errors in the workers machines

2015-02-05 Thread Arush Kharbanda
1. For what reasons is Spark using the above ports? What internal component is triggering them? - Akka (guessing from the error log) is used to schedule tasks and to notify executors; the ports used are random by default. 2. How can I get rid of these errors? - Probably the ports are not open on

Pyspark Hbase scan.

2015-02-05 Thread Castberg , René Christian
Hi, I am trying to do an HBase scan and read it into a Spark RDD using pyspark. I have successfully written data to HBase from pyspark, and been able to read a full table from HBase using the python example code. Unfortunately I am unable to find any example code for doing an HBase scan and

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread kundan kumar
The problem got resolved after removing all the configuration files from all the slave nodes. Earlier we were running in standalone mode and that led to duplicating the configuration on all the slaves. Once that was done it ran as expected in cluster mode. Although performance is not up to

Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi, In fact, this pull https://github.com/apache/spark/pull/3920 is to do Hbase scan. However, it is not merged yet. You can also take a look at the example code at http://spark-packages.org/package/20 which is using scala and python to read data from hbase. Hope this can be helpful. Cheers Gen

Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
And the Job page of the web UI will give you an idea of stages completed out of the total number of stages for the job. That same information is also available as JSON. Statically determining how many stages a job logically comprises is one thing, but dynamically determining how many stages

Re: Spark stalls or hangs: is this a clue? remote fetches seem to never return?

2015-02-05 Thread Michael Albert
My apologies for following up my own post, but I thought this might be of interest. I terminated the java process corresponding to the executor which had opened the stderr file mentioned below (kill pid). Then my spark job completed without error (it was actually almost finished). Now I am

spark on ec2

2015-02-05 Thread Kane Kim
Hi, I'm trying to change setting as described here: http://spark.apache.org/docs/1.2.0/ec2-scripts.html export SPARK_WORKER_CORES=6 Then I ran ~/spark-ec2/copy-dir /root/spark/conf to distribute to slaves, but without any effect. Do I have to restart workers? How to do that with spark-ec2?

Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
Any idea why, if I use more containers, I get a lot of stops because of GC? 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com: I'm not caching the data. With each iteration I mean each 128 MB that an executor has to process. The code is pretty simple. final Conversor c = new

Re: How to design a long live spark application

2015-02-05 Thread Chip Senkbeil
Hi, You can also check out the Spark Kernel project: https://github.com/ibm-et/spark-kernel It can plug into the upcoming IPython 3.0 notebook (providing a Scala/Spark language interface) and provides an API to submit code snippets (like the Spark Shell) and get results directly back, rather

get null pointer exception in newAPIHadoopRDD.map()

2015-02-05 Thread oxpeople
I modified the code based on CassandraCQLTest to get the area code count by time zone. I got an error on creating the new map RDD. Any help is appreciated. Thanks. ... val arecodeRdd = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[CqlPagingInputFormat],

Shuffle Dependency Casting error

2015-02-05 Thread aanilpala
Hi, I am working on a text mining project and I want to use NaiveBayesClassifier of MLlib to classify some stream items. So, I have two Spark contexts one of which is a streaming context. Everything looks fine if I comment out train and predict methods, it works fine although doesn't obviously do

Re: Shuffle Dependency Casting error

2015-02-05 Thread VISHNU SUBRAMANIAN
Hi, Could you share the code snippet. Thanks, Vishnu On Thu, Feb 5, 2015 at 11:22 PM, aanilpala aanilp...@gmail.com wrote: Hi, I am working on a text mining project and I want to use NaiveBayesClassifier of MLlib to classify some stream items. So, I have two Spark contexts one of which is a

Re: get null pointer exception in newAPIHadoopRDD.map()

2015-02-05 Thread Ted Yu
Is it possible that value.get("area_code") or value.get("time_zone") returned null? On Thu, Feb 5, 2015 at 10:58 AM, oxpeople vincent.y@bankofamerica.com wrote: I modified the code based on CassandraCQLTest to get the area code count by time zone. I got an error on creating the new map RDD.

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread Kostas Sakellis
Kundan, So I think your configuration here is incorrect. We need to adjust memory and #executors. So for your case you have a cluster setup of 5 nodes, 16 GB RAM, 8 cores each. The number of executors should be the total number of nodes in your cluster - in your case 5. As for --num-executor-cores it should

Re: LeaseExpiredException while writing schemardd to hdfs

2015-02-05 Thread Petar Zecevic
Why don't you just map the RDD's rows to lines and then call saveAsTextFile()? On 3.2.2015. 11:15, Hafiz Mujadid wrote: I want to write a whole SchemaRDD to a single file in HDFS but am facing the following exception
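
A minimal sketch of that suggestion, with a tiny stand-in SchemaRDD (output path is a placeholder; coalescing to one partition is only reasonable for small results):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc
    val schemaRdd = sqlContext.jsonRDD(sc.parallelize(Seq(
      """{"name":"alice","age":30}""", """{"name":"bob","age":25}""")))

    schemaRdd
      .map(row => row.mkString("\t"))  // each Row becomes one tab-separated line
      .coalesce(1)                     // one partition -> one part file
      .saveAsTextFile("hdfs:///tmp/single-file-output")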

MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]

2015-02-05 Thread danilopds
Hi, I'm learning Spark and testing the Spark MLlib library with algorithm K-means. So, I created a file height-weight.txt like this: 65.0 220.0 73.0 160.0 59.0 110.0 61.0 120.0 ... And the code (executed in spark-shell): import org.apache.spark.mllib.clustering.KMeans import

  1   2   >