Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
While the job is running, just look in the directory and see what's the root cause of it (is it the logs? is it the shuffle? etc.). Here are a few configuration options you can try: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation:
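
A minimal sketch of those two ideas in conf/spark-defaults.conf (property names as of Spark 1.x; the values are illustrative assumptions, not recommendations):

    # keep shuffle data in memory instead of spilling (can OOM on large shuffles)
    spark.shuffle.spill                          false
    # roll executor logs by size and keep only the last few files
    spark.executor.logs.rolling.strategy         size
    spark.executor.logs.rolling.maxSize          134217728
    spark.executor.logs.rolling.maxRetainedFiles 5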

Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
You can also set these in the spark-env.sh file: export SPARK_WORKER_DIR=/mnt/spark/ export SPARK_LOCAL_DIR=/mnt/spark/ Thanks Best Regards On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: While the job is running, just look in the directory and see what's

Re: Unable to start spark-sql

2015-07-06 Thread Akhil Das
It's complaining about a missing JDBC driver. Add it to your driver classpath like: ./bin/spark-sql --driver-class-path /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar Thanks Best Regards On Mon, Jul 6, 2015 at 11:42 AM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I am

Re: JDBC Streams

2015-07-05 Thread Akhil Das
If you want a long-running application, then go with Spark Streaming (which kind of blocks your resources). On the other hand, if you use the job server then you can actually use the resources (CPUs) for other jobs too when your DB job is not using them. Thanks Best Regards On Sun, Jul 5, 2015 at

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Akhil Das
Looks like it spent more time writing/transferring the 40GB of shuffle when you used Kryo. And surprisingly, JavaSerializer has 700MB of shuffle? Thanks Best Regards On Sun, Jul 5, 2015 at 12:01 PM, Gavin Liu ilovesonsofanar...@gmail.com wrote: Hi, I am using TeraSort benchmark from

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With the binary I think it might not be possible, although if you download the sources and build them yourself, then you can remove this function https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext.

Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a JIRA; not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote: i am

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing. Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote: In the driver when running spark-submit with --master yarn-client On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, Anyone build spark 1.4 source code for sparkR with maven/sbt, what's comand ? using

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it return null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, We have an application that requires a username/password to

Re: Making Unpersist Lazy

2015-07-02 Thread Akhil Das
RDDs which are no longer required will be removed from memory by Spark itself (which you can consider as lazy?). Thanks Best Regards On Wed, Jul 1, 2015 at 7:48 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, The current behavior of rdd.unpersist() appears to not be lazily executed and

Re: Convert CSV lines to List of Objects

2015-07-02 Thread Akhil Das
Have a look at sc.wholeTextFiles; you can use it to read the whole CSV contents into the value, then split it on \n, add the lines to a list and return it. *sc.wholeTextFiles:* Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported
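
A minimal Scala sketch of that approach (the Person case class, the path and the comma delimiter are assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    case class Person(name: String, age: Int)   // assumed record layout

    object CsvToObjects {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CsvToObjects"))
        val people = sc.wholeTextFiles("hdfs:///data/csv")        // (fileName, fileContent) pairs
          .flatMap { case (_, content) => content.split("\n") }   // one element per CSV line
          .map(_.split(","))
          .map(cols => Person(cols(0), cols(1).trim.toInt))
        people.take(5).foreach(println)
        sc.stop()
      }
    }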

Re: output folder structure not getting commited and remains as _temporary

2015-07-01 Thread Akhil Das
Looks like a jar conflict to me. java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData.getBytesWritten() You have multiple versions of the same jars in the classpath. Thanks Best Regards On Wed, Jul 1, 2015 at 6:58 AM, nkd kalidas.nimmaga...@gmail.com

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Akhil Das
It says: Caused by: java.net.ConnectException: Connection refused: slave2/...:54845 Could you look in the executor logs (stderr on slave2) and see what made it shut down? Since you are doing a join there's a high possibility of OOM etc. Thanks Best Regards On Wed, Jul 1, 2015 at 10:20 AM,

Re: Spark run errors on Raspberry Pi

2015-07-01 Thread Akhil Das
Now I'm having a strange urge to try this on KBOX http://kevinboone.net/kbox.html :/ Thanks Best Regards On Wed, Jul 1, 2015 at 9:10 AM, Exie tfind...@prodevelop.com.au wrote: FWIW, I had some trouble getting Spark running on a Pi. My core problem was using snappy for compression as it

Re: Run multiple Spark jobs concurrently

2015-07-01 Thread Akhil Das
Have a look at https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Wed, Jul 1, 2015 at 12:01 PM, Nirmal Fernando nir...@wso2.com wrote: Hi All, Is there any additional configs that we have to do to perform $subject? -- Thanks regards, Nirmal Associate

Re: Can I do Joins across Event Streams ?

2015-07-01 Thread Akhil Das
Have a look at the window and updateStateByKey operations. If you are looking for something more sophisticated, you can actually persist these streams in an intermediate storage (say for x duration) like HBase, Cassandra or any other DB, and do global aggregations with those. Thanks
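
For reference, a minimal updateStateByKey sketch that keeps a running count per key (ssc and the events DStream[(String, Int)] are assumed to exist already):

    import org.apache.spark.streaming.StreamingContext._

    ssc.checkpoint("hdfs:///tmp/checkpoints")      // stateful operations require checkpointing

    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))     // add this batch's counts to the running total

    val runningCounts = events.updateStateByKey(updateFunc)
    runningCounts.print()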

Re: Difference between spark-defaults.conf and SparkConf.set

2015-07-01 Thread Akhil Das
.addJar works for me when I run it as a stand-alone application (without using spark-submit) Thanks Best Regards On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath

Re: Issues in reading a CSV file from local file system using spark-shell

2015-07-01 Thread Akhil Das
Since it's a Windows machine, you are very likely to be hitting this one https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Jul 1, 2015 at 12:36 AM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, I'm running Spark 1.4.0 without Hadoop. I'm using the binary

Re: Checkpoint support?

2015-06-30 Thread Akhil Das
Have a look at the StageInfo https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.StageInfo class; it has a stageFailed method. You could make use of it. I don't understand the point of restarting the entire application. Thanks Best Regards On Tue, Jun 30, 2015 at

Re: Error while installing spark

2015-06-30 Thread Akhil Das
How much memory do you have on that machine? You can increase the heap space with *export _JAVA_OPTIONS=-Xmx2g* Thanks Best Regards On Tue, Jun 30, 2015 at 11:00 AM, Chintan Bhatt chintanbhatt...@charusat.ac.in wrote: Facing following error message while performing sbt/sbt assembly Error

Re: got java.lang.reflect.UndeclaredThrowableException when running multiply APPs in spark

2015-06-30 Thread Akhil Das
This: Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] could happen for many reasons; one of them could be insufficient memory. Are you running all 20 apps on the same node? How are you submitting the apps (with spark-submit?)? I see you have

Re: s3 bucket access/read file

2015-06-30 Thread Akhil Das
Try this way: val data = sc.textFile(s3n://ACCESS_KEY:SECRET_KEY@mybucket/temp/) Thanks Best Regards On Mon, Jun 29, 2015 at 11:59 PM, didi did...@gmail.com wrote: Hi *Cant read text file from s3 to create RDD * after setting the configuration val

Re: problem for submitting job

2015-06-29 Thread Akhil Das
Cool. On 29 Jun 2015 21:10, 郭谦 buptguoq...@gmail.com wrote: Akhil Das, You give me a new idea to solve the problem. Vova provides me a way to solve the problem just before Vova Shelgunovvvs...@gmail.com Sample code for submitting job from any other java app, e.g. servlet: http

Re: spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-29 Thread Akhil Das
Here's a bunch of configuration options for that https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior Thanks Best Regards On Fri, Jun 26, 2015 at 10:37 PM, igor.berman igor.ber...@gmail.com wrote: Hi, wanted to get some advice regarding tuning a spark application I see for some of

Re: Master dies after program finishes normally

2015-06-29 Thread Akhil Das
Which version of spark are you using? You can try changing the heap size manually by *export _JAVA_OPTIONS=-Xmx5g * Thanks Best Regards On Fri, Jun 26, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I just encountered the same problem, when I run a PageRank program which has lots

Re: problem for submitting job

2015-06-29 Thread Akhil Das
You can create a SparkContext in your program and run it as a standalone application without using spark-submit. Here's something that will get you started: //Create SparkContext val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("Test")
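
A complete, minimal version of that idea (the master URL and jar path are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    object StandaloneApp {
      def main(args: Array[String]): Unit = {
        val sconf = new SparkConf()
          .setMaster("spark://spark-ak-master:7077")   // your standalone master URL
          .setAppName("Test")
          .setJars(Seq("/path/to/your-app.jar"))       // ship the application jar to the executors
        val sc = new SparkContext(sconf)
        println(sc.parallelize(1 to 1000).filter(_ % 2 == 0).count())
        sc.stop()
      }
    }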

Re:

2015-06-26 Thread Akhil Das
The input size is 512.0 MB (hadoop) / 4159106. Can this be reduced to 64 MB so as to increase the number of tasks? Similar to the split size that increases the number of mappers in Hadoop M/R. On Thu, Jun 25, 2015 at 12:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Look in the tuning

Re: Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-26 Thread Akhil Das
Try to add them in the SPARK_CLASSPATH in your conf/spark-env.sh file Thanks Best Regards On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang binwang...@gmail.com wrote: I am trying to run the Spark example code HBaseTest from command line using spark-submit instead run-example, in that case, I can

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Akhil Das
You just need to set your HADOOP_HOME, which appears to be null in the stack trace. If you don't have winutils.exe, then you can download https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip and put it there. Thanks Best Regards On Thu, Jun 25, 2015 at 11:30 PM, Ashic

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring to here? Spark can connect with almost all the databases out there (you just need to pass the Input/Output Format classes, or there are a bunch of connectors also available). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust

Re: Performing sc.paralleize (..) in workers not in the driver program

2015-06-26 Thread Akhil Das
Why do you want to do that? Thanks Best Regards On Thu, Jun 25, 2015 at 10:16 PM, shahab shahab.mok...@gmail.com wrote: Hi, Apparently, sc.paralleize (..) operation is performed in the driver program not in the workers ! Is it possible to do this in worker process for the sake of

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
It's a Scala version conflict; can you paste your build.sbt file? Thanks Best Regards On Fri, Jun 26, 2015 at 7:05 AM, stati srikanth...@gmail.com wrote: Hello, When I run a spark job with spark-submit it fails with below exception for code line /*val webLogDF =

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Akhil Das
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream( jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ); Here: jssc = JavaStreamingContext String.class = Key ,

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
Are those provided Spark libraries compatible with Scala 2.11? Thanks Best Regards On Fri, Jun 26, 2015 at 4:48 PM, Srikanth srikanth...@gmail.com wrote: Thanks Akhil for checking this out. Here is my build.sbt. name := Weblog Analysis version := 1.0 scalaVersion := 2.11.5 javacOptions

Re:

2015-06-25 Thread Akhil Das
(๏̯͡๏) deepuj...@gmail.com wrote: Its taking an hour and on Hadoop it takes 1h 30m, is there a way to make it run faster ? On Wed, Jun 24, 2015 at 11:39 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Cool. :) On 24 Jun 2015 23:44, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Its running now

Re: Can Spark1.4 work with CDH4.6

2015-06-25 Thread Akhil Das
a different guava dependency but the error does go away this way On Wed, Jun 24, 2015 at 10:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try to add those jars in the SPARK_CLASSPATH and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best Regards On Thu, Jun 25, 2015 at 12:39 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, I have installed spark1.4 and want to use sparkR. Assume the spark master ip = node1; how to start sparkR? and submit a job to

Re: Akka failures: Driver Disassociated

2015-06-25 Thread Akhil Das
Can you look in the worker logs and see what's going on? It may be that you ran out of disk space etc. Thanks Best Regards On Thu, Jun 25, 2015 at 12:08 PM, barmaley o...@solver.com wrote: I'm running Spark 1.3.1 on AWS... Having long-running application (spark context) which accepts and

Re: Killing Long running tasks (stragglers)

2015-06-25 Thread Akhil Das
That totally depends on the way you extract the data. It will be helpful if you can paste your code so that we will understand it better. Thanks Best Regards On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell wferr...@gmail.com wrote: Hello - I am using Apache Spark 1.2.1 via pyspark. Thanks

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
Is this the official R package? It is written: *NOTE: The API from the upcoming Spark release (1.4) will not have the same API as described here.* Thanks, JC 2015-06-25 10:55 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best

Re: Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-24 Thread Akhil Das
Depending on the size of the memory you have, you could allocate 60-80% of the memory for the Spark worker process. The datanode doesn't require too much memory. On 23 Jun 2015 21:26, maxdml max...@cs.duke.edu wrote: I'm wondering if there is a real benefit for splitting my memory in two for

Re:

2015-06-24 Thread Akhil Das
Can you look a bit more into the error logs? It could be getting killed because of OOM etc. One thing you can try is to set spark.shuffle.blockTransferService to nio instead of netty. Thanks Best Regards On Wed, Jun 24, 2015 at 5:46 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a Spark job

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Akhil Das
Can you try to add those jars in the SPARK_CLASSPATH and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was

Re: kafka spark streaming with mesos

2015-06-24 Thread Akhil Das
A screenshot of your framework running would also be helpful. How many cores does it have? Did you try running it in coarse-grained mode? Try adding these to the conf: sparkConf.set(spark.mesos.coarse, true) sparkConf.set(spark.cores.max, 2) Thanks Best Regards On Wed, Jun 24, 2015 at 1:35 AM,

Re:

2015-06-24 Thread Akhil Das
) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) On Wed, Jun 24, 2015 at 7:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you look a bit more in the error logs? It could be getting

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Why don't you do a normal .saveAsTextFiles? Thanks Best Regards On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla anshushuk...@gmail.com wrote: Thanx for reply !! YES , Either it should write on any machine of cluster or Can you please help me ... that how to do this . Previously i was

Re: Spark job fails silently

2015-06-23 Thread Akhil Das
Looks like a hostname conflict to me. 15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface eth0) 15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Can you paste

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Akhil Das
Maybe while producing the messages you can make each one a KeyedMessage with the timestamp as the key, and on the consumer end you can easily read the key (which will be the timestamp) from the message. If the network is fast enough, then I think there would only be a small millisecond lag. Thanks Best
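
A producer-side sketch with the old Kafka 0.8 API, putting the send timestamp in as the message key (the broker address and topic name are assumptions):

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("key.serializer.class", "kafka.serializer.StringEncoder")

    val producer = new Producer[String, String](new ProducerConfig(props))
    // key = producer-side timestamp; the streaming consumer reads it back from the (key, value) pair
    producer.send(new KeyedMessage[String, String]("events", System.currentTimeMillis.toString, "some event"))
    producer.close()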

Re: What does [Stage 0: (0 + 2) / 2] mean on the console

2015-06-23 Thread Akhil Das
Well, you could say that (the stage information) is an ASCII representation of the Web UI (running on port 4040). Since you set local[4] you will have 4 threads for your computation, and since you have 2 receivers, you are left with 2 threads to process ((0 + 2) -- this 2 is your 2 threads). And the

Re: Programming with java on spark

2015-06-23 Thread Akhil Das
Did you happen to try this? JavaPairRDD<Integer, String> hadoopFile = sc.hadoopFile( /sigmoid, DataInputFormat.class, LongWritable.class, Text.class) Thanks Best Regards On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 yadanfu1...@gmail.com wrote: Hello, everyone! I'm new to Spark.

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPUs per job; then you can easily accommodate your third job as well. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła w.pit...@gmail.com wrote: I have set up a small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you
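
A small sketch of capping one application so the others still get cores (the numbers are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("StreamOne")
      .set("spark.cores.max", "10")          // at most 10 cores cluster-wide for this application
      .set("spark.executor.memory", "2g")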

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Yes. Thanks Best Regards On Mon, Jun 22, 2015 at 8:33 PM, Murthy Chelankuri kmurt...@gmail.com wrote: I have more than one jar. can we set sc.addJar multiple times with each dependent jar ? On Mon, Jun 22, 2015 at 8:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try sc.addJar instead

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
You can use fileStream for that; look at the XmlInputFormat https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java from Mahout. It should give you the full XML object as one record (as opposed to an XML

Re: s3 - Can't make directory for path

2015-06-22 Thread Akhil Das
Could you elaborate a bit more? What do you mean by setting up a standalone server? And what is leading you to those exceptions? Thanks Best Regards On Mon, Jun 22, 2015 at 2:22 AM, nizang ni...@windward.eu wrote: hi, I'm trying to setup a standalone server, and in one of my tests, I got the

Re: memory needed for each executor

2015-06-22 Thread Akhil Das
Totally depends on the use-case that you are solving with Spark, for instance there was some discussion around the same which you could read over here http://apache-spark-user-list.1001560.n3.nabble.com/How-does-one-decide-no-of-executors-cores-memory-allocation-td23326.html Thanks Best Regards

Re: JavaDStreamString read and write rdbms

2015-06-22 Thread Akhil Das
It's pretty straightforward; this would get you started http://stackoverflow.com/questions/24896233/how-to-save-apache-spark-schema-output-in-mysql-database Thanks Best Regards On Mon, Jun 22, 2015 at 12:39 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi Team, How to split and

Re: Serializer not switching

2015-06-22 Thread Akhil Das
How are you submitting the application? Could you paste the code that you are running? Thanks Best Regards On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay sesnbarzi...@gmail.com wrote: I am trying to run a function on every line of a parquet file. The function is in an object. When I run the

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng fengyong...@gmail.com wrote: Thanks a lot, Akhil I saw this mail thread before, but still do not understand how

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
Option 1 should be fine; Option 2 would be bound a lot by the network as the data increases over time. Thanks Best Regards On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi All , What is the best way to install a Spark Cluster alongside a Hadoop Cluster , Any

Re: Spark Titan

2015-06-21 Thread Akhil Das
Have a look at http://s3.thinkaurelius.com/docs/titan/0.5.0/titan-io-format.html You could use those Input/Output formats with newAPIHadoopRDD api call. Thanks Best Regards On Sun, Jun 21, 2015 at 8:50 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi, How to connect TItan

Re: Local spark jars not being detected

2015-06-20 Thread Akhil Das
Not sure, but try removing the provided scope, or create a lib directory in the project home and put that jar there. On 20 Jun 2015 18:08, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Hi, I'm using IntelliJ ide for my spark project. I've compiled spark 1.3.0 for scala 2.11.4 and

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-19 Thread Akhil Das
One workaround would be to remove/move the files from the input directory once you have processed them. Thanks Best Regards On Fri, Jun 19, 2015 at 5:48 AM, Haopu Wang hw...@qilinsoft.com wrote: Akhil, From my test, I can see the files in the last batch will always be reprocessed upon

Re: Build spark application into uber jar

2015-06-19 Thread Akhil Das
This is how I used to build an assembly jar with sbt. Your build.sbt file would look like this: *import AssemblyKeys._* *assemblySettings* *name := FirstScala* *version := 1.0* *scalaVersion := 2.10.4* *libraryDependencies += org.apache.spark %% spark-core % 1.3.1* *libraryDependencies +=
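
A complete build.sbt along those lines (sbt-assembly 0.11.x style; the versions are assumptions, adjust them to your Spark/Scala setup):

    import AssemblyKeys._

    assemblySettings

    name := "FirstScala"

    version := "1.0"

    scalaVersion := "2.10.4"

    // "provided" keeps spark-core out of the uber jar, since the cluster already ships it
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"

With project/plugins.sbt containing addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2"), running sbt assembly produces the uber jar under target/.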

Re: how to change /tmp folder for spark ut use sbt

2015-06-19 Thread Akhil Das
You can try setting these properties: .set(spark.local.dir,/mnt/spark/) .set(java.io.tmpdir,/mnt/spark/) Thanks Best Regards On Fri, Jun 19, 2015 at 8:28 AM, yuemeng (A) yueme...@huawei.com wrote: hi,all if i want to change the /tmp folder to any other folder for spark ut use

Re: N kafka topics vs N spark Streaming

2015-06-19 Thread Akhil Das
Like this? val add_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("add").toSet) val delete_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("delete").toSet) val

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
.setMaster(local): set it to local[2] or local[*] instead. Thanks Best Regards On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski bar...@scalaric.com wrote: hi, I'm trying to run simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import

Re: connect mobile app with Spark backend

2015-06-18 Thread Akhil Das
Why not something like this: your mobile app pushes data to your webserver, which pushes the data to Kafka or Cassandra or any other database, and a Spark Streaming job runs all the time operating on the incoming data and pushes the calculated values back. This way, you don't have to start a spark

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-18 Thread Akhil Das
Which version of Spark, and what is your data source? For some reason, your processing delay is exceeding the batch duration. And it's strange that you are not seeing any scheduling delay. Thanks Best Regards On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang chyfan...@gmail.com wrote: Hi, I have a

Re: Machine Learning on GraphX

2015-06-18 Thread Akhil Das
This might give you a good start http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html it's a bit old though. Thanks Best Regards On Thu, Jun 18, 2015 at 2:33 PM, texol t.rebo...@gmail.com wrote: Hi, I'm new to GraphX and I'd like to use Machine Learning

Re: Web UI vs History Server Bugs

2015-06-18 Thread Akhil Das
You could possibly open up a JIRA and shoot an email to the dev list. Thanks Best Regards On Wed, Jun 17, 2015 at 11:40 PM, jcai jonathon@yale.edu wrote: Hi, I am running this on Spark stand-alone mode. I find that when I examine the web UI, a couple bugs arise: 1. There is a

Re: Shuffle produces one huge partition

2015-06-17 Thread Akhil Das
Can you try repartitioning the RDD after creating the (K, V) pairs? And also, while calling rdd1.join(rdd2, ...), pass the number-of-partitions argument too. Thanks Best Regards On Wed, Jun 17, 2015 at 12:15 PM, Al M alasdair.mcbr...@gmail.com wrote: I have 2 RDDs I want to Join. We will call them RDD A and RDD
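
A short sketch of both suggestions (rddA and rddB are assumed to already be (K, V) pair RDDs; 200 is an arbitrary partition count):

    val a = rddA.repartition(200)     // spread the keys over more partitions first
    val joined = a.join(rddB, 200)    // join(other, numPartitions) controls the output partitioning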

Re: ClassNotFound exception from closure

2015-06-17 Thread Akhil Das
Not sure why spark-submit isn't shipping your project jar (maybe try with --jars). You can also do sc.addJar(/path/to/your/project.jar); it should solve it. Thanks Best Regards On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, running into a pretty

Re: Spark History Server pointing to S3

2015-06-16 Thread Akhil Das
Not quite sure, but try pointing spark.history.fs.logDirectory to your S3 location. Thanks Best Regards On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: In Spark website it’s stated in the View After the Fact section (

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
of reprocess some files. Thanks Best Regards On Mon, Jun 15, 2015 at 2:49 PM, Haopu Wang hw...@qilinsoft.com wrote: Akhil, thank you for the response. I want to explore more. If the application is just monitoring a HDFS folder and output the word count of each streaming batch into also HDFS

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs? Thanks Best Regards On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote: My Mesos cluster has 1.5 CPU and 17GB free. If I set: conf.set(spark.mesos.coarse, true); conf.set(spark.cores.max, 1); in the SparkConf

Re: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Akhil Das
wrote: Hi Akhil, Thanks for your response. I have 10 cores which sums of all my 3 machines and I am having 5-10 receivers. I have tried to test the processed number of records per second by varying number of receivers. If I am having 10 receivers (i.e. one receiver for each core), then I

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote: Thanks very much, Akhil. That solved my problem. Best, Rex On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das ak

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Akhil Das
What's in your executor's (that .tgz file) conf/spark-defaults.conf file? Thanks Best Regards On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden gog...@gmail.com wrote: I'm loading these settings from a properties file: spark.executor.memory=256M spark.cores.max=1 spark.shuffle.consolidateFiles=true

Re: How to set up a Spark Client node?

2015-06-15 Thread Akhil Das
I'm assuming by spark-client you mean the spark driver program. In that case you can pick any machine (say Node 7), create your driver program in it and use spark-submit to submit it to the cluster or if you create the SparkContext within your driver program (specifying all the properties) then

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this? val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0))) val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x)) val joined_data = huge_data.join(gender_data) joined_data.take(1000) It's Scala btw, the Python API should

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Akhil Das
Have a look here https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Experts I have played with Spark several weeks, after some time testing, a reduce operation of DataFrame cost 40s on a

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Akhil Das
I think it should be fine, that's the whole point of check-pointing (in case of driver failure etc). Thanks Best Regards On Mon, Jun 15, 2015 at 6:54 AM, Haopu Wang hw...@qilinsoft.com wrote: Hi, can someone help to confirm the behavior? Thank you! -Original Message- From: Haopu

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Akhil Das
Yes, if you have enabled WAL and checkpointing then after the store, you can simply delete the SQS Messages from your receiver. Thanks Best Regards On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia mici...@gmail.com wrote: I would like to have a Spark Streaming SQS Receiver which deletes SQS

Re: How to split log data into different files according to severity

2015-06-13 Thread Akhil Das
Are you looking for something like filter? See a similar example here https://spark.apache.org/examples.html Thanks Best Regards On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang bill...@gmail.com wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is
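
A minimal filter-based sketch of that split (the severity strings and output paths are assumptions):

    val logs = sc.textFile("hdfs:///logs/input")
    Seq("ERROR", "WARN", "INFO").foreach { level =>
      logs.filter(_.contains(level))
          .saveAsTextFile(s"hdfs:///logs/by-severity/$level")
    }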

Re: Are there ways to restrict what parameters users can set for a Spark job?

2015-06-13 Thread Akhil Das
I think the straight answer would be No, but yes you can actually hardcode these parameters if you want. Look in the SparkContext.scala https://github.com/apache/spark/blob/master/core%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2FSparkContext.scala#L364 where all these properties are being

Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-12 Thread Akhil Das
Looks like your Spark is not able to pick up the HADOOP_CONF. To fix this, you can actually add jets3t-0.9.0.jar to the classpath (sc.addJar(/path/to/jets3t-0.9.0.jar)). Thanks Best Regards On Thu, Jun 11, 2015 at 6:44 PM, shahab shahab.mok...@gmail.com wrote: Hi, I tried to read a csv file

Re: spark stream and spark sql with data warehouse

2015-06-12 Thread Akhil Das
This is a good start, if you haven't read it already http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations Thanks Best Regards On Thu, Jun 11, 2015 at 8:17 PM, 唐思成 jadetan...@qq.com wrote: Hi all: We are trying to using spark to do some real

Re: Limit Spark Shuffle Disk Usage

2015-06-12 Thread Akhil Das
You can disable shuffle spill (spark.shuffle.spill http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior) if you have enough memory to hold that much data. I believe adding more resources would be your only choice. Thanks Best Regards On Thu, Jun 11, 2015 at 9:46 PM, Al M

Re: --jars not working?

2015-06-12 Thread Akhil Das
You can verify if the jars are shipped properly by looking at the driver UI (running on 4040) Environment tab. Thanks Best Regards On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney jcove...@gmail.com wrote: Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos 0.19.0)...

Re: Optimizing Streaming from Websphere MQ

2015-06-12 Thread Akhil Das
How many cores are you allocating for your job? And how many receivers do you have? It would be good if you could post your custom receiver code; it will help people understand it better and shed some light. Thanks Best Regards On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
4040 is your driver port; you need to run some application. Log in to your cluster, start a spark-shell and try accessing 4040. Thanks Best Regards On Wed, Jun 10, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote: Hi, I am using Spark 1.3.1 standalone and I have a problem where my cluster is

Re: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Akhil Das
RDDs are immutable; why not join two DStreams? Not sure, but you can also try something like this: kvDstream.foreachRDD(rdd => { val file = ssc.sparkContext.textFile("/sigmoid/") val kvFile = file.map(x => (x.split(",")(0), x)) rdd.join(kvFile) }) Thanks Best Regards On
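
An alternative sketch using transform, which re-reads the side data every batch so a periodically refreshed file gets picked up (the path and delimiter are assumptions):

    val joined = kvDstream.transform { rdd =>
      val kvFile = ssc.sparkContext.textFile("/sigmoid/")
                      .map(x => (x.split(",")(0), x))
      rdd.join(kvFile)
    }
    joined.print()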

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
Opening your 4040 manually or SSH tunnelling (ssh -L 4040:127.0.0.1:4040 master-ip, and then open localhost:4040 in a browser) will work for you then. Thanks Best Regards On Wed, Jun 10, 2015 at 5:10 PM, mrm ma...@skimlinks.com wrote: Hi Akhil, Thanks for your reply! I still cannot see port

Re: Spark's Scala shell killing itself

2015-06-10 Thread Akhil Das
Maybe you should update your Spark version to the latest one. Thanks Best Regards On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: Hi, I have configured Spark to run on YARN. Whenever I start spark shell using 'spark-shell' command, it

Re: How to use Apache spark mllib Model output in C++ component

2015-06-10 Thread Akhil Das
Hopefully Swig http://www.swig.org/index.php and JNA https://github.com/twall/jna/ might help for accessing C++ libraries from Java. Thanks Best Regards On Wed, Jun 10, 2015 at 11:50 AM, mahesht mahesh.s.tup...@gmail.com wrote: There is a C++ component which uses some model which we want to replace

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-10 Thread Akhil Das
standalone mode. Any ideas? Thanks Dong Lei *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Tuesday, June 9, 2015 4:46 PM *To:* Dong Lei *Cc:* user@spark.apache.org *Subject:* Re: ClassNotDefException when using spark-submit with multiple jars and files located

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Akhil Das
Delete the checkpoint directory, you might have modified your driver program. Thanks Best Regards On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam ashnigamt...@gmail.com wrote: Hi, If checkpoint data is already present in HDFS, driver fails to load as it is performing lookup on previous

Re: Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread Akhil Das
Looks like the libphp version is 5.6 now; which version of Spark are you using? Thanks Best Regards On Thu, Jun 11, 2015 at 3:46 AM, barmaley o...@solver.com wrote: Launching using spark-ec2 script results in: Setting up ganglia RSYNC'ing /etc/ganglia to slaves... ... Shutting down GANGLIA
