Re: Connecting to nodes on cluster

2015-07-09 Thread Akhil Das
On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt wrote: > Hi, > > We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two > days I have been trying to connect my laptop to the server using Spark, > but it's been unsuccessful. > The server contains data that needs to be cleaned and anal

Re: Job completed successfully without processing anything

2015-07-09 Thread Akhil Das
Looks like a configuration problem with your Spark setup. Are you running the driver on a different network? Can you try a simple program from spark-shell and make sure your setup is proper? (like sc.parallelize(1 to 1000).collect()) Thanks Best Regards On Thu, Jul 9, 2015 at 1:02 AM, ÐΞ€ρ@Ҝ (๏̯͡

Re: S3 vs HDFS

2015-07-09 Thread Akhil Das
S3 will obviously add network lag, whereas with HDFS, if your Spark executors are running on the same DataNodes, you have the advantage of data locality. Thanks Best Regards On Thu, Jul 9, 2015 at 12:05 PM, Brandon White wrote: > Are there any significant performance differences between reading

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Akhil Das
Did you try sc.stop() and creating a new one? Thanks Best Regards On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole wrote: > I am using spark 1.4.1rc1 with default hive settings > > Thanks > - Terry > > Hi All, > > I'd like to use the hive context in spark shell, I need to recreate the > hive meta d
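As a rough sketch of that restart in spark-shell (the master and app name are assumptions; whether Derby actually releases its metastore lock this way is the open question in the thread):

    sc.stop()  // stop the running context first
    val conf = new org.apache.spark.SparkConf().setMaster("local[*]").setAppName("repl")
    val sc = new org.apache.spark.SparkContext(conf)  // in the REPL this new val shadows the old sc
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)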

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
Have a look at http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala, create two threads and call thread1.start(), thread2.start(). Thanks Best Regards On Wed, Jul 8, 2015 at 1:06 PM, Ashish Dutt wrote: > Thanks for your reply Akhil. > How do you multithread it? > &g
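Concretely, a minimal sketch of the threaded approach (sqlContext is assumed to be in scope, and the S3 paths are placeholders based on the original question):

    val t1 = new Thread(new Runnable {
      def run(): Unit = { val table1 = sqlContext.jsonFile("s3://textfiles/table1") } // placeholder path
    })
    val t2 = new Thread(new Runnable {
      def run(): Unit = { val table2 = sqlContext.jsonFile("s3://textfiles/table2") }
    })
    t1.start(); t2.start()   // both jsonFile scans now run concurrently
    t1.join(); t2.join()     // wait for both loads to finish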

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
What's the point of creating them in parallel? You can multi-thread it and run it in parallel though. Thanks Best Regards On Wed, Jul 8, 2015 at 5:34 AM, Brandon White wrote: > Say I have a spark job that looks like following: > > def loadTable1() { > val table1 = sqlContext.jsonFile(s"s3://textfi

Re: unable to bring up cluster with ec2 script

2015-07-08 Thread Akhil Das
It's showing connection refused; for some reason it was not able to connect to the machine. It's either the machine's start-up time or an issue with the security group. Thanks Best Regards On Wed, Jul 8, 2015 at 2:04 AM, Pagliari, Roberto wrote: > > > > > I'm following the tutorial about Apache Spark o

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
instances having successively run on the same > machine? > > -- > Henri Maxime Demoulin > > 2015-07-07 4:10 GMT-04:00 Akhil Das : > >> Strange. What are you having in $SPARK_MASTER_IP? It may happen that it >> is not able to bind to the given ip but again it should be in th

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Akhil Das
Here's a simplified example: SparkConf conf = new SparkConf().setAppName("Sigmoid").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List<String> user = new ArrayList<String>(); user.add("Jack"); user.add("Jill"); user.add("Ja

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-07 Thread Akhil Das
Can you try adding sc.stop() at the end of your program? Looks like it's having a hard time closing off the SparkContext. Thanks Best Regards On Tue, Jul 7, 2015 at 4:08 PM, Hafsa Asif wrote: > Hi, > > I run the following simple Java spark standalone app with maven command > "exec:java -Dexec.mainClas

Re: How to debug java.io.OptionalDataException issues

2015-07-07 Thread Akhil Das
Did you try Kryo? Wrap everything with Kryo and see if you are still hitting the exception. (At least you would see a different exception stack.) Thanks Best Regards On Tue, Jul 7, 2015 at 6:05 AM, Yana Kadiyska wrote: > Hi folks, suffering from a pretty strange issue: > > Is there a way to tel

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
Strange. What are you having in $SPARK_MASTER_IP? It may happen that it is not able to bind to the given IP, but again it should be in the logs. Thanks Best Regards On Tue, Jul 7, 2015 at 12:54 AM, maxdml wrote: > Hi, > > I've been compiling spark 1.4.0 with SBT, from the source tarball availabl

Re: Unable to start spark-sql

2015-07-06 Thread Akhil Das
It's complaining about a missing JDBC driver. Add it to your driver classpath like: ./bin/spark-sql --driver-class-path /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar Thanks Best Regards On Mon, Jul 6, 2015 at 11:42 AM, sandeep vura wrote: > Hi Sparkers, > > I am unable to start spark

Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
You can also set these in the spark-env.sh file: export SPARK_WORKER_DIR="/mnt/spark/" export SPARK_LOCAL_DIR="/mnt/spark/" Thanks Best Regards On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das wrote: > While the job is running, just look in the directory and see what's the

Re: java.io.IOException: No space left on device--regd.

2015-07-05 Thread Akhil Das
While the job is running, just look in the directory and see what's the root cause of it (is it the logs? is it the shuffle? etc.). Here are a few configuration options which you can try: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spar
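For reference, a sketch of those knobs set on a SparkConf (property names as in the Spark 1.x configuration docs; the values are assumptions to tune):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.spill", "false")                      // no spill files, but risks OOM
      .set("spark.executor.logs.rolling.strategy", "time")      // rotate executor logs over time
      .set("spark.executor.logs.rolling.time.interval", "daily")
      .set("spark.executor.logs.rolling.maxRetainedFiles", "3") // cap how many old logs are kept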

Re: cores and resource management

2015-07-05 Thread Akhil Das
Try with *spark.cores.max*; executor cores is usually used when you run in YARN mode. Thanks Best Regards On Mon, Jul 6, 2015 at 1:22 AM, nizang wrote: > hi, > > We're running spark 1.4.0 on ec2, with 6 machines, 4 cores each. We're > trying to run an application on a number of total-executo

Re: Spark got stuck with BlockManager after computing connected components using GraphX

2015-07-05 Thread Akhil Das
If you don't want those logs flooding your screen, you can disable them simply with: import org.apache.log4j.{Level, Logger} > Logger.getLogger("org").setLevel(Level.OFF) > Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Sun, Jul 5, 2015 at 7:27 PM, Hellen wrot

Re: JDBC Streams

2015-07-05 Thread Akhil Das
If you want a long-running application, then go with Spark Streaming (which kind of blocks your resources). On the other hand, if you use the job server then you can actually use the resources (CPUs) for other jobs when your DB job is not using them. Thanks Best Regards On Sun, Jul 5, 2015 at 5:2

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Akhil Das
Looks like it spent more time writing/transferring the 40GB of shuffle when you used Kryo. And surprisingly, JavaSerializer has 700MB of shuffle? Thanks Best Regards On Sun, Jul 5, 2015 at 12:01 PM, Gavin Liu wrote: > Hi, > > I am using TeraSort benchmark from ehiggs's branch > https://github.

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing. Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker wrote: > In the driver when running spark-submit with --master yarn-client > > On Fri, Jul 3, 2015 at 10:23 AM Akhil Das > wrote: > >> Where does it return null?

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it return null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker wrote: > Hi, > > We have an application that requires a username/password to be entered > from

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com <1106944...@qq.com> wrote: > Hi all, >Anyone build spark 1.4 source code for sparkR with maven/sbt, what's > the command? using

Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a JIRA; not sure if this PR (SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers wrote: > i am surprised this is all

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With the binary I think it might not be possible, although you can download the sources, remove this function which initializes the SQLContext, and then build it. Tha

Re: Convert CSV lines to List of Objects

2015-07-02 Thread Akhil Das
Have a look at sc.wholeTextFiles; you can use it to read the whole CSV contents into the value, then split it on \n, add the lines to a list, and return it. *sc.wholeTextFiles:* Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported
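A minimal sketch of that approach (the directory path and the comma delimiter are assumptions):

    val files = sc.wholeTextFiles("hdfs:///data/csv/")         // RDD of (path, fileContent) pairs
    val rows = files.flatMap { case (_, content) =>
      content.split("\n").map(line => line.split(",").toList) // one List per CSV line
    }
    rows.take(5).foreach(println)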

Re: Making Unpersist Lazy

2015-07-02 Thread Akhil Das
RDDs which are no longer required will be removed from memory by Spark itself (which you can consider as lazy?). Thanks Best Regards On Wed, Jul 1, 2015 at 7:48 PM, Jem Tucker wrote: > Hi, > > The current behavior of rdd.unpersist() appears to not be lazily executed > and therefore must be pla

Re: Can I do Joins across Event Streams ?

2015-07-01 Thread Akhil Das
Have a look at the window and updateStateByKey operations. If you are looking for something more sophisticated, you can persist these streams in intermediate storage (say for x duration) like HBase or Cassandra or any other DB and do global aggregations with these. Thanks Bes
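For the in-stream route, a sketch of updateStateByKey keeping a running count per key (the DStream name "events" and its element type are assumptions; it also needs ssc.checkpoint(...) enabled):

    val counts = events.map(e => (e, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))  // fold the new batch into the running total
    }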

Re: Run multiple Spark jobs concurrently

2015-07-01 Thread Akhil Das
Have a look at https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Wed, Jul 1, 2015 at 12:01 PM, Nirmal Fernando wrote: > Hi All, > > Is there any additional configs that we have to do to perform $subject? > > -- > > Thanks & regards, > Nirmal > > Associate Technical
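The gist of that page, as a sketch: enable the fair scheduler and submit jobs from separate threads (the pool name is hypothetical):

    val conf = new org.apache.spark.SparkConf().set("spark.scheduler.mode", "FAIR")
    val sc = new org.apache.spark.SparkContext(conf)
    sc.setLocalProperty("spark.scheduler.pool", "pool1")  // jobs from this thread share this pool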

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Akhil Das
It says: Caused by: java.net.ConnectException: Connection refused: slave2/...:54845 Could you look in the executor logs (stderr on slave2) and see what made it shut down? Since you are doing a join there's a high possibility of OOM etc. Thanks Best Regards On Wed, Jul 1, 2015 at 10:20 AM, Pooj

Re: Spark run errors on Raspberry Pi

2015-07-01 Thread Akhil Das
Now I'm having a strange urge to try this on KBOX :/ Thanks Best Regards On Wed, Jul 1, 2015 at 9:10 AM, Exie wrote: > FWIW, I had some trouble getting Spark running on a Pi. > > My core problem was using snappy for compression as it comes as a pre-made > bi

Re: output folder structure not getting commited and remains as _temporary

2015-07-01 Thread Akhil Das
Looks like a jar conflict to me. java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData.getBytesWritten() You are having multiple versions of the same jars in the classpath. Thanks Best Regards On Wed, Jul 1, 2015 at 6:58 AM, nkd wrote: > I am running a spa

Re: Issues in reading a CSV file from local file system using spark-shell

2015-06-30 Thread Akhil Das
Since its a windows machine, you are very likely to be hitting this one https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Jul 1, 2015 at 12:36 AM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > Hi, > > I'm running Spark 1.4.0 without Hadoop. I'm using the bina

Re: Difference between spark-defaults.conf and SparkConf.set

2015-06-30 Thread Akhil Das
.addJar works for me when I run it as a stand-alone application (without using spark-submit) Thanks Best Regards On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska wrote: > Hi folks, running into a pretty strange issue: > > I'm setting > spark.executor.extraClassPath > spark.driver.extraClassPath >

Re: Error while installing spark

2015-06-30 Thread Akhil Das
How much memory do you have on that machine? You can increase the heap space with *export _JAVA_OPTIONS="-Xmx2g"* Thanks Best Regards On Tue, Jun 30, 2015 at 11:00 AM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > Facing following error message while performing sbt/sbt assembly > > > Error

Re: Checkpoint support?

2015-06-30 Thread Akhil Das
Have a look at the StageInfo class, it has method stageFailed. You could make use of it. I don't understand the point of restarting the entire application. Thanks Best Regards On Tue, Jun 30, 2015 at

Re: s3 bucket access/read file

2015-06-30 Thread Akhil Das
Try this way: val data = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@mybucket/temp/") Thanks Best Regards On Mon, Jun 29, 2015 at 11:59 PM, didi wrote: > Hi > > *Cant read text file from s3 to create RDD > * > > after setting the configuration > val hadoopConf=sparkContext.hadoopConfiguration; >

Re: got "java.lang.reflect.UndeclaredThrowableException" when running multiply APPs in spark

2015-06-30 Thread Akhil Das
This: Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Could happen for many reasons, one of them could be because of insufficient memory. Are you running all 20 apps on the same node? How are you submitting the apps? (with spark-submit?). I see you have driv

Re: problem for submitting job

2015-06-29 Thread Akhil Das
Cool. On 29 Jun 2015 21:10, "郭谦" wrote: > Akhil Das, > > You give me a new idea to solve the problem. > > Vova provides me a way to solve the problem just before > > Vova Shelgunov > > Sample code for submitting job from any other java app, e.g. servlet: &

Re: problem for submitting job

2015-06-29 Thread Akhil Das
You can create a SparkContext in your program and run it as a standalone application without using spark-submit. Here's something that will get you started: //Create SparkContext val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("Test") .s
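The snippet above is cut off; a complete minimal version looks roughly like this (the master URL comes from the thread, while the jar path and the trivial job are placeholders):

    val sconf = new org.apache.spark.SparkConf()
      .setMaster("spark://spark-ak-master:7077")
      .setAppName("Test")
      .setJars(Seq("/path/to/your-app.jar"))   // ship the application jar to the workers
    val sc = new org.apache.spark.SparkContext(sconf)
    println(sc.parallelize(1 to 100).sum())    // trivial job to verify the setup
    sc.stop()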

Re: spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-28 Thread Akhil Das
Here's a bunch of configuration for that https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior Thanks Best Regards On Fri, Jun 26, 2015 at 10:37 PM, igor.berman wrote: > Hi, > wanted to get some advice regarding tunning spark application > I see for some of the tasks many log

Re: Master dies after program finishes normally

2015-06-28 Thread Akhil Das
Which version of spark are you using? You can try changing the heap size manually by *export _JAVA_OPTIONS="-Xmx5g" * Thanks Best Regards On Fri, Jun 26, 2015 at 7:52 PM, Yifan LI wrote: > Hi, > > I just encountered the same problem, when I run a PageRank program which > has lots of stages(iter

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Akhil Das
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream( jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ); Here: jssc => JavaStreamingContext String.class => Key, Value classes

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
Are those provided Spark libraries compatible with Scala 2.11? Thanks Best Regards On Fri, Jun 26, 2015 at 4:48 PM, Srikanth wrote: > Thanks Akhil for checking this out. Here is my build.sbt. > > name := "Weblog Analysis" > > version := "1.0" > >

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
It's a Scala version conflict; can you paste your build.sbt file? Thanks Best Regards On Fri, Jun 26, 2015 at 7:05 AM, stati wrote: > Hello, > > When I run a spark job with spark-submit it fails with below exception for > code line >/*val webLogDF = webLogRec.toDF().select("ip", "date",

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring to here? Spark can connect with almost all the databases out there (you just need to pass the Input/Output Format classes, or there are a bunch of connectors available). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust wrote: > Hi,

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Akhil Das
You just need to set your HADOOP_HOME, which appears to be null in the stack trace. If you don't have winutils.exe, you can download it and put it there. Thanks Best Regards On Thu, Jun 25, 2015 at 11:30 PM, Ashic M

Re: Performing sc.paralleize (..) in workers not in the driver program

2015-06-26 Thread Akhil Das
Why do you want to do that? Thanks Best Regards On Thu, Jun 25, 2015 at 10:16 PM, shahab wrote: > Hi, > > Apparently, the sc.parallelize(..) operation is performed in the driver > program, not in the workers! Is it possible to do this in the worker process > for the sake of scalability? > > best > /Sh

Re: Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-26 Thread Akhil Das
Try to add them in the SPARK_CLASSPATH in your conf/spark-env.sh file Thanks Best Regards On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang wrote: > I am trying to run the Spark example code HBaseTest from command line > using spark-submit instead run-example, in that case, I can learn more how > to ru

Re:

2015-06-25 Thread Akhil Das
e input size is 512.0 MB (hadoop) / 4159106. Can this be reduced to 64 > MB so as to increase the number of tasks. Similar to split size that > increases the number of mappers in Hadoop M/R. > > On Thu, Jun 25, 2015 at 12:06 AM, Akhil Das > wrote: > >> Look in the tuning

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
o, > > Is this the official R Package? > > It is written : "*NOTE: The API from the upcoming Spark release (1.4) > will not have the same API as described here. *" > > Thanks, > > JC > ᐧ > > 2015-06-25 10:55 GMT+02:00 Akhil Das : > >> Here yo

Re: Killing Long running tasks (stragglers)

2015-06-25 Thread Akhil Das
That totally depends on the way you extract the data. It will be helpful if you can paste your code so that we will understand it better. Thanks Best Regards On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell wrote: > Hello - > > I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers

Re: Akka failures: Driver Disassociated

2015-06-25 Thread Akhil Das
Can you look in the worker logs and see what's going on? It may happen that you ran out of disk space etc. Thanks Best Regards On Thu, Jun 25, 2015 at 12:08 PM, barmaley wrote: > I'm running Spark 1.3.1 on AWS... Having long-running application (spark > context) which accepts and completes jobs

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best Regards On Thu, Jun 25, 2015 at 12:39 PM, 1106944...@qq.com <1106944...@qq.com> wrote: > Hi all >I have installed spark1.4, then want to use sparkR . assueme spark > master ip= node1, how to start sparkR ? and summit job t

Re: Can Spark1.4 work with CDH4.6

2015-06-25 Thread Akhil Das
guava dependency but the error > does go away this way > > On Wed, Jun 24, 2015 at 10:04 AM, Akhil Das > wrote: > >> Can you try to add those jars in the SPARK_CLASSPATH and give it a try? >> >> Thanks >> Best Regards >> >> On Wed, Jun 24, 2015

Re:

2015-06-25 Thread Akhil Das
AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Its taking an hour and on Hadoop it takes 1h 30m, is there a way to make > it run faster ? > > On Wed, Jun 24, 2015 at 11:39 AM, Akhil Das > wrote: > >> Cool. :) >> On 24 Jun 2015 23:44, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: >> >>>

Re:

2015-06-24 Thread Akhil Das
st(TransportRequestHandler.java:124) >>>> at >>>> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97) >>>> at >>>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandl

Re: kafka spark streaming with mesos

2015-06-24 Thread Akhil Das
A screenshot of your framework running would also be helpful. How many cores does it have? Did you try running it in coarse-grained mode? Try to add these to the conf: sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "2") Thanks Best Regards On Wed, Jun 24, 2015 at 1

Re:

2015-06-24 Thread Akhil Das
Can you look a bit more in the error logs? It could be getting killed because of OOM etc. One thing you can try is to set the spark.shuffle.blockTransferService to nio from netty. Thanks Best Regards On Wed, Jun 24, 2015 at 5:46 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have a Spark job that has 7 stages. T

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Akhil Das
Can you try to add those jars in the SPARK_CLASSPATH and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska wrote: > Hi folks, I have been using Spark against an external Metastore service > which runs Hive with Cdh 4.6 > > In Spark 1.2, I was able to successfull

Re: Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread Akhil Das
Depending on the size of the memory you have, you could allocate 60-80% of it for the Spark worker process. The DataNode doesn't require too much memory. On 23 Jun 2015 21:26, "maxdml" wrote: > I'm wondering if there is a real benefit for splitting my memory in two for > the datanode/work
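As a sketch in conf/spark-env.sh, assuming (hypothetically) a 16GB node:

    export SPARK_WORKER_MEMORY=12g   # ~75% of the node for Spark executors
    export SPARK_WORKER_CORES=6      # illustrative; leave headroom for the DataNode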

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-23 Thread Akhil Das
spark.locality.wait On 22 Jun 2015 19:21, "ayan guha" wrote: I have a basic qs: how spark assigns partition to an executor? Does it respect data locality? Does this behaviour depend on cluster manager, ie yarn vs standalone? On 22 Jun 2015 22:45, "Akhil Das" wrote: > Option 1 s

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPU per job; then you can easily accommodate your third job also. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła wrote: > I have set up a small standalone cluster: 5 nodes, every node has 5GB of > memory and 8 cores. As you can see, node doe
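Concretely, for the 5-node x 8-core cluster described, a sketch (the cap of 16 cores is illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.cores.max", "16")  // this app never takes more than 16 cores cluster-wide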

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
, 2015 at 12:41 PM, Akhil Das > wrote: > >> Why don't you do a normal .saveAsTextFiles? >> >> Thanks >> Best Regards >> >> On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla >> wrote: >> >>> Thanx for reply !! >>> >>&g

Re: What does [Stage 0:> (0 + 2) / 2] mean on the console

2015-06-23 Thread Akhil Das
Well, you could say that (the Stage information) is an ASCII representation of the web UI (running on port 4040). Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are left with 2 threads to process ((0 + 2) <-- This 2 is your 2 threads.) And the

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Akhil Das
Maybe while producing the messages, you can make each one a KeyedMessage with the timestamp as the key, and on the consumer end you can easily identify the key (which will be the timestamp) from the message. If the network is fast enough, then I think there would only be a small millisecond lag. Thanks Best R
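On the producer side, a sketch with the Kafka 0.8 producer API (the broker address, topic name, and payload are placeholders):

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")             // placeholder broker
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    // the key carries the send timestamp; the consumer reads it back from the message key
    producer.send(new KeyedMessage[String, String](
      "events", System.currentTimeMillis.toString, "payload"))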

Re: Programming with java on spark

2015-06-23 Thread Akhil Das
Did you happen to try this? JavaPairRDD<LongWritable, Text> hadoopFile = sc.hadoopFile( "/sigmoid", DataInputFormat.class, LongWritable.class, Text.class) Thanks Best Regards On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 wrote: > Hello, everyone! I'm new in spark. I have already written programs i

Re: Spark job fails silently

2015-06-23 Thread Akhil Das
Looks like a hostname conflict to me. 15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface eth0) 15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Can you paste y

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Why don't you do a normal .saveAsTextFiles? Thanks Best Regards On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla wrote: > Thanx for reply !! > > YES , Either it should write on any machine of cluster or Can you please > help me ... that how to do this . Previously i was using writing using

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Yes. Thanks Best Regards On Mon, Jun 22, 2015 at 8:33 PM, Murthy Chelankuri wrote: > I have more than one jar. can we set sc.addJar multiple times with each > dependent jar ? > > On Mon, Jun 22, 2015 at 8:30 PM, Akhil Das > wrote: > >> Try sc.addJar instead of setJ

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Try sc.addJar instead of setJars Thanks Best Regards On Mon, Jun 22, 2015 at 8:24 PM, Murthy Chelankuri wrote: > I have been using the spark from the last 6 months with the version 1.2.0. > > I am trying to migrate to the 1.3.0 but the same problem i have written is > not wokring. > > Its givin

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
Option 1 should be fine; Option 2 would be bound a lot by the network as the data increases over time. Thanks Best Regards On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni wrote: > Hi All, > > What is the best way to install a Spark cluster alongside a Hadoop > cluster? Any recommendation for below

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng wrote: > Thanks a lot, Akhil > > I saw this mail thread before, but still do not understand h

Re: Serializer not switching

2015-06-22 Thread Akhil Das
How are you submitting the application? Could you paste the code that you are running? Thanks Best Regards On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay wrote: > I am trying to run a function on every line of a parquet file. The > function is in an object. When I run the program, I get an exce

Re: s3 - Can't make directory for path

2015-06-22 Thread Akhil Das
Could you elaborate a bit more? What do you mean by setting up a standalone server? And what is leading you to those exceptions? Thanks Best Regards On Mon, Jun 22, 2015 at 2:22 AM, nizang wrote: > hi, > > I'm trying to setup a standalone server, and in one of my tests, I got the > following except

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
You can use fileStream for that; look at the XMLInputFormat of Mahout. It should give you the full XML object as one record (as opposed to an XM

Re: memory needed for each executor

2015-06-22 Thread Akhil Das
Totally depends on the use-case that you are solving with Spark, for instance there was some discussion around the same which you could read over here http://apache-spark-user-list.1001560.n3.nabble.com/How-does-one-decide-no-of-executors-cores-memory-allocation-td23326.html Thanks Best Regards O

Re: JavaDStream read and write rdbms

2015-06-22 Thread Akhil Das
It's pretty straightforward; this would get you started: http://stackoverflow.com/questions/24896233/how-to-save-apache-spark-schema-output-in-mysql-database Thanks Best Regards On Mon, Jun 22, 2015 at 12:39 PM, Manohar753 < manohar.re...@happiestminds.com> wrote: > > Hi Team, > > How to split a
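The usual shape of that answer, sketched here assuming a DStream[String] named stream (the JDBC URL, table, and credentials are placeholders):

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { rows =>
        // open one connection per partition, not one per record
        val conn = java.sql.DriverManager.getConnection(
          "jdbc:mysql://dbhost:3306/mydb", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO events(line) VALUES (?)")
        rows.foreach { row => stmt.setString(1, row); stmt.executeUpdate() }
        conn.close()
      }
    }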

Re: Spark Titan

2015-06-21 Thread Akhil Das
Have a look at http://s3.thinkaurelius.com/docs/titan/0.5.0/titan-io-format.html You could use those Input/Output formats with newAPIHadoopRDD api call. Thanks Best Regards On Sun, Jun 21, 2015 at 8:50 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > How to connect TItan dat

Re: Local spark jars not being detected

2015-06-20 Thread Akhil Das
Not sure, but try removing the "provided" scope, or create a lib directory in the project home and put that jar there. On 20 Jun 2015 18:08, "Ritesh Kumar Singh" wrote: > Hi, > > I'm using IntelliJ ide for my spark project. > I've compiled spark 1.3.0 for scala 2.11.4 and here's the one of the > co

Re: N kafka topics vs N spark Streaming

2015-06-19 Thread Akhil Das
Like this? val add_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("add").toSet) val delete_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("delete").toSet) val upd

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-19 Thread Akhil Das
One workaround would be to remove/move the files from the input directory once you have processed them. Thanks Best Regards On Fri, Jun 19, 2015 at 5:48 AM, Haopu Wang wrote: > Akhil, > > > > From my test, I can see the files in the last batch will always be > reprocessed up

Re: how to change /tmp folder for spark ut use sbt

2015-06-19 Thread Akhil Das
You can try setting these properties: .set("spark.local.dir","/mnt/spark/") .set("java.io.tmpdir","/mnt/spark/") Thanks Best Regards On Fri, Jun 19, 2015 at 8:28 AM, yuemeng (A) wrote: > hi,all > > if i want to change the /tmp folder to any other folder for spark ut use > sbt,how can

Re: Build spark application into uber jar

2015-06-19 Thread Akhil Das
This is how i used to build a assembly jar with sbt: Your build.sbt file would look like this: *import AssemblyKeys._* *assemblySettings* *name := "FirstScala"* *version := "1.0"* *scalaVersion := "2.10.4"* *libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"* *libraryDepend

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
.setMaster("local") set it to local[2] or local[*] Thanks Best Regards On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski wrote: > hi, > I'm trying to run simple kafka spark streaming example over spark-shell: > > sc.stop > import org.apache.spark.SparkConf > import org.apache.spark.SparkCont

Re: connect mobile app with Spark backend

2015-06-18 Thread Akhil Das
Why not something like: your mobile app pushes data to your web server, which pushes the data to Kafka or Cassandra or any other database, and a Spark Streaming job running all the time operates on the incoming data and pushes the calculated values back. This way, you don't have to start a spark

Re: understanding on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-18 Thread Akhil Das
Which version of Spark? And what is your data source? For some reason, your processing delay is exceeding the batch duration. And it's strange that you are not seeing any scheduling delay. Thanks Best Regards On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang wrote: > Hi, > > > > I have a spark streamin

Re: Web UI vs History Server Bugs

2015-06-18 Thread Akhil Das
You could possibly open up a JIRA and shoot an email to the dev list. Thanks Best Regards On Wed, Jun 17, 2015 at 11:40 PM, jcai wrote: > Hi, > > I am running this on Spark stand-alone mode. I find that when I examine the > web UI, a couple bugs arise: > > 1. There is a discrepancy between the

Re: Machine Learning on GraphX

2015-06-18 Thread Akhil Das
This might give you a good start: http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html (it's a bit old though). Thanks Best Regards On Thu, Jun 18, 2015 at 2:33 PM, texol wrote: > Hi, > > I'm new to GraphX and I'd like to use Machine Learning algorithms on top of >

Re: ClassNotFound exception from closure

2015-06-17 Thread Akhil Das
Not sure why spark-submit isn't shipping your project jar (maybe try with --jars). You can also do sc.addJar(/path/to/your/project.jar); it should solve it. Thanks Best Regards On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska wrote: > Hi folks, > > running into a pretty strange issue -- I have

Re: Shuffle produces one huge partition

2015-06-17 Thread Akhil Das
Can you try repartitioning the RDD after creating the K,V pairs? And also, while calling rdd1.join(rdd2, ...), pass the # partitions argument too. Thanks Best Regards On Wed, Jun 17, 2015 at 12:15 PM, Al M wrote: > I have 2 RDDs I want to Join. We will call them RDD A and RDD B. RDD A > has > 1 billio
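A sketch of both suggestions together (the partition count of 200 is an assumption to tune):

    import org.apache.spark.HashPartitioner
    val a = rddA.partitionBy(new HashPartitioner(200))  // spread the keys before joining
    val joined = a.join(rddB, 200)                      // and/or pass numPartitions to join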

Re: how to maintain the offset for spark streaming if HDFS is the source

2015-06-16 Thread Akhil Das
With Spark Streaming, when you use fileStream or textFileStream it will always pick up the files from the directory whose timestamp is > the current timestamp, and if you have checkpointing enabled then it will start from the last read timestamp. So you may not need to maintain the line number. Tha
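A sketch of the checkpointed setup (the app name, paths, and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("FileStreamApp")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint("hdfs:///checkpoints/app")        // the last-read timestamp survives restarts
    val lines = ssc.textFileStream("hdfs:///input")  // picks up files newer than the last batch
    lines.count().print()
    ssc.start(); ssc.awaitTermination()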

Re: Spark History Server pointing to S3

2015-06-16 Thread Akhil Das
Not quite sure, but try pointing spark.history.fs.logDirectory to your S3 Thanks Best Regards On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera < gianluca.privite...@studio.unibo.it> wrote: > In Spark website it’s stated in the View After the Fact section ( > https://spark.apache.org/docs/

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Akhil Das
What's in your executor's (that .tgz file) conf/spark-defaults.conf file? Thanks Best Regards On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden wrote: > I'm loading these settings from a properties file: > spark.executor.memory=256M > spark.cores.max=1 > spark.shuffle.consolidateFiles=true > spark.task.c

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs? Thanks Best Regards On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden wrote: > My Mesos cluster has 1.5 CPU and 17GB free. If I set: > > conf.set("spark.mesos.coarse", "true"); > conf.set("spark.cores.max", "1"); > > in the SparkConf object

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X wrote: > Thanks very much, Akhil. > > That solved my problem. > > Best, > Rex > > > > On Mon, Jun 15, 2015

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
lieve will kind of reprocess some files. Thanks Best Regards On Mon, Jun 15, 2015 at 2:49 PM, Haopu Wang wrote: > Akhil, thank you for the response. I want to explore more. > > > > If the application is just monitoring a HDFS folder and output the word > count of ea

Re: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Akhil Das
com> wrote: > Hi Akhil, > > Thanks for your response. > > I have 10 cores which sums of all my 3 machines and I am having 5-10 > receivers. > > I have tried to test the processed number of records per second by varying > number of receivers. > > If I am having 1

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this? val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0))) val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x)) val joined_data = huge_data.join(gender_data) joined_data.take(1000) It's Scala btw; Python a

Re: How to set up a Spark Client node?

2015-06-15 Thread Akhil Das
I'm assuming by spark-client you mean the Spark driver program. In that case you can pick any machine (say Node 7), create your driver program on it, and use spark-submit to submit it to the cluster; or if you create the SparkContext within your driver program (specifying all the properties) then you
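For the spark-submit route, a sketch run from that client node (the master URL, class name, and jar path are placeholders):

    ./bin/spark-submit \
      --master spark://master-node:7077 \
      --class com.example.MyDriver \
      /home/user/my-driver.jar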
