Spark does not delete temporary directories

2015-05-06 Thread Taeyun Kim
Hi, after a Spark program completes, 3 temporary directories remain in the temp directory. The file names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7. And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory. The file name is like

Spark Job triggers second attempt

2015-05-06 Thread ๏̯͡๏
How can I stop Spark from triggering a second attempt when the first one fails? I do not want to wait for the second attempt to fail again, so that I can debug faster. .set("spark.yarn.maxAppAttempts", "0") OR .set("spark.yarn.maxAppAttempts", "1") is not helping. -- Deepak
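The value YARN actually accepts here is 1, not 0: one attempt means no retry, and the cluster-side yarn.resourcemanager.am.max-attempts still caps whatever is requested. A minimal sketch, assuming a Spark release whose YARN client honors spark.yarn.maxAppAttempts:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask YARN for a single application attempt (i.e. no automatic retry).
// In yarn-cluster mode the submitting client reads this setting, so it is
// safer to pass it at submit time: --conf spark.yarn.maxAppAttempts=1
val conf = new SparkConf()
  .setAppName("single-attempt-job")
  .set("spark.yarn.maxAppAttempts", "1")

val sc = new SparkContext(conf)
```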

Re: AvroFiles

2015-05-06 Thread ๏̯͡๏
Hello, this is how I read Avro data. import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import org.apache.avro.mapred.AvroKey import org.apache.avro.Schema import org.apache.hadoop.io.NullWritable import org.apache.avro.mapreduce.AvroKeyInputFormat -- Read def
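Filled out from those imports, a minimal sketch of the newAPIHadoopFile pattern for Avro container files of GenericRecord; the helper name and HDFS path are placeholders:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Read Avro container files into an RDD of GenericRecord.
def readAvro(sc: SparkContext, path: String): RDD[GenericRecord] =
  sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path)
    .map { case (key, _) => key.datum() }

// Usage (path is an assumption):
// val records = readAvro(sc, "hdfs://namenode:8020/data/events.avro")
```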

How can I force operations to complete and spool to disk

2015-05-06 Thread Steve Lewis
I am running a job where I perform a number of steps in succession. One step is a map on a JavaRDD which generates objects taking up significant memory. This is followed by a join and an aggregateByKey. The problem is that the system is getting OutOfMemoryErrors - most tasks work but

Spark updateStateByKey fails with class leak when using case classes - resend

2015-05-06 Thread rsearle
I created a simple Spark streaming program using updateStateByKey. The domain is represented by case classes for clarity, type safety, etc. The Spark job continuously loads new classes, which are removed by GC to maintain a relatively constant level of active class instances. The total memory

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm using the default settings. Jianshi On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva wrote: > Hi, > > Can you please share your compression etc settings, which you are using. > > Thanks, > Twinkle > > On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang > wrote: > >> I'm facing this error in Spar

Spark 1.3.1 and Parquet Partitions

2015-05-06 Thread vasuki
Spark 1.3.1 - I have a Parquet file on HDFS partitioned by a string, looking like this: /dataset/city=London/data.parquet /dataset/city=NewYork/data.parquet /dataset/city=Paris/data.parquet …. I am trying to load it using sqlContext.parquetFile( "hdfs://some ip:8029/datas
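For reference, a sketch of how such a layout is usually loaded, assuming a Spark build with Parquet partition discovery, an existing SparkContext `sc`, and a placeholder namenode address; the city=... directory names surface as an ordinary `city` column:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Point at the dataset root, not at an individual city=... directory.
val df = sqlContext.parquetFile("hdfs://namenode:8020/dataset")

// The discovered partition column can then be used for filtering/pruning.
df.filter(df("city") === "London").show()
```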

Re: How update counter in cassandra

2015-05-06 Thread Tathagata Das
This may help. http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio wrote: > I have a Counter family colums in Cassandra. I want update this counters > with a aplication in sp

YARN mode startup takes too long (10+ secs)

2015-05-06 Thread Taeyun Kim
Hi, I'm running a Spark application with YARN-client or YARN-cluster mode. But it seems to take too long to start up. It takes 10+ seconds to initialize the Spark context. Is this normal? Or can it be optimized? The environment is as follows: - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6) -

How update counter in cassandra

2015-05-06 Thread Sergio Jiménez Barrio
I have a counter column family in Cassandra. I want to update these counters with an application in Spark Streaming. How can I update Cassandra counters with Spark? Thanks.

RE: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-06 Thread java8964
It looks like you have data in these 24 partitions, or more. How many unique names are in your data set? Enlarging the shuffle partitions only makes sense if you have large partition groups in your data. What you described looks like either your dataset has data in these 24 partitions, or you have s

Re:

2015-05-06 Thread Shixiong Zhu
You are using Scala 2.11 with 2.10 libraries. You can change "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" to "org.apache.spark" %% "spark-streaming" % "1.3.1" And sbt will use the corresponding libraries according to your Scala version. Best Regards, Shixiong Zhu 2015-05-06 16:21 GM
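A minimal build.sbt sketch of that suggestion, assuming Scala 2.11; `%%` appends the Scala binary version suffix to the artifact name so it always matches `scalaVersion`:

```scala
// build.sbt
scalaVersion := "2.11.6"

// %% resolves to spark-streaming_2.11 here; with scalaVersion 2.10.x it
// would resolve to spark-streaming_2.10 instead.
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.1"
```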

Re: Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-06 Thread Michael Armbrust
I don't think that works: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration On Tue, May 5, 2015 at 6:25 PM, nitinkak001 wrote: > I am running hive queries from HiveContext, for which we need a > hive-site.xml. > > Is it possible to replace it with hive-config.xml? I tri

How to specify Worker and Master LOG folders?

2015-05-06 Thread Ulanov, Alexander
Hi, How can I specify Worker and Master LOG folders? If I set "SPARK_WORKER_DIR" in spark-env, it only affects Executor logs and the shuffling folder. But Worker and Master logs still go to the default location: starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/sbin/../logs

RE: Reading large files

2015-05-06 Thread Ulanov, Alexander
With binaryRecords it loads the file as fixed-length records into an RDD. With binaryFiles it provides an input stream, so it is up to you whether you want to load everything into memory, though the documentation does not suggest using this function for large files. From: Vijayasarathy Kannan [mailto:kvi...@vt.edu] Sent:

RE: DataFrame DSL documentation

2015-05-06 Thread Ulanov, Alexander
+1 I had to browse spark-catalyst sources to find what is supported: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala Alexander From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Wednesday, May 06, 2015 11:42 AM To: spark

[no subject]

2015-05-06 Thread anshu shukla
Exception with sample testing in IntelliJ IDE: Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class at akka.util.Collections$EmptyImmutableSeq$.<init>(Collections.scala:15) at akka.util.Collections$EmptyImmutableSeq$.<init>(Collections.scala) at akka.japi.Util$.

Re: Error in SparkSQL/Scala IDE

2015-05-06 Thread Michael Armbrust
Hi Iulian, The relevant code is in ScalaReflection, and it would be awesome if you could suggest how to fix this more generally. Specifically, this code is also broken when

Re: Reading large files

2015-05-06 Thread Vijayasarathy Kannan
Thanks. In both cases, does the driver need to have enough memory to contain the entire file? How do both these functions work when, for example, the binary file is 4G and the available driver memory is less? On Wed, May 6, 2015 at 1:54 PM, Ulanov, Alexander wrote: > SparkContext has two methods

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Jonathan Coveney
Can you check your local and remote logs? 2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com>: > This problem happen in Spark 1.3.1. It happen when two jobs are running > simultaneously each in its own Spark Context. > > > > I don’t remember seeing this bug in Spar

Re: Multilabel Classification in spark

2015-05-06 Thread Peter Garbers
Thanks for all your feedback! I'm a little new to scala/spark so hopefully you'll bear with me while I try to explain how I plan to go about this and give me advice as to why this may or may not work. My terminology may be a little incorrect as well. Any feedback would be greatly appreciated. Even

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
This problem happens in Spark 1.3.1. It happens when two jobs are running simultaneously, each in its own SparkContext. I don’t remember seeing this bug in Spark 1.2.0. Is it a new bug introduced in Spark 1.3.1? Ningjun From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, May 06, 2015 11:

question about the TFIDF.

2015-05-06 Thread Dan Dong
Hi, All, when I try to follow the documentation about TF-IDF from: http://spark.apache.org/docs/latest/mllib-feature-extraction.html val conf = new SparkConf().setAppName("TFIDF") val sc=new SparkContext(conf) val documents=sc.textFile("hdfs://cluster-test-1:9000/user/ubuntu/textExampl
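For comparison, a condensed sketch of the pattern from that feature-extraction page, with a placeholder input path and an existing SparkContext `sc`:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Each document is a sequence of terms (here: whitespace-split lines).
val documents: RDD[Seq[String]] =
  sc.textFile("hdfs://namenode:9000/path/to/documents").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache() // reused by both IDF.fit and idf.transform below

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
```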

spark-shell breaks for scala 2.11 (with yarn)?

2015-05-06 Thread Koert Kuipers
Hello all, I built Spark 1.3.1 (for CDH 5.3 with YARN) twice: for Scala 2.10 and Scala 2.11. I am running on a secure cluster. The deployment configs are identical. I can launch jobs just fine on both the Scala 2.10 and Scala 2.11 versions. spark-shell works on the Scala 2.10 version, but not on

Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
Ohhh, it's filled with a lot of trouble (Scala mainly)... Please, can anyone point me to sample topology-type code that has multistep, modular levels of logic with parallelisation controlled at each level? I am not finding any demo with such a sample on git. On Wed, May 6, 2015

DataFrame DSL documentation

2015-05-06 Thread Gerard Maas
Hi, Where could I find good documentation on the DataFrame DSL? I'm struggling trying to combine selects, groupBy and aggregations. A language definition would also help. I perused these resources, but still have some gaps in my understanding and things are not doing what I'd expect: https://spa

RE: Reading large files

2015-05-06 Thread Ulanov, Alexander
SparkContext has two methods for reading binary files: binaryFiles (reads multiple binary files into an RDD) and binaryRecords (reads fixed-length records of a single binary file into an RDD). For example, I have a big binary file split into logical parts, so I can use “binaryFiles”. The possible problem is
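A short sketch contrasting the two calls (paths and the 512-byte record length are illustrative assumptions; `sc` is an existing SparkContext):

```scala
// Fixed-length records of one large binary file, one Array[Byte] per record:
val records = sc.binaryRecords("hdfs://namenode:8020/data/big.bin", recordLength = 512)

// Whole files as (path, PortableDataStream) pairs; the stream is opened lazily
// on the executors, so nothing is pulled onto the driver:
val files = sc.binaryFiles("hdfs://namenode:8020/data/dir")
val headers = files.map { case (path, stream) =>
  val in = stream.open()          // DataInputStream, read on the executor
  val buf = new Array[Byte](16)   // e.g. just a file header
  in.readFully(buf)
  in.close()
  (path, buf)
}
```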

Reading large files

2015-05-06 Thread Vijayasarathy Kannan
Hi, Is there a way to read a large file in a parallel/distributed way? I have a single large binary file which I currently read on the driver program and then distribute to the executors (using groupBy(), etc.). I want to know if there's a way to make the executors each read a specific/unique port

Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
I've worked around this by dropping the jars into a directory (spark_jars) and then creating a spark-defaults.conf file in conf containing this: spark.driver.extraClassPath /home/mj/apps/spark_jars/* -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ho

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-05-06 Thread Ravi Mody
Whoops I just saw this thread, it got caught in my spam filter. Thanks for looking into this Xiangrui and Sean. The implicit situation does seem fairly complicated to me. The cost function (not including the regularization term) is affected both by the number of ratings and by the number of user/p

Stop Cluster Mode Running App

2015-05-06 Thread James King
I submitted a Spark application in cluster mode, and now every time I stop the cluster and restart it, the job resumes execution. I even killed a daemon called DriverWrapper; it stops the app, but it resumes again. How can I stop this application from running?

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Ted Yu
Which release of Spark are you using? Thanks > On May 6, 2015, at 8:03 AM, Wang, Ningjun (LNG-NPV) > wrote: > > I run a job on spark standalone cluster and got the exception below > > Here is the line of code that cause problem > > val myRdd: RDD[(String, String, String)] = … // RDD of

java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
I ran a job on a Spark standalone cluster and got the exception below. Here is the line of code that causes the problem: val myRdd: RDD[(String, String, String)] = ... // RDD of (docid, cattegory, path) myRdd.persist(StorageLevel.MEMORY_AND_DISK_SER) val cats: Array[String] = myRdd.map(t => t._2).disti

how to use rdd.countApprox

2015-05-06 Thread Du Li
I have to count RDDs in a Spark Streaming app. When the data gets large, count() becomes expensive. Did anybody have experience using countApprox()? How accurate/reliable is it? The documentation is pretty modest. Suppose the timeout parameter is in milliseconds. Can I retrieve the count value by
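A hedged sketch of how countApprox is typically consumed (the timeout, confidence and `rdd` are assumptions): it returns a PartialResult immediately, initialValue yields the bounded estimate once the timeout expires, and getFinalValue() blocks until the exact count is known.

```scala
// rdd: any RDD, e.g. one obtained inside foreachRDD in a streaming app.
val partial = rdd.countApprox(timeout = 2000L, confidence = 0.95)

val estimate = partial.initialValue // BoundedDouble, ready after ~2 seconds
println(s"count ~ ${estimate.mean} (${estimate.confidence * 100}% interval: " +
  s"[${estimate.low}, ${estimate.high}])")

// If you can afford to wait, this blocks for the exact value:
// val exact = partial.getFinalValue()
```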

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-06 Thread Imran Rashid
oh yeah, I think I remember we discussed this a while back ... sorry I forgot the details. If you know you don't have a graph, did you try setting "spark.kryo.referenceTracking" to false? I'm also confused on how you could hit this with a few million objects. Are you serializing them one at a ti
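For reference, a sketch of where those knobs live; this is a diagnostic configuration, not a confirmed fix for this report:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Safe to disable only when serialized objects contain no shared/cyclic references:
  .set("spark.kryo.referenceTracking", "false")
```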

Spark Kryo read method never called before reducing

2015-05-06 Thread amine_901
We use Spark to build graphs of events after querying cassandra. We use mapPartition for both aggregating events and building two graphs per partition. Graphs are returned as Tuple2 as follows : val nodes = events.mapPartitions(part => { var nodeLeft : Node = null var nodeRight

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-06 Thread Emre Sevinc
Imran, Gerard, Indeed your suggestions were correct and it helped me. Thank you for your replies. -- Emre On Tue, May 5, 2015 at 4:24 PM, Imran Rashid wrote: > Gerard is totally correct -- to expand a little more, I think what you > want to do is a solrInputDocumentJavaRDD.foreachPartition, in

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think for executor distribution, normally in YARN mode, the RM tries its best to evenly distribute containers if you don't explicitly specify preferred hosts. For standalone mode, one node normally has only one executor, so executor distribution is not a big problem. The problem of data skew

Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
Thank you for your response; however, I'm afraid I still can't get it to work. This is my code: jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar' spark_config = SparkConf().setMaster('local').setAppName('data_frame_test').set("spark.jars", jar_path) sc = SparkContext(conf=

Re: No space left on device??

2015-05-06 Thread Yifan LI
Yes, you are right. For now I have to say the workload/executor is distributed evenly… so, like you said, it is difficult to improve the situation. However, do you have any idea how to make a *skewed* data/executor distribution? Best, Yifan LI > On 06 May 2015, at 15:13, Saisai Shao wrote:

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think it depends on your workload and executor distribution: if your workload is evenly distributed without any big data skew, and executors are evenly distributed on each node, the storage usage of each node is nearly the same. Spark itself cannot rebalance the storage overhead as you mentioned

Re: No space left on device??

2015-05-06 Thread Yifan LI
Thanks, Shao. :-) I am wondering if Spark will rebalance the storage overhead at runtime… since there is still some available space on other nodes. Best, Yifan LI > On 06 May 2015, at 14:57, Saisai Shao wrote: > > I think you could configure multiple disks through spark.local.dir, def

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think you could configure multiple disks through spark.local.dir; the default is /tmp. Anyway, if your intermediate data is larger than the available disk space, you will still meet this issue. spark.local.dir (default: /tmp) - Directory to use for "scratch" space in Spark, including map output files and RDDs that get stor
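A minimal sketch of that setting with assumed mount points; a comma-separated list spreads scratch space across several disks (on YARN the node manager's local dirs take precedence and this value is ignored):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
```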

union eatch streaming window into a static rdd and use the static rdd periodicity

2015-05-06 Thread lisendong
the pseudo code : object myApp { var myStaticRDD: RDD[Int] def main() { ... //init streaming context, and get two DStream (streamA and streamB) from two hdfs path //complex transformation using the two DStream val new_stream = streamA.transformWith(StreamB, (a, b, t) => { a.join(

No space left on device??

2015-05-06 Thread Yifan LI
Hi, I am running my GraphX application on Spark, but it failed since there is an error on one executor node (on which the available HDFS space is small): “no space left on device”. I can understand why it happened, because my vertex(-attribute) rdd was becoming bigger and bigger during computat

Re: Receiver Fault Tolerance

2015-05-06 Thread James King
Many thanks all, your responses have been very helpful. Cheers On Wed, May 6, 2015 at 2:14 PM, ayan guha wrote: > > https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics > > > On Wed, May 6, 2015 at 10:09 PM, James King wrote: > >> In the O'reilly book

Re: Receiver Fault Tolerance

2015-05-06 Thread ayan guha
https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics On Wed, May 6, 2015 at 10:09 PM, James King wrote: > In the O'reilly book Learning Spark Chapter 10 section 24/7 Operation > > It talks about 'Receiver Fault Tolerance' > > I'm unsure of what a Recei

Re: Receiver Fault Tolerance

2015-05-06 Thread Tathagata Das
Incorrect. The receiver runs in an executor just like any other task. In cluster mode, the driver runs in a worker; however, it launches executors in OTHER workers in the cluster. It's those executors running in other workers that run tasks, and also the receivers. On Wed, May 6, 2015 at 5:09

RE: Receiver Fault Tolerance

2015-05-06 Thread Evo Eftimov
This is about the Kafka Receiver, IF you are using Spark Streaming. PS: that book is now behind the curve in quite a few areas since the release of 1.3.1 – read the documentation and forums. From: James King [mailto:jakwebin...@gmail.com] Sent: Wednesday, May 6, 2015 1:09 PM To: user Subjec

Receiver Fault Tolerance

2015-05-06 Thread James King
In the O'Reilly book Learning Spark, Chapter 10, section 24/7 Operation, it talks about 'Receiver Fault Tolerance'. I'm unsure of what a Receiver is here; from reading, it sounds like when you submit an application to the cluster in cluster mode, i.e. *--deploy-mode cluster*, the driver program will run

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread twinkle sachdeva
Hi, can you please share the compression and related settings you are using? Thanks, Twinkle On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang wrote: > I'm facing this error in Spark 1.3.1 > > https://issues.apache.org/jira/browse/SPARK-4105 > > Anyone knows what's the workaround? Change the com

Kryo read method never called before reducing

2015-05-06 Thread Florian Hussonnois
Hi, We use Spark to build graphs of events after querying cassandra. We use mapPartition for both aggregating events and building two graphs per partition. Graphs are returned as Tuple2 as follows : val nodes = events.mapPartitions(part => { var nodeLeft : Node = null var nodeR

FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm facing this error in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-4105 Does anyone know what the workaround is? Change the compression codec for shuffle output? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
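One workaround commonly tried for this kind of shuffle-fetch failure is switching the I/O compression codec away from the default snappy and re-running; a sketch of that diagnostic step (not a confirmed fix for SPARK-4105):

```scala
import org.apache.spark.SparkConf

// Applies to shuffle output, spills and broadcast blocks; valid values in this
// era of Spark are "snappy" (default), "lz4" and "lzf".
val conf = new SparkConf()
  .set("spark.io.compression.codec", "lz4")
```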

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
The “abstraction level” of Storm, or shall we call it architecture, is effectively pipelines of nodes/agents – pipelines are one of the standard parallel programming patterns which you can use on multicore CPUs as well as distributed systems – the chaps from Storm simply implemented it as a reusa

spark jobs input/output http request

2015-05-06 Thread Saurabh Gupta
I have set up an AWS EMR-based cluster, where I am able to run my Spark queries quite OK. The next part of my work is to run queries coming in from a web client and show the results there. As far as I know, the Spark queries can only be run from my EMR cluster, and they don't return instantly with any

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, I agree with Evo, Spark works at a different abstraction level than Storm, and there is not a direct translation from Storm topologies to Spark Streaming jobs. I think something remotely close is the notion of lineage of DStreams or RDDs, which is similar to a logical plan of an engine like A

Re: Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread ayan guha
What does your MyClass look like? I was experimenting with the Row class in Python, and apparently partitionBy automatically takes the first column as the key. However, I am not sure how you can access part of an object without deserializing it (either explicitly or Spark doing it for you). On Wed, May 6,

RE: Map one RDD into two RDD

2015-05-06 Thread Evo Eftimov
RDD1 = RDD.filter() RDD2 = RDD.filter() From: Bill Q [mailto:bill.q@gmail.com] Sent: Tuesday, May 5, 2015 10:42 PM To: user@spark.apache.org Subject: Map one RDD into two RDD Hi all, I have a large RDD that I map a function to it. Based on the nature of each record in the input RDD,
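Spelled out, the idea is to cache the parent RDD and derive both outputs with complementary filters so the input is evaluated only once; a small sketch with placeholder data and predicate:

```scala
val records = sc.parallelize(1 to 10)
records.cache()                       // parent evaluated once, reused by both filters

val isTypeA = (x: Int) => x % 2 == 0  // placeholder for the real record test
val rddA = records.filter(isTypeA)
val rddB = records.filter(x => !isTypeA(x))
```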

large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-06 Thread Michal Haris
Just wanted to check if somebody has seen similar behaviour or knows what we might be doing wrong. We have a relatively complex spark application which processes half a terabyte of data at various stages. We have profiled it in several ways and everything seems to point to one place where 90% of th

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
What is called Bolt in Storm is essentially a combination of [Transformation/Action and DStream RDD] in Spark – so to achieve a higher parallelism for specific Transformation/Action on specific Dstream RDD simply repartition it to the required number of partitions which directly relates to the

Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
Thanks a lot, Juan, that was a great post. One more thing, if you can: is there any demo/blog telling how to configure or create a topology of different types? I mean, how can we decide the pipelining model in Spark as done in Storm, for https://storm.apache.org/documentation/images/topology.p

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, You can use the method repartition from DStream (for the Scala API) or JavaDStream (for the Java API): def repartition(numPartitions: Int): DStream[T] – Return a new DStream with an increased or dec
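A short sketch of that call in context, with an assumed socket source; repartitioning the DStream raises the number of tasks available to the per-record work that follows:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

// 16 partitions => up to 16 parallel tasks for the flatMap that follows.
val words = lines.repartition(16).flatMap(_.split(" "))
words.print()

ssc.start()
ssc.awaitTermination()
```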

Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread Night Wolf
Hi, If I have an RDD[MyClass] and I want to partition it by the hash code of MyClass for performance reasons, is there any way to do this without converting it into a PairRDD RDD[(K,V)] and calling partitionBy??? Mapping it to a tuple2 seems like a waste of space/computation. It looks like the P

Re: Error in SparkSQL/Scala IDE

2015-05-06 Thread Iulian Dragoș
Hi, I just saw this question. I posted my solution to this Stack Overflow question. Scala reflection can take a classloader when creating a mirror (universe.runtimeMirror(loader)). I can have a look,

Re: SparkR: filter() function?

2015-05-06 Thread himaeda
Has this issue re-appeared? I posted this on SO before I knew about this list... http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working Also I don't have access to the issues o

Re: MLlib libsvm issues with data

2015-05-06 Thread doyere
Hi all, after doing some tests I finally solved it. I am writing here for other people who hit this question. Here's an example of the data format error I faced: 0 0:0 1:0 2:1 1 1:1 3:2 The data for 0:0 and 1:0/1:1 is the reason for the ArrayIndexOutOfBoundsException. If someone who faced the same question just dele

Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
But the main problem is how to increase the level of parallelism for any particular bolt logic. Suppose I want this type of topology: https://storm.apache.org/documentation/images/topology.png How can we manage it? On Wed, May 6, 2015 at 1:36 PM, ayan guha wrote: > Every transformation on a

The explanation of input text format using LDA in Spark

2015-05-06 Thread Cui xp
Hi all, after I read the example code using LDA in Spark, I found that the input text in the code is a matrix. The format of the text is as follows: 1 2 6 0 2 3 1 1 0 0 3 1 3 0 1 3 0 0 2 0 0 1 1 4 1 0 0 4 9 0 1 2 0 2 1 0 3 0 0 5 0 2 3 9 3 1 1 9 3 0 2 0 0 1 3 4 2 0 3 4 5 1 1 1 4 0 2 1 0 3 0 0 5 0 2 2
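That matrix is a document-term count matrix: one row per document, one column per vocabulary term. A sketch of how the MLlib LDA example consumes it, with a placeholder path and K:

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// One line per document; split into term counts and pair with a document id.
val data = sc.textFile("hdfs://namenode:8020/data/sample_lda_data.txt")
val parsed = data.map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
val corpus = parsed.zipWithIndex.map { case (vec, id) => (id, vec) }.cache()

val ldaModel = new LDA().setK(3).run(corpus)
println(s"Learned ${ldaModel.k} topics over a ${ldaModel.vocabSize}-term vocabulary")
```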

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Rendy Bambang Junior
Because using Spark Streaming looks a lot simpler. What's the difference between Camus and Kafka streaming for this case? Why does Camus excel? Rendy On Wed, May 6, 2015 at 2:15 PM, Saisai Shao wrote: > Also Kafka has a Hadoop consumer API for doing such things, please refer > to http://kafka.ap

Re: Creating topology in spark streaming

2015-05-06 Thread ayan guha
Every transformation on a DStream will create another DStream. You may want to take a look at foreachRDD. Also, kindly share your code so people can help better. On 6 May 2015 17:54, "anshu shukla" wrote: > Please help guys, Even After going through all the examples given i have > not understood
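To make the Storm analogy concrete: each "bolt" becomes a transformation on the previous DStream, and the terminal "bolt" becomes an output operation such as foreachRDD. A rough sketch with placeholder logic and an assumed socket source:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))
val tweets = ssc.socketTextStream("localhost", 9999)                      // spout-like source

val parsed   = tweets.map(_.toLowerCase)                                  // "bolt 1"
val hashTags = parsed.flatMap(_.split(" ").filter(_.startsWith("#")))     // "bolt 2"

hashTags.foreachRDD { rdd =>                                              // "bolt 3": sink
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```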

Creating topology in spark streaming

2015-05-06 Thread anshu shukla
Please help, guys. Even after going through all the examples given, I have not understood how to pass the DStreams from one bolt/logic to another (without writing to HDFS etc.), just like the emit function in Storm. Suppose I have a topology with 3 bolts (say): *BOLT1 (parse the tweets and emit tweet u

Re: "com.datastax.spark" % "spark-streaming_2.10" % "1.1.0" in my build.sbt ??

2015-05-06 Thread Akhil Das
I don't see a spark-streaming dependency at com.datastax.spark, but it does have a kafka-streaming dependency. Thanks Best Regards On Tue, May 5, 2015 at 12:42 AM, Eric Ho wrote: > Can I specify this in my build file ? > > Thanks. > > >

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-06 Thread Tristan Blakers
Hi Imran, I had tried setting a really huge kryo buffer size (GB), but it didn’t make any difference. In my data sets, objects are no more than 1KB each, and don’t form a graph, so I don’t think the buffer size should need to be larger than a few MB, except perhaps for reasons of efficiency? The

Re: Spark Mongodb connection

2015-05-06 Thread Akhil Das
Here's a complete example https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html Thanks Best Regards On Mon, May 4, 2015 at 12:57 PM, Yasemin Kaya wrote: > Hi! > > I am new at Spark and I want to begin Spark with simple wordCount example > in Java. But I want to give my input from

Re: Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-06 Thread Akhil Das
You have an issue with your cluster setup. Can you paste your conf/spark-env.sh and the conf/slaves files here? The reason why your job is running fine is because you set the master inside the job as local[*] which runs in local mode (not in standalone cluster mode). Thanks Best Regards On Mon

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Saisai Shao
Also Kafka has a Hadoop consumer API for doing such things, please refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . : > why not try https://github.com/linkedin/camus - camus is kafka to HDFS > pipeline > > On Tue, May 5, 2015 a

Job executed with no data in Spark Straming.

2015-05-06 Thread secfree
My code: Fun01, Fun02, Fun03 all have transformations and output operations (foreachRDD). 1. Fun01 and Fun03 both executed as expected, which proves "messages" is not null or empty. 2. On the Spark application UI, I found Fun02's output stage in "Spark stages", which proves it was "executed". 3. The first line