How to enforce RDD to be cached?

2014-12-03 Thread shahab
Hi, I noticed that rdd.cache() does not happen immediately; due to Spark's lazy evaluation, it happens only at the moment you perform some map/reduce action. Is this true? If this is the case, how can I force Spark to cache immediately at its cache() statement? I need this to

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
And one more thing: the given tuples (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) are part of an RDD, not just plain tuples. graph.vertices returns me the above tuples as part of a VertexRDD. On Wed, Dec 3, 2014 at 3:43 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: This is just

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread Akhil Das
Set the spark.storage.memoryFraction flag to 1 while creating the SparkContext to utilize up to 73GB of your memory; the default is 0.6, and hence you are getting 33.6GB. Also enable RDD compression and use StorageLevel MEMORY_ONLY_SER if your data is larger than your available memory. (you could try

Re: WordCount fails in .textFile() method

2014-12-03 Thread Akhil Das
Try running it in local mode. Looks like a jar conflict/missing. SparkConf conf = new SparkConf().setAppName("JavaWordCount"); conf.set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec"); conf.setMaster("local[2]").setSparkHome(System.getenv("SPARK_HOME")); JavaSparkContext jsc = new

Re: Spark with HBase

2014-12-03 Thread Akhil Das
You could go through these to start with http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark Thanks Best Regards On Wed, Dec 3, 2014 at

Re: Low Level Kafka Consumer for Spark

2014-12-03 Thread Dibyendu Bhattacharya
Hi, yes, as Jerry mentioned, SPARK-3129 (https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature, which solves the driver failure problem. The way SPARK-3129 is designed, it solves the driver failure problem agnostic of the source of the stream (like Kafka or Flume, etc.) But

How does 2.3.4-spark differ from typesafe 2.3.4 akka?

2014-12-03 Thread dresnick
Using sbt-assembly I'm creating a fat jar that includes Spark and Akka. I've encountered this error: [error] /home/dev/.ivy2/cache/com.typesafe.akka/akka-actor_2.10/jars/akka-actor_2.10-2.3.4.jar:akka/util/ByteIterator$$anonfun$getLongPart$1.class [error]

Re: pySpark saveAsSequenceFile append overwrite

2014-12-03 Thread Akhil Das
You can't append to a file with Spark using the native saveAs* calls; it will always check if the directory already exists and, if it does, it will throw an error. People usually use Hadoop's getmerge utility to combine the output. Thanks Best Regards On Tue, Dec 2, 2014 at 8:10 PM, Csaba Ragany

getting first N messages from a Kafka topic using Spark Streaming

2014-12-03 Thread Hafiz Mujadid
Hi experts! Is there a way to read the first N messages from a Kafka stream, put them in some collection, return them to the caller for visualization purposes, and close Spark Streaming? I will be glad to hear from you and will be thankful to you. Currently I have the following code that def

Spark SQL UDF returning a list?

2014-12-03 Thread Jerry Raj
Hi, Can a UDF return a list of values that can be used in a WHERE clause? Something like: sqlCtx.registerFunction("myudf", { Array(1, 2, 3) }) val sql = "select doc_id, doc_value from doc_table where doc_id in myudf()" This does not work: Exception in thread "main"

Re: Spark with HBase

2014-12-03 Thread Ted Yu
Which HBase release are you running? If it is 0.98, take a look at: https://issues.apache.org/jira/browse/SPARK-1297 Thanks On Dec 2, 2014, at 10:21 PM, Jai jaidishhari...@gmail.com wrote: I am trying to use Apache Spark with a pseudo-distributed Hadoop HBase cluster and I am looking for

Does count() evaluate all mapped functions?

2014-12-03 Thread Tobias Pfeiffer
Hi, I have an RDD and a function that should be called on every item in this RDD once (say it updates an external database). So far, I used rdd.map(myFunction).count() or rdd.mapPartitions(iter => iter.map(myFunction)) but I am wondering if this always triggers the call of myFunction in both
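For reference, a per-element side effect is usually expressed with an action that is guaranteed to evaluate the function on every element. A minimal sketch, reusing the rdd and myFunction names from the question (the per-partition connection handling is only an assumption):

```scala
// Run myFunction once per element via an action rather than a transformation.
rdd.foreach(item => myFunction(item))

// Or per partition, e.g. if myFunction needs a shared resource such as a DB connection.
rdd.foreachPartition { iter =>
  // open the connection here (assumption: one connection per partition)
  iter.foreach(myFunction)
  // close the connection here
}
```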

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
This is just an example but if my graph is big, there will be so many tuples to handle. I cannot manually do val a: RDD[(Int, Double)] = sc.parallelize(List( (1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 0.0))) for all the vertices in the graph. What should I do in that

Re: WordCount fails in .textFile() method

2014-12-03 Thread Akhil Das
Dump your classpath; it looks like you have multiple versions of the Guava jars in the classpath. Thanks Best Regards On Wed, Dec 3, 2014 at 2:30 PM, Rahul Swaminathan rahul.swaminat...@duke.edu wrote: I've tried that and the same error occurs. Do you have any other suggestions? Thanks! Rahul

Re: WordCount fails in .textFile() method

2014-12-03 Thread Rahul Swaminathan
I've tried that and the same error occurs. Do you have any other suggestions? Thanks! Rahul From: Akhil Das ak...@sigmoidanalytics.com Date: Wednesday, December 3, 2014 at 3:55 AM To: Rahul Swaminathan rahul.swaminat...@duke.edu

Using sparkSQL to convert a collection of Python dictionaries of dictionaries to a schema RDD

2014-12-03 Thread sahanbull
Hi Guys, I am trying to use SparkSQL to convert an RDD to SchemaRDD so that I can save it in parquet format. A record in my RDD has the following format: RDD1 { field1:5, field2: 'string', field3: {'a':1, 'c':2} } I am using field3 to represent a sparse vector and it can have keys:

Re: Spark SQL table Join, one task is taking long

2014-12-03 Thread Cheng Lian
Hey Venkat, this behavior seems reasonable. Judging from the table names, I guess DAgents should be the fact table and ContactDetails is the dim table. Below is an explanation of a similar query; you may read src as DAgents and src1 as ContactDetails. 0:

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
At 2014-12-03 02:13:49 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: We cannot do sc.parallelize(List(VertexRDD)), can we? There's no need to do this, because every VertexRDD is also a pair RDD: class VertexRDD[VD] extends RDD[(VertexId, VD)] You can simply use graph.vertices in

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph which returns the following on doing graph.vertices (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) I want to group all the vertices with the same attribute together, like into one RDD or something. I

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream(some_hdfs_location) to pick up new files from an HDFS location. I am seeing pretty strange behavior though. textFileStream() is not detecting new files when I move them from a location within HDFS to the location at which textFileStream() is checking for new files.

Re: getting first N messages from a Kafka topic using Spark Streaming

2014-12-03 Thread Akhil Das
You could do something like: val stream = kafkaStream.getStream().repartition(1).mapPartitions(x => x.take(10)) Here stream will have 10 elements from the kafkaStream. Thanks Best Regards On Wed, Dec 3, 2014 at 1:05 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi Experts! Is

Re: getting first N messages from a Kafka topic using Spark Streaming

2014-12-03 Thread Hafiz Mujadid
Hi Akhil! Thanks for your response. Can you please suggest how to return this sample from a function to the caller and stop Spark Streaming? Thanks
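One possible way to hand the first N records back to a caller and then shut streaming down is sketched below. All names (stream, ssc, n) are assumptions rather than code from this thread, and the stop is issued from the main thread because stopping from inside foreachRDD can block:

```scala
import scala.collection.mutable.ArrayBuffer

val n = 10
val collected = ArrayBuffer[String]()

// stream is assumed to be a DStream[String] built from the Kafka source.
stream.foreachRDD { rdd =>
  if (collected.size < n) collected ++= rdd.take(n - collected.size)
}

ssc.start()
while (collected.size < n) Thread.sleep(500)  // poll until enough records have arrived
ssc.stop(stopSparkContext = false)            // keep the SparkContext alive for the caller
collected                                     // return this sample for visualization
```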

Re: How to enforce RDD to be cached?

2014-12-03 Thread Daniel Darabos
On Wed, Dec 3, 2014 at 10:52 AM, shahab shahab.mok...@gmail.com wrote: Hi, I noticed that rdd.cache() does not happen immediately; due to Spark's lazy evaluation, it happens only at the moment you perform some map/reduce action. Is this true? Yes, this is correct. If this is

Re: Announcing Spark 1.1.1!

2014-12-03 Thread rzykov
Andrew and developers, thank you for an excellent release! It fixed almost all of our issues. Now we are migrating to Spark from a zoo of Python, Java, Hive, and Pig jobs. Our Scala/Spark jobs often failed on 1.1; Spark 1.1.1 works like a Swiss watch.

converting DStream[String] into RDD[String] in spark streaming

2014-12-03 Thread Hafiz Mujadid
Hi everyone! I want to convert a DStream[String] into an RDD[String]. I could not find how to do this. var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2) val streams =

collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
The following code is failing on the collect. If I don't do the collect and go with a JavaRDD<Document> it works fine, except I really would like to collect. At first I was getting an error regarding JDI threads and an index being 0. Then it just started locking up. I'm running the Spark context

RE: collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
I didn't realize I get a nice stack trace if not running in debug mode. Basically, I believe Document has to be serializable. But since the question has already been asked, what are the other requirements for objects within an RDD that I should be aware of? Serializable is very understandable.

Re: converting DStream[String] into RDD[String] in spark streaming

2014-12-03 Thread Sean Owen
DStream.foreachRDD gives you an RDD[String] for each interval of course. I don't think it makes sense to say a DStream can be converted into one RDD since it is a stream. The past elements are inherently not supposed to stick around for a long time, and future elements aren't known. You may
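A sketch of that per-interval pattern (the output path is an assumption; streams is the DStream[String] from the earlier post): each batch yields its own RDD[String], which can be saved and later re-read as a single RDD.

```scala
streams.foreachRDD { (rdd, time) =>
  // Save each non-empty batch under its own directory, keyed by batch time.
  if (rdd.take(1).nonEmpty)
    rdd.saveAsTextFile(s"hdfs:///tmp/kafka-batches/batch-${time.milliseconds}")
}

// Later, outside the streaming job, load everything back as one RDD[String]:
val combined = sc.textFile("hdfs:///tmp/kafka-batches/batch-*")
```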

Re: converting DStream[String] into RDD[String] in spark streaming

2014-12-03 Thread Hafiz Mujadid
Thanks, it is good to save this data to HDFS and then load it back into an RDD :)

Re: How to enforce RDD to be cached?

2014-12-03 Thread Paolo Platter
Yes, otherwise you can try: rdd.cache().count() and then run your benchmark Paolo From: Daniel Darabos daniel.dara...@lynxanalytics.com Sent: Wednesday, December 3, 2014 12:28 To: shahab shahab.mok...@gmail.com Cc: user@spark.apache.org On
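As a concrete illustration of that pattern, a minimal sketch (the input path and transformation are assumptions) that forces the cache to be populated before the timed action:

```scala
val data = sc.textFile("hdfs:///path/to/input").map(_.length).cache()
data.count()                       // an action: materializes the RDD and fills the cache

val start = System.nanoTime()
data.reduce(_ + _)                 // this action now runs against the cached partitions
val elapsedMs = (System.nanoTime() - start) / 1e6
```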

Re: Low Level Kafka Consumer for Spark

2014-12-03 Thread Luis Ángel Vicente Sánchez
My main complaint about the WAL mechanism in the new reliable Kafka receiver is that you have to enable checkpointing, and for some reason, even if spark.cleaner.ttl is set to a reasonable value, only the metadata is cleaned periodically. In my tests, using a folder in my filesystem as the

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread Sean Owen
take(1000) merely takes the first 1000 elements of an RDD. I don't imagine that's what the OP means. filter() is how you select a subset of elements to work with. Yes, this requires evaluating the predicate on all 10M elements, at least once. I don't think you could avoid this in general, right,

Failed fetch: Could not get block(s)

2014-12-03 Thread Al M
I am using Spark 1.1.1. I am seeing an issue that only appears when I run in standalone clustered mode with at least 2 workers. The workers are on separate physical machines. I am performing a simple join on 2 RDDs. After the join I run first() on the joined RDD (in Scala) to get the first

Re: MLlib Naive Bayes classifier confidence

2014-12-03 Thread Sean Owen
Probabilities won't sum to 1 since this expression doesn't incorporate the probability of the evidence, I imagine? it's constant across classes so is usually excluded. It would appear as a - log(P(evidence)) term. On Tue, Dec 2, 2014 at 10:44 AM, MariusFS marius.fete...@sien.com wrote: Are we

Spark MOOC by Berkeley and Databricks

2014-12-03 Thread Marco Didonna
Hello everybody, in case you missed it, Databricks and Berkeley have announced a free MOOC on Spark and another one on scalable machine learning using Spark. Both courses are free, but if you want a verified certificate of completion you need to donate at least $50. I did it; it's a great

Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-03 Thread Ian Wilkinson
Hi, I'm trying the Elasticsearch support for Spark (2.1.0.Beta3). In the following I provide the query (as query DSL): import org.elasticsearch.spark._ object TryES { val sparkConf = new SparkConf().setAppName("Campaigns") sparkConf.set("es.nodes", "es_cluster:9200")

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-12-03 Thread cjdc
Ideas?

Re: ALS failure with size Integer.MAX_VALUE

2014-12-03 Thread Bharath Ravi Kumar
Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And yes, I've been following the JIRA for the new ALS implementation. I'll try it out when it's ready for testing. . On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, You can try setting a

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
Hmm... 33.6GB is the sum of the memory used by the two RDDs that are cached. You're right: when I put serialized RDDs in the cache, the memory footprint for these RDDs becomes a lot smaller. Serialized memory footprint shown below: RDD Name | Storage Level | Cached Partitions | Fraction Cached

Serializing with Kryo NullPointerException - Java

2014-12-03 Thread Robin Keunen
Hi all, I am having trouble using Kryo and, being new to this kind of serialization, I am not sure where to look. Can someone please help me? :-) Here is my custom class: public class DummyClass implements KryoSerializable { private static final Logger LOGGER =

Re: what is the best way to implement mini batches?

2014-12-03 Thread Alex Minnaar
I am trying to do the same thing and also wondering what the best strategy is. Thanks From: ll duy.huynh@gmail.com Sent: Wednesday, December 3, 2014 10:28 AM To: u...@spark.incubator.apache.org Subject: what is the best way to implement mini batches?

[SQL] Wildcards in SQLContext.parquetFile?

2014-12-03 Thread Yana Kadiyska
Hi folks, I'm wondering if someone has successfully used wildcards with a parquetFile call? I saw this thread and it makes me think no? http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCACA1tWLjcF-NtXj=pqpqm3xk4aj0jitxjhmdqbojj_ojybo...@mail.gmail.com%3E I have a set

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
I think the memory calculation is correct; what I didn't account for is the memory used. I am still puzzled as to how I can successfully process the RDD in Spark.

GraphX Pregel halting condition

2014-12-03 Thread Jay Hutfles
I'm trying to implement a graph algorithm that does a form of path searching. Once a certain criterion is met on any path in the graph, I wanted to halt the rest of the iterations. But I can't see how to do that with the Pregel API, since a vertex isn't able to know the state of other arbitrary

Re: heterogeneous cluster setup

2014-12-03 Thread Victor Tso-Guillen
I don't have a great answer for you. For us, we found a common divisor, not necessarily a whole gigabyte, of the available memory of the different hardware and used that as the amount of memory per worker and scaled the number of cores accordingly so that every core in the system has the same

dockerized spark executor on mesos?

2014-12-03 Thread Dick Davies
Just wondered if anyone had managed to start spark jobs on mesos wrapped in a docker container? At present (i.e. very early testing) I'm able to submit executors to mesos via spark-submit easily enough, but they fall over as we don't have a JVM on our slaves out of the box. I can push one out

Monitoring Spark

2014-12-03 Thread Isca Harmatz
Hello, I'm running Spark on a standalone station and I'm trying to view the event log after the run is finished. I turned on the event log as the site said (spark.eventLog.enabled set to true) but I can't find the log files or get the web UI to work. Any idea on how to do this? Thanks Isca

Re: SchemaRDD + SQL , loading projection columns

2014-12-03 Thread Vishnusaran Ramaswamy
Thanks for the help.. Let me find more info on how to enable statistics in parquet. -Vishnu Michael Armbrust wrote There is not a super easy way to do what you are asking since in general parquet needs to read all the data in a column. As far as I understand it does not have indexes that

Re: How to enforce RDD to be cached?

2014-12-03 Thread shahab
Daniel and Paolo, thanks for the comments. best, /Shahab On Wed, Dec 3, 2014 at 3:12 PM, Paolo Platter paolo.plat...@agilelab.it wrote: Yes, otherwise you can try: rdd.cache().count() and then run your benchmark Paolo *Da:* Daniel Darabos daniel.dara...@lynxanalytics.com

MLLib: loading saved model

2014-12-03 Thread Sameer Tilak
Hi All, I am using LinearRegressionWithSGD and then I save the model weights and intercept. The file that contains the weights has this format: 1.204550.13560.000456.. The intercept is 0 since I am using train without setting the intercept, so it can be ignored for the moment. I would now like to initialize

Re: Using sparkSQL to convert a collection of Python dictionaries of dictionaries to a schema RDD

2014-12-03 Thread Davies Liu
inferSchema() will work better than jsonRDD() in your case: from pyspark.sql import Row srdd = sqlContext.inferSchema(rdd.map(lambda x: Row(**x))) srdd.first() Row(field1=5, field2='string', field3={'a': 1, 'c': 2}) On Wed, Dec 3, 2014 at 12:11 AM, sahanbull sa...@skimlinks.com wrote: Hi

Re: How to enforce RDD to be cached?

2014-12-03 Thread dsiegel
shahabm wrote I noticed that rdd.cache() is not happening immediately rather due to lazy feature of Spark, it is happening just at the moment you perform some map/reduce actions. Is this true? Yes, .cache() is a transformation (lazy evaluation) shahabm wrote If this is the case, how can I

Re: How can I read an avro file in HDFS in Java?

2014-12-03 Thread Prannoy
Hi, Try using sc.newAPIHadoopFile(hdfs path to your file, AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class, your Configuration) You will get the Avro related classes by importing org.apache.avro.* Thanks. On Tue, Dec 2, 2014 at 9:23 PM, leaviva [via Apache Spark User
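For a plain Avro container file (as opposed to the Avro sequence file the reply above targets), a hedged Scala sketch using AvroKeyInputFormat might look like the following; the path and the use of GenericRecord are assumptions:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val avroRdd = sc.newAPIHadoopFile(
  "hdfs:///path/to/file.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// Unwrap the Avro records from the (key, value) pairs.
val records = avroRdd.map { case (key, _) => key.datum() }
records.first()
```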

Re: WordCount fails in .textFile() method

2014-12-03 Thread Rahul Swaminathan
For others who may be having a similar problem: The error below occurs when using Yarn, which uses an earlier version of Guava compared to Spark 1.1.0. When packaging using Maven, if you put the Yarn dependency above the Spark dependency, the earlier version of guava is the one that gets

Re: [SQL] Wildcards in SQLContext.parquetFile?

2014-12-03 Thread Michael Armbrust
It won't work until this is merged: https://github.com/apache/spark/pull/3407 On Wed, Dec 3, 2014 at 9:25 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I'm wondering if someone has successfully used wildcards with a parquetFile call? I saw this thread and it makes me think no?

Re: object xxx is not a member of package com

2014-12-03 Thread Prannoy
Hi, Add the jars to the external libraries of your related project. Right click on the package or class -> Build Path -> Configure Build Path -> Java Build Path -> select the Libraries tab -> Add External Library -> browse to com.xxx.yyy.zzz._ -> OK. Clean and build your project; most probably you will be able

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread dsiegel
nsareen wrote: 1) Does the filter function scan every element saved in the RDD? If my RDD represents 10 million rows, and I want to work on only 1000 of them, is there an efficient way of filtering the subset without having to scan every element? Using .take(1000) may give a biased sample. You

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread dsiegel
Also available is .sample(), which will randomly sample your RDD with or without replacement and returns an RDD. .sample() takes a fraction, so it doesn't return an exact number of elements, e.g. rdd.sample(true, .0001, 1)

Re: Insert new data into specific partition of an RDD

2014-12-03 Thread dsiegel
I'm not sure about .union(), but at least in the case of .join(), as long as you have hash partitioned the original RDDs and persisted them, calls to .join() take advantage of already knowing which partition the keys are on, and will not repartition rdd1. val rdd1 = log.partitionBy(new
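A sketch of that co-partitioning idea (the partitioner size, the second RDD, and the element types are all assumptions):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
import org.apache.spark.rdd.RDD

val log: RDD[(String, String)] = ???     // placeholder pair RDDs
val other: RDD[(String, String)] = ???

val partitioner = new HashPartitioner(100)
val rdd1 = log.partitionBy(partitioner).persist()
val rdd2 = other.partitionBy(partitioner).persist()

// Both sides share the same partitioner, so the join avoids re-shuffling rdd1 and rdd2.
val joined = rdd1.join(rdd2)
```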

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Romi Kuntsman
About version compatibility and upgrade path - can the Java application dependencies and the Spark server be upgraded separately (i.e. will 1.1.0 library work with 1.1.1 server, and vice versa), or do they need to be upgraded together? Thanks! *Romi Kuntsman*, *Big Data Engineer*

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Andrew Or
By the Spark server do you mean the standalone Master? It is best if they are upgraded together because there have been changes to the Master in 1.1.1. Although it might just work, it's highly recommended to restart your cluster manager too. 2014-12-03 13:19 GMT-08:00 Romi Kuntsman

Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
Yeah this is currently broken for 1.1.1. I will submit a fix later today. 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu : +Andrew Actually I think this is because we haven't uploaded the Spark binaries to cloudfront / pushed the change to mesos/spark-ec2.

Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there was actually some ongoing work for this. Matei On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote: Just wondered if anyone had managed to start spark jobs on mesos wrapped in a docker

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Aaron Davidson
Because this was a maintenance release, we should not have introduced any binary backwards or forwards incompatibilities. Therefore, applications that were written and compiled against 1.1.0 should still work against a 1.1.1 cluster, and vice versa. On Wed, Dec 3, 2014 at 1:30 PM, Andrew Or

Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
This should be fixed now. Thanks for bringing this to our attention. 2014-12-03 13:31 GMT-08:00 Andrew Or and...@databricks.com: Yeah this is currently broken for 1.1.1. I will submit a fix later today. 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu: +Andrew

How to create a new SchemaRDD which is not based on original SparkPlan?

2014-12-03 Thread Tim Chou
Hi All, My question is about the lazy running mode for SchemaRDD, I guess. I know lazy mode is good; however, I still have this demand. For example, here is the first SchemaRDD, named results (select * from table where num > 1 and num < 4): results: org.apache.spark.sql.SchemaRDD = SchemaRDD[59] at RDD

Re: Alternatives to groupByKey

2014-12-03 Thread Nathan Kronenfeld
I think it would depend on the type and amount of information you're collecting. If you're just trying to collect small numbers for each window, and don't have an overwhelming number of windows, you might consider using accumulators. Just make one per value per time window, and for each data
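A rough sketch of that accumulator idea (the event RDD, its shape, and the two windows are assumptions, not code from this thread):

```scala
import org.apache.spark.rdd.RDD

val events: RDD[(Int, Double)] = ???   // placeholder: (windowId, value) pairs

// One accumulator per time window; every task adds into them while scanning its partition.
val window1Total = sc.accumulator(0.0)
val window2Total = sc.accumulator(0.0)

events.foreach { case (window, value) =>
  if (window == 1) window1Total += value
  if (window == 2) window2Total += value
}

// Read the per-window totals back on the driver once the action has finished.
val totals = Map(1 -> window1Total.value, 2 -> window2Total.value)
```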

Spark executor lost

2014-12-03 Thread S. Zhou
We are using Spark job server to submit Spark jobs (our Spark version is 0.9.1). After running the Spark job server for a while, we often see the following errors (executor lost) in the Spark job server log. As a consequence, the Spark driver (allocated inside the Spark job server) gradually loses

RE: Spark executor lost

2014-12-03 Thread Ganelin, Ilya
You want to look further up the stack (there are almost certainly other errors before this happens), and those other errors may give you a better idea of what is going on. Also, if you are running on YARN you can run yarn logs -applicationId yourAppId to get the logs from the data nodes. Sent

Re: Alternatives to groupByKey

2014-12-03 Thread Koert Kuipers
Do these requirements boil down to a need for foldLeftByKey with sorting of the values? https://issues.apache.org/jira/browse/SPARK-3655 On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu ben...@gmail.com wrote: I have a similar requirement: take top N by key. Right now I use groupByKey, but one key

single key-value pair fitting in memory

2014-12-03 Thread dsiegel
Hi, In the talk A Deeper Understanding of Spark Internals, it was mentioned that for some operators, spark can spill to disk across keys (in 1.1 - .groupByKey(), .reduceByKey(), .sortByKey()), but that as a limitation of the shuffle at that time, each single key-value pair must fit in memory.

SQL query in scala API

2014-12-03 Thread Arun Luthra
I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can
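One way to express this with PairRDDFunctions is sketched below; users is assumed to be an RDD[(String, String)] of (zip, userId) pairs as described, and collecting a Set per key can be memory-heavy when a single ZIP code has very many distinct users:

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
import org.apache.spark.rdd.RDD

val users: RDD[(String, String)] = ???   // placeholder: (zip, userId)

val perZip = users
  .aggregateByKey((0L, Set.empty[String]))(
    (acc, user) => (acc._1 + 1L, acc._2 + user),   // count every row, collect distinct user ids
    (a, b) => (a._1 + b._1, a._2 ++ b._2))         // merge partial results across partitions
  .mapValues { case (count, distinct) => (count, distinct.size) }
// perZip: RDD[(zip, (COUNT(user), COUNT(DISTINCT user)))]
```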

Re: Spark SQL UDF returning a list?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 4:31 PM, Jerry Raj jerry@gmail.com wrote: Exception in thread "main" java.lang.RuntimeException: [1.57] failure: ``('' expected but identifier myudf found I also tried returning a List of Ints, that did not work either. Is there a way to write a UDF that returns

Re: Spark executor lost

2014-12-03 Thread Ted Yu
bq. to get the logs from the data nodes Minor correction: the logs are collected from machines where node managers run. Cheers On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: You want to look further up the stack (there are almost certainly other errors

How can a function running on a slave access the Executor

2014-12-03 Thread Steve Lewis
I have been working on balancing work across a number of partitions and find it would be useful to access information about the current execution environment, much of which (like the executor ID) would be available if there were a way to get the current executor or the Hadoop TaskAttempt context - does any

Re: textFileStream() issue?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 5:31 PM, Bahubali Jain bahub...@gmail.com wrote: I am trying to use textFileStream(some_hdfs_location) to pick new files from a HDFS location.I am seeing a pretty strange behavior though. textFileStream() is not detecting new files when I move them from a location

Re: Alternatives to groupByKey

2014-12-03 Thread Xuefeng Wu
Looks good. My concern about foldLeftByKey is that it looks like it breaks consistency with foldLeft on RDD and aggregateByKey on PairRDD. Yours, Xuefeng Wu 吴雪峰 敬上 On December 4, 2014, at 7:47 AM, Koert Kuipers ko...@tresata.com wrote: foldLeftByKey

Re: Best way to have some singleton per worker

2014-12-03 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 2:59 AM, Ashic Mahtab as...@live.com wrote: I've been doing this with foreachPartition (i.e. have the parameters for creating the singleton outside the loop, do a foreachPartition, create the instance, loop over entries in the partition, close the partition), but

Re: dockerized spark executor on mesos?

2014-12-03 Thread Kyle Ellrott
I'd like to tag a question onto this; has anybody attempted to deploy spark under Kubernetes https://github.com/googlecloudplatform/kubernetes or Kubernetes mesos ( https://github.com/mesosphere/kubernetes-mesos ) . On Wednesday, December 3, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: I'd

reading dynamoDB with spark

2014-12-03 Thread Tyson
Hi, I am trying to read data from a DynamoDB table with Spark, but after I run this code I get an error message like the one below. I use Spark 1.1.1 and emr-core-1.1.jar, emr-ddb-hive-1.0.jar and emr-ddb-hadoop-1.0.jar. val sparkConf = new SparkConf().setAppName("DynamoReader").setMaster("local[4]") val ctx =

Spark SQL with a sorted file

2014-12-03 Thread Jerry Raj
Hi, If I create a SchemaRDD from a file that I know is sorted on a certain field, is it possible to somehow pass that information on to Spark SQL so that SQL queries referencing that field are optimized? Thanks -Jerry

Re: Having problem with Spark streaming with Kinesis

2014-12-03 Thread A.K.M. Ashrafuzzaman
Guys, on my local machine it consumes a Kinesis stream with 3 shards, but on EC2 it does not consume from the stream. Later we found that the EC2 machine had 2 cores and my local machine had 4 cores. I am using a single machine in Spark standalone mode. And we got a larger machine

Re: SQL query in scala API

2014-12-03 Thread Cheng Lian
You may do this: table("users").groupBy('zip)('zip, count('user), countDistinct('user)) On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API,

spark-submit on YARN is slow

2014-12-03 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application to YARN in yarn-cluster mode. I have both the Spark assembly jar file as well as my application jar file put in HDFS and can see from the logging output that both files are used from there. However, it still takes about 10 seconds for my

cannot submit python files on EC2 cluster

2014-12-03 Thread chocjy
Hi, I am using Spark 1.1.0 on an EC2 cluster. After I submitted the job, it returned an error saying that a Python module cannot be loaded due to missing files. I am using the same command that used to work on a private cluster before for submitting jobs, and all the source

RE: Spark SQL UDF returning a list?

2014-12-03 Thread Cheng, Hao
Yes, I agree, and it may also be ambiguous semantically: a list of objects vs. a list with a single List object. I've also tested that; it seems: a. There is a bug in registerFunction, which doesn't support a UDF without arguments. (I just created a PR for this:

RE: Spark SQL with a sorted file

2014-12-03 Thread Cheng, Hao
You can try to write your own Relation with filter push-down, or use ParquetRelation2 as a workaround. (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala) Cheng Hao -Original Message- From: Jerry Raj

Issue in executing Spark Application from Eclipse

2014-12-03 Thread Stuti Awasthi
Hi All, I have a standalone Spark (1.1) cluster on one machine and I have installed the Scala Eclipse IDE (Scala 2.10) on my desktop. I am trying to run Spark code against my standalone cluster but am getting errors. Please guide me to resolve this. Code: val logFile = File Path present

MLLIB model export: PMML vs MLLIB serialization

2014-12-03 Thread sourabh
Hi All, I am doing model training using Spark MLlib inside our Hadoop cluster, but prediction happens in a different real-time synchronous system (a web application). I am currently exploring different options to export the trained MLlib models from Spark. 1. Export model as PMML: I found the

Re: netty on classpath when using spark-submit

2014-12-03 Thread Tobias Pfeiffer
Markus, On Tue, Nov 11, 2014 at 10:40 AM, M. Dale medal...@yahoo.com wrote: I never tried to use this property. I was hoping someone else would jump in. When I saw your original question I remembered that Hadoop has something similar. So I searched and found the link below. A quick JIRA

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
To get that function in scope you have to import org.apache.spark.SparkContext._ Ankur On Wednesday, December 3, 2014, Deep Pradhan pradhandeep1...@gmail.com wrote: But groupByKey() gives me the error saying that it is not a member of org.apache.spark.rdd.RDD[(Double,
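Putting the thread together, a sketch of grouping vertices by their attribute (the Double attribute type comes from the earlier example; the rest is an assumption):

```scala
import org.apache.spark.SparkContext._            // brings groupByKey into scope in Spark 1.x
import org.apache.spark.graphx.{Graph, VertexId}

val graph: Graph[Double, Int] = ???               // placeholder: the graph from the earlier messages

val byAttribute = graph.vertices
  .map { case (id, attr) => (attr, id) }          // key by attribute instead of vertex id
  .groupByKey()
// byAttribute: RDD[(Double, Iterable[VertexId])]: vertices sharing an attribute grouped together
```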

Spark Streaming empty RDD issue

2014-12-03 Thread Hafiz Mujadid
Hi experts, I am using Spark Streaming to integrate Kafka for real-time data processing. I am facing some issues related to Spark Streaming, so I want to know how we can detect: 1) our connection has been lost, 2) our receiver is down, 3) Spark Streaming has no new messages to consume. How can we deal

Re: heterogeneous cluster setup

2014-12-03 Thread Victor Tso-Guillen
You'll have to decide which is more expensive in your heterogeneous environment and optimize for the utilization of that. For example, you may decide that memory is the only costing factor and you can discount the number of cores. Then you could have 8GB on each worker, each with four cores. Note

Re: Issue in executing Spark Application from Eclipse

2014-12-03 Thread Akhil Das
It seems you provided the master URL as spark://10.112.67.80:7077; I think you should give spark://ubuntu:7077 instead. Thanks Best Regards On Thu, Dec 4, 2014 at 11:35 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi All, I have a standalone Spark(1.1) cluster on one machine and I have
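For reference, a minimal driver-side sketch (the application name is an assumption) that points at the standalone master by the host name it advertises:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyStandaloneApp")         // assumed app name
  .setMaster("spark://ubuntu:7077")      // must match the URL shown on the master's web UI
val sc = new SparkContext(conf)
```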

Re: Spark Streaming empty RDD issue

2014-12-03 Thread Akhil Das
You can set up Nagios-based monitoring for these; also, setting up a high-availability environment will be more fault tolerant. Thanks Best Regards On Thu, Dec 4, 2014 at 12:17 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi Experts I am using Spark Streaming to integrate Kafka for real

Re: dockerized spark executor on mesos?

2014-12-03 Thread Dick Davies
Oh great, thanks Tim - I'll give that a whirl then. A Spark rollout isn't in our immediate future, I'm just looking for a good framework for compute alongside our marathon deployment. So a good time to experiment! On 4 December 2014 at 02:36, Tim Chen t...@mesosphere.io wrote: Hi Dick, There

serialization issue when more than one case class is used

2014-12-03 Thread Rahul Bindlish
Hi, I am a newbie in Spark and performed the following steps during POC execution: 1. Map a CSV file to an object file after some transformations, once. 2. Deserialize the object file into an RDD for operations, as needed. In the case of 2 CSV/object files, the first object file is deserialized into an RDD successfully but during

Necessity for rdd replication.

2014-12-03 Thread rapelly kartheek
Hi, I was just thinking about the necessity for RDD replication. One category could be something like a large number of threads requiring the same RDD. Even though a single RDD can be shared by multiple threads belonging to the same application, I believe we can extract better parallelism if the RDD is

Re: cannot submit python files on EC2 cluster

2014-12-03 Thread Davies Liu
On Wed, Dec 3, 2014 at 8:17 PM, chocjy jiyanyan...@gmail.com wrote: Hi, I am using spark with version number 1.1.0 on an EC2 cluster. After I submitted the job, it returned an error saying that a python module cannot be loaded due to missing files. I am using the same command that used to
