global variable in spark streaming with no dependency on key

2015-08-18 Thread Joanne Contact
Hi Gurus, please help. But please don't tell me to use updateStateByKey, because I need a global variable (something like the clock time) across the micro-batches that does not depend on the key. In my case it is not acceptable to maintain a state for each key, since each key arrives at different times.

Difference btw MEMORY_ONLY and MEMORY_AND_DISK

2015-08-18 Thread Harsha HN
Hello Sparkers, I would like to understand the difference between these storage levels for an RDD portion that doesn't fit in memory. It seems that in both storage levels, whatever portion doesn't fit in memory will be spilled to disk. Is there any difference? Thanks, Harsha

Re: Difference btw MEMORY_ONLY and MEMORY_AND_DISK

2015-08-18 Thread Sabarish Sasidharan
With MEMORY_ONLY, partitions that do not fit in memory are not cached (and are recomputed when needed), but MEMORY_AND_DISK will spill them to disk. Regards Sab On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN 99harsha.h@gmail.com wrote: Hello Sparkers, I would like to understand difference btw these Storage levels for a RDD portion that doesn't
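For illustration, a minimal sketch of the two levels (assuming the spark-shell `sc` and a placeholder HDFS path):

  import org.apache.spark.storage.StorageLevel

  // MEMORY_ONLY: partitions that do not fit in memory are simply not cached
  // and are recomputed from lineage when they are needed again.
  val inMemory = sc.textFile("hdfs:///tmp/input").persist(StorageLevel.MEMORY_ONLY)

  // MEMORY_AND_DISK: partitions that do not fit in memory are written to
  // local disk and read back from there instead of being recomputed.
  val memAndDisk = sc.textFile("hdfs:///tmp/input").persist(StorageLevel.MEMORY_AND_DISK)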

Re: Regarding rdd.collect()

2015-08-18 Thread ayan guha
I think you are mixing the notion of a job from the Hadoop MapReduce world with Spark. In Spark, RDDs are immutable and transformations are lazy, so the first time an RDD actually fills up memory is when you run the first action on it. After that, it stays in memory until either the application is stopped

Spark SQL Partition discovery - schema evolution

2015-08-18 Thread Guy Hadash
Hi all, I'm using Spark SQL with data from OpenStack Swift. I'm trying to load parquet files with partition discovery, but I can't do it when the partitions don't match between two objects. For example, a container which contains: /zone=0/object2 /zone=0/area=0/object1 won't load, and will

Why there are overlapping for tasks on the EventTimeline UI

2015-08-18 Thread Todd
Hi, the following is copied from the Spark EventTimeline UI. I don't understand why there is overlap between tasks. I think they should run sequentially, one by one, in one executor (there is one core per executor). The blue part of each task is the scheduler delay time. Does it mean it is the

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
On Tue, Aug 18, 2015 at 1:16 PM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: No, the data is not stored between two jobs. But it is stored for a lifetime of a job. Job can have multiple actions run. I too thought so but wanted to confirm. Thanks. For a matter of sharing an rdd

Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Hemant Bhanawat
See if SparkContext.accumulator helps. On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi Gurus, Please help. But please don't tell me to use updateStateByKey because I need a global variable (something like the clock time) across the micro batches but not
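For illustration, a minimal sketch of the accumulator idea (assuming a Spark 1.x StreamingContext `ssc`, its SparkContext `sc`, and some input DStream `stream` as placeholders):

  // A driver-side counter that advances once per micro-batch, independent of any key.
  val batchClock = sc.accumulator(0L, "batchClock")

  stream.foreachRDD { rdd =>
    batchClock += 1                                   // the foreachRDD body runs on the driver
    println(s"micro-batch number: ${batchClock.value}")
  }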

Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread canan chen
num-executors only works in YARN mode. In standalone mode, I have to set --total-executor-cores and --executor-cores instead. Isn't this way rather unintuitive? Any reason for that?

Re:Re: Regarding rdd.collect()

2015-08-18 Thread Todd
One Spark application can have many jobs, e.g., first call rdd.count and then call rdd.collect. At 2015-08-18 15:37:14, Hemant Bhanawat hemant9...@gmail.com wrote: It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory

Regarding rdd.collect()

2015-08-18 Thread praveen S
When I do an rdd.collect(), does the data move back to the driver, or is it still held in memory across the executors?

Re:Changed Column order in DataFrame.Columns call and insertIntoJDBC

2015-08-18 Thread Todd
Take a look at the doc for the method: /** * Applies a schema to an RDD of Java Beans. * * WARNING: Since there is no guaranteed ordering for fields in a Java Bean, * SELECT * queries will return the columns in an undefined order. * @group dataframes * @since

Re: Regarding rdd.collect()

2015-08-18 Thread Dawid Wysakowicz
No, the data is not stored between two jobs, but it is stored for the lifetime of a job. A job can have multiple actions run. For sharing an RDD between jobs you can have a look at Spark Job Server (spark-jobserver https://github.com/ooyala/spark-jobserver) or some in-memory storages:

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory between two job executions? How does the second job get a handle on the data in memory? I am interested in knowing more about it. Can you forward me a Spark article or

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Petr Novak
The solution how to share offsetRanges after DirectKafkaInputStream is transformed is in: https://github.com/apache/spark/blob/master/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
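For reference, a minimal sketch of that pattern (a sketch, assuming a Spark 1.3+ direct stream; `ssc`, `kafkaParams` and the topic name are placeholders):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  var offsetRanges = Array.empty[OffsetRange]

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("mytopic"))

  stream.transform { rdd =>
    // capture the ranges while the RDD is still a KafkaRDD underneath
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }.map(_._2).foreachRDD { rdd =>
    offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
  }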

Re: Regarding rdd.collect()

2015-08-18 Thread Sabarish Sasidharan
It is still in memory for future rdd transformations and actions. What you get in driver is a copy of the data. Regards Sab On Tue, Aug 18, 2015 at 12:02 PM, praveen S mylogi...@gmail.com wrote: When I do an rdd.collect().. The data moves back to driver Or is still held in memory across the
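A small sketch of the distinction (assuming the spark-shell `sc`):

  val rdd = sc.parallelize(1 to 1000000).cache()
  rdd.count()                               // materializes the cached blocks on the executors
  val local: Array[Int] = rdd.collect()     // a separate copy is built in the driver JVM
  // The cached blocks remain on the executors for later transformations and actions.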

Re: issue Running Spark Job on Yarn Cluster

2015-08-18 Thread MooseSpark
Please check the logs in your Hadoop YARN cluster; there you will find the precise error or exception. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Cluster-tp21779p24308.html Sent from the Apache Spark User List mailing list

Re: registering an empty RDD as a temp table in a PySpark SQL context

2015-08-18 Thread Hemant Bhanawat
It is definitely not the case for Spark SQL. A temporary table (much like a DataFrame) is just a logical plan with a name, and it is not evaluated unless a query is fired against it. I am not sure that using rdd.take in the Python code to verify the schema is the right approach, as it creates a Spark job. BTW, why

Changed Column order in DataFrame.Columns call and insertIntoJDBC

2015-08-18 Thread MooseSpark
I have an RDD which I am using to create a DataFrame based on one POJO, but when the DataFrame is created, the column order gets changed. DataFrame df = sqlCtx.createDataFrame(rdd, Pojo.class); String[] columns = df.columns(); // the columns here are in a different order from what has been defined in
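One way around it, sketched here in Scala (the field names, JDBC URL and table are placeholders, not from the original post): select the columns in an explicit order before writing, since bean-derived schemas have no guaranteed field ordering.

  // Hypothetical bean fields "id", "name", "value"; adjust to the real POJO.
  val df = sqlCtx.createDataFrame(rdd, classOf[Pojo])
  val ordered = df.select("id", "name", "value")       // fixes the column order explicitly
  ordered.insertIntoJDBC("jdbc:mysql://host:3306/db", "my_table", false)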

Re: Running Spark on user-provided Hadoop installation

2015-08-18 Thread gauravsehgal
Refer: http://spark.apache.org/docs/latest/hadoop-provided.html Specifically, if you want to use s3a paths, please edit spark-env.sh and add the following lines at the end: SPARK_DIST_CLASSPATH=$(/path/to/hadoop/hadoop-2.7.1/bin/hadoop classpath) export

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Igor Berman
By default, standalone mode creates 1 executor on every worker machine per application. The overall number of cores is configured with --total-executor-cores, so in general, if you specify --total-executor-cores=1 there will be only 1 core on some executor and you'll get what you want. On the other

Re: how to write any data (non RDD) to a file inside closure?

2015-08-18 Thread Robineast
Still not sure what you are trying to achieve. If you could post some code that doesn’t work the community can help you understand where the error (syntactic or conceptual) is. On 17 Aug 2015, at 17:42, dianweih001 [via Apache Spark User List] ml-node+s1001560n24299...@n3.nabble.com wrote:

Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
My company is interested in building a real-time time-series querying solution using Spark and Cassandra. Specifically, we're interested in setting up a Spark system against Cassandra running a hive thrift server. We need to be able to perform real-time queries on time-series data - things

Re: Spark executor lost because of GC overhead limit exceeded even though using 20 executors using 25GB each

2015-08-18 Thread Ted Yu
Do you mind providing a bit more information? The release of Spark, a code snippet of your app, and the version of Java. Thanks On Tue, Aug 18, 2015 at 8:57 AM, unk1102 umesh.ka...@gmail.com wrote: Hi this GC overhead limit error is making me crazy. I have 20 executors using 25 GB each I dont understand

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Guru, Thanks! Great to hear that someone tried it in production. How do you like it so far? Best Regards, Jerry On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani gdm...@gmail.com wrote: Hi Jerry, Yes. I’ve seen customers using this in production for data science work. I’m currently

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
But KafkaRDD[K, V, U, T, R] is not a subclass of RDD[R]; as per Java, generic inheritance is not supported, so a derived class cannot return a differently typed generic subclass from an overridden method. On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger c...@koeninger.org wrote: Option is covariant and

Re: Left outer joining big data set with small lookups

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Nope. The count action did not help to choose a broadcast join. All of my tables are Hive external tables. So I tried to trigger compute statistics from sqlContext.sql. It gives me an error saying “no such table”. I am not sure whether that is due to the following bug in 1.4.1

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Cody Koeninger
Option is covariant and KafkaRDD is a subclass of RDD On Tue, Aug 18, 2015 at 1:12 PM, Shushant Arora shushantaror...@gmail.com wrote: Is it that in scala its allowed for derived class to have any return type ? And streaming jar is originally created in scala so its allowed for
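A tiny illustration of the covariance point (generic Scala, not Spark-specific):

  class Animal
  class Dog extends Animal

  val someDog: Option[Dog] = Some(new Dog)
  val someAnimal: Option[Animal] = someDog   // compiles because Option[+A] is covariant
  // Likewise, Option[KafkaRDD[K, V, U, T, R]] conforms to Option[RDD[R]] because
  // KafkaRDD[K, V, U, T, R] extends RDD[R], so the override's return type is legal.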

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Prabeesh K.
Refer this post http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ Spark + Jupyter + Docker On 18 August 2015 at 21:29, Jerry Lam chiling...@gmail.com wrote: Hi Guru, Thanks! Great to hear that someone tried it in production. How do you like it so far? Best Regards,

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Ted Yu
Normally people would set up a Maven project with Spark dependencies, or use sbt. Can you go with either approach? Cheers On Tue, Aug 18, 2015 at 10:28 AM, Jerry jerry.c...@gmail.com wrote: Hello, So I setup Spark to run on my local machine to see if I can reproduce the issue I'm having

Re: Too many files/dirs in hdfs

2015-08-18 Thread Mohit Anchlia
Is there a way to store all the results in one file and keep the file roll-over separate from the Spark Streaming batch interval? On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY umesh9...@gmail.com wrote: In Spark Streaming you can simply check whether your RDD contains any records or not and

COMPUTE STATS on hive table - NoSuchTableException

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I am trying to compute stats from Spark on a lookup table which resides in Hive. I am invoking the Spark API as follows. It gives me NoSuchTableException. The table is double-verified, and the subsequent statement “sqlContext.sql(“select * from cpatext.lkup”)” picks up the table correctly. I am

Re: Spark Job Hangs on our production cluster

2015-08-18 Thread Imran Rashid
just looking at the thread dump from your original email, the 3 executor threads are all trying to load classes. (One thread is actually loading some class, and the others are blocked waiting to load a class, most likely trying to load the same thing.) That is really weird, definitely not

What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
Hello, So I setup Spark to run on my local machine to see if I can reproduce the issue I'm having with data frames, but I'm running into issues with the compiler. Here's what I got: $ echo $CLASSPATH

RE: Spark Job Hangs on our production cluster

2015-08-18 Thread java8964
Hi Imran: Thanks for your reply. I am not sure what you mean by repl. Can you give more detail about that? This only happens when Spark 1.2.2 tries to scan a big data set, and it cannot be reproduced if it scans a smaller dataset. FYI, I have to build and deploy Spark 1.3.1 on our production cluster.

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, thank you for the further assistance. You can reproduce this by simply running 5 match { case java.math.BigDecimal => 2 } In my personal case, I am applying a map action to a Seq[Any], so the elements inside are of type Any, to which I need to apply a proper .asInstanceOf[WhoYouShouldBe]. Saif

Re: Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread Andrew Or
Hi Canan, This is mainly for legacy reasons. The default behavior in standalone mode is that the application grabs all available resources in the cluster. This effectively means we want one executor per worker, where each executor grabs all the available cores and memory on that worker. In

Re: Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Jörn Franke
Hi, first you need to make your SLAs clear. It does not sound to me like they are defined very well, or that your solution is necessary for the scenario. I also find it hard to believe that 1 customer has 100 million transactions per month. Time series data is easy to precalculate - you do not

RE: Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
Hi Jorn, Of course we're planning on doing a proof of concept here - the difficulty is that our timeline is short, so we cannot afford too many PoCs before we have to make a decision. We also need to figure out *which* databases to proof of concept. Note that one tricky aspect of our problem

Re: Json Serde used by Spark Sql

2015-08-18 Thread Michael Armbrust
Under the covers we use Jackson's Streaming API as of Spark 1.4. On Tue, Aug 18, 2015 at 1:12 PM, Udit Mehta ume...@groupon.com wrote: Hi, I was wondering what json serde does spark sql use. I created a JsonRDD out of a json file and then registered it as a temp table to query. I can then

Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad, On a high level, in hash-based shuffle each of the M mappers writes R shuffle files, one for each reducer, where R is the number of reduce partitions. This results in M * R shuffle files. Since it is not uncommon for M and R to be O(1000), this quickly becomes expensive. An optimization with
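To make the arithmetic concrete (a hedged illustration, not from the original mail): with M = 1000 map tasks and R = 1000 reduce partitions, hash-based shuffle creates M * R = 1,000,000 shuffle files, while sort-based shuffle writes roughly one sorted data file (plus an index file) per map task. In Spark 1.x the shuffle implementation can also be selected explicitly:

  // Spark 1.x configuration sketch; "sort" is the default from Spark 1.2 onward.
  val conf = new org.apache.spark.SparkConf()
    .setAppName("shuffle-demo")
    .set("spark.shuffle.manager", "sort")   // or "hash"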

Json Serde used by Spark Sql

2015-08-18 Thread Udit Mehta
Hi, I was wondering what JSON serde Spark SQL uses. I created a JsonRDD out of a JSON file and then registered it as a temp table to query. I can then query the table using dot notation for nested structs/arrays. I was wondering how Spark SQL deserializes the JSON data based on the query.

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Andrew Or
Hi Axel, You can try setting `spark.deploy.spreadOut` to false (through your conf/spark-defaults.conf file). What this does is essentially try to schedule as many cores on one worker as possible before spilling over to other workers. Note that you *must* restart the cluster through the sbin

Re: Scala: How to match a java object????

2015-08-18 Thread Marcelo Vanzin
On Tue, Aug 18, 2015 at 12:59 PM, saif.a.ell...@wellsfargo.com wrote: 5 match { case java.math.BigDecimal => 2 } 5 match { case _: java.math.BigDecimal => 2 } -- Marcelo
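Applied to the Seq[Any] use case from the original mail, a small sketch of the type-pattern fix:

  val values: Seq[Any] = Seq(new java.math.BigDecimal(5), 3.14)
  val doubles: Seq[Double] = values.map {
    case bd: java.math.BigDecimal => bd.doubleValue   // bind the matched value, then convert
    case d: Double                => d
  }
  // doubles: Seq[Double] = List(5.0, 3.14)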

Re: Scala: How to match a java object????

2015-08-18 Thread Marcelo Vanzin
On Tue, Aug 18, 2015 at 1:19 PM, saif.a.ell...@wellsfargo.com wrote: Hi, Can you please elaborate? I am confused :-) You did note that the two pieces of code are different, right? See http://docs.scala-lang.org/tutorials/tour/pattern-matching.html for how to match things in Scala, especially

Re: dse spark-submit multiple jars issue

2015-08-18 Thread Andrew Or
Hi Satish, The problem is that `--jars` accepts a comma-delimited list of jars! E.g. spark-submit ... --jars lib1.jar,lib2.jar,lib3.jar main.jar where main.jar is your main application jar (the one that starts a SparkContext), and lib*.jar refer to additional libraries that your main

Re: Is it this a BUG?: Why Spark Flume Streaming job is not deploying the Receiver to the specified host?

2015-08-18 Thread Tathagata Das
Are you using the Flume polling stream or the older stream? Such problems of binding used to occur in the older push-based approach, hence we built the polling stream (pull-based). On Tue, Aug 18, 2015 at 4:45 AM, diplomatic Guru diplomaticg...@gmail.com wrote: I'm testing the Flume + Spark

Re: Spark Job Hangs on our production cluster

2015-08-18 Thread Imran Rashid
sorry, by repl I mean spark-shell, I guess I'm used to them being used interchangeably. From that thread dump, the one thread that isn't stuck is trying to get classes specifically related to the shell / repl: java.lang.Thread.State: RUNNABLE at

Re: Scala: How to match a java object????

2015-08-18 Thread William Briggs
Could you share your pattern matching expression that is failing? On Tue, Aug 18, 2015, 3:38 PM saif.a.ell...@wellsfargo.com wrote: Hi all, I am trying to run a spark job, in which I receive *java.math.BigDecimal* objects, instead of the scala equivalents, and I am trying to convert them

to retrive full stack trace

2015-08-18 Thread satish chandra j
Hi All, Please let me know if there are any arguments to be passed on the CLI to retrieve the FULL STACK TRACE in Apache Spark. I am stuck on an issue for which it would be helpful to analyze the full stack trace. Regards, Satish Chandra

Re: to retrive full stack trace

2015-08-18 Thread Koert Kuipers
If your error is on the executors, you need to check the executor logs for the full stack trace. On Tue, Aug 18, 2015 at 10:01 PM, satish chandra j jsatishchan...@gmail.com wrote: HI All, Please let me know if any arguments to be passed in CLI to retrieve FULL STACK TRACE in Apache Spark I am stuck in

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, Can you please elaborate? I am confused :-) Saif -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, August 18, 2015 5:15 PM To: Ellafi, Saif A. Cc: wrbri...@gmail.com; user@spark.apache.org Subject: Re: Scala: How to match a java object On Tue,

NaN in GraphX PageRank answer

2015-08-18 Thread Khaled Ammar
Hi all, I was trying to use GraphX to compute pagerank and found that pagerank value for several vertices is NaN. I am using Spark 1.3. Any idea how to fix that? -- Thanks, -Khaled

Re: Programmatically create SparkContext on YARN

2015-08-18 Thread Andrew Or
Hi Andreas, I believe the distinction is not between standalone and YARN mode, but between client and cluster mode. In client mode, your Spark submit JVM runs your driver code. In cluster mode, one of the workers (or NodeManagers if you're using YARN) in the cluster runs your driver code. In the

Spark scala addFile retrieving file with incorrect size

2015-08-18 Thread Bernardo Vecchia Stein
Hi all, I'm trying to run a spark job (written in scala) that uses addFile to download some small files to each node. However, one of the downloaded files has an incorrect size (the other ones are ok), which causes an error when using it in the code. I have looked more into the issue and

Re: Scala: How to match a java object????

2015-08-18 Thread Sujit Pal
Hi Saif, Would this work? import scala.collection.JavaConversions._ new java.math.BigDecimal(5) match { case x: java.math.BigDecimal => x.doubleValue } It gives me the following on the Scala console: res9: Double = 5.0 Assuming you had a stream of BigDecimals, you could just call map on it.

What is the reason for ExecutorLostFailure?

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi All, Why am I getting ExecutorLostFailure, and why are the executors completely lost for the rest of the processing? Eventually it makes the job fail. One thing is for sure: a lot of shuffling happens across executors in my program. Is there a way to understand and debug ExecutorLostFailure? Any pointers

Re:Why there are overlapping for tasks on the EventTimeline UI

2015-08-18 Thread Todd
I think I found the answer. On the UI, the recorded time of each task is when it is put into the thread pool; then the UI makes sense. At 2015-08-18 17:40:07, Todd bit1...@163.com wrote: Hi, Following is copied from the spark EventTimeline UI. I don't understand why there are overlapping

Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Joanne Contact
Thanks. I tried. The problem is I have to use updateStateByKey to maintain other states related to keys. I am not sure how to pass this accumulator variable into updateStateByKey. On Tue, Aug 18, 2015 at 2:17 AM, Hemant Bhanawat hemant9...@gmail.com wrote: See if SparkContext.accumulator helps. On

[mllib] Random forest maxBins and confidence in training points

2015-08-18 Thread Mark Alen
Hi everyone,  I have two questions regarding the random forest implementation in mllib 1- maxBins: Say the value of a feature is between [0,100]. In my dataset there are a lot of data points between [0,10] and one datapoint at 100 and nothing between (10, 100). I am wondering how does the

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Shenghua(Daniel) Wan
All of you are right. I was trying to create too many producers. My idea was to create a pool (for now the pool contains only one producer) shared by all the executors. After I realized it was related to the serialization issues (though I did not find clear clues in the source code to indicate the

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. prabsma...@gmail.com wrote: Refer this post http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ Spark + Jupyter + Docker On 18 August 2015 at 21:29, Jerry Lam

Re: Is it this a BUG?: Why Spark Flume Streaming job is not deploying the Receiver to the specified host?

2015-08-18 Thread diplomatic Guru
Thank you Tathagata for your response. Yes, I'm using the push model on Spark 1.2. For my scenario I do prefer the push model. Is this the case in the later version 1.4 too? I think I can find a workaround for this issue, but only if I know how to obtain the worker (executor) ID. I can get the detail

Re: What is the reason for ExecutorLostFailure?

2015-08-18 Thread Corey Nolet
Usually more information as to the cause of this will be found down in your logs. I generally see this happen when an out of memory exception has occurred for one reason or another on an executor. It's possible your memory settings are too small per executor or the concurrent number of tasks you

Re: Is it this a BUG?: Why Spark Flume Streaming job is not deploying the Receiver to the specified host?

2015-08-18 Thread Tathagata Das
I don't think there is a super clean way of doing this. Here is an idea: run a dummy job with a large number of partitions/tasks, which will access SparkEnv.get.blockManager().blockManagerId().host() and return it. sc.makeRDD(1 to 100, 100).map { _ =>
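A hedged completion of that idea (a sketch, not the verbatim code from the thread; assumes the spark-shell `sc`):

  import org.apache.spark.SparkEnv

  val executorHosts = sc.makeRDD(1 to 100, 100)
    .map { _ => SparkEnv.get.blockManager.blockManagerId.host }
    .collect()
    .distinct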

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Tathagata Das
Why are you even trying to broadcast a producer? A broadcast variable is some immutable piece of serializable DATA that can be used for processing on the executors. A Kafka producer is neither DATA nor immutable, and definitely not serializable. The right way to do this is to create the producer
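A minimal sketch of the per-executor, lazily created producer (a sketch, not the exact code from the thread; the broker address, serializers and topic are placeholders, and `stream` is assumed to be a DStream[String]):

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  object ProducerHolder {
    // Created once per executor JVM on first use; never serialized with the closure.
    lazy val producer: KafkaProducer[String, String] = {
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      new KafkaProducer[String, String](props)
    }
  }

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(r => ProducerHolder.producer.send(new ProducerRecord[String, String]("mytopic", r)))
    }
  }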

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Guru Medasani
For Python it is really great. There is some work in progress on bringing Scala support to Jupyter as well. https://github.com/hohonuuli/sparknotebook https://github.com/alexarchambault/jupyter-scala

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread andy petrella
Hey, Actually, for Scala, I'd recommend using https://github.com/andypetrella/spark-notebook/ It's deployed at several places like *Alibaba*, *EBI*, *Cray* and is supported by both the Scala community and the company Data Fellas. For instance, it was part of the Big Scala Pipeline training given

Failed to fetch block error

2015-08-18 Thread swetha
Hi, I see the following error in my Spark Job even after using like 100 cores and 16G memory. Did any of you experience the same problem earlier? 15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block input-0-1439959114400, and will not retry (0 retries)

Re: Too many files/dirs in hdfs

2015-08-18 Thread UMESH CHAUDHARY
Of course, Java or Scala can do that: 1) Create a FileWriter with append or roll over option 2) For each RDD create a StringBuilder after applying your filters 3) Write this StringBuilder to File when you want to write (The duration can be defined as a condition) On Tue, Aug 18, 2015 at 11:05 PM,
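A rough Scala sketch of that approach (the output path and filter are placeholders; `stream` is assumed to be a DStream[String], and collect() brings each batch's results to the driver):

  import java.io.{BufferedWriter, FileWriter}

  stream.foreachRDD { rdd =>
    val lines = rdd.filter(_.nonEmpty).collect()        // step 2: apply your filters
    if (lines.nonEmpty) {
      // steps 1 and 3: open in append mode and write when the condition is met
      val writer = new BufferedWriter(new FileWriter("/data/output/results.txt", true))
      try lines.foreach { l => writer.write(l); writer.newLine() }
      finally writer.close()
    }
  }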

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Marcin Kuthan
As long as the Kafka producer is thread-safe you don't need any pool at all. Just share a single producer on every executor. Please look at my blog post for more details. http://allegro.tech/spark-kafka-integration.html On 19 Aug 2015, 2:00 AM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote: All of

SparkR csv without headers

2015-08-18 Thread Franc Carter
Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column names - the csv files I have do not have column names in the first row. I can read a csv nicely with com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 etc. thanks

Repartitioning external table in Spark sql

2015-08-18 Thread James Pirz
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes. Using Spark sql and Hive Context, I am trying to run a simple scan query on an existing Hive table (which is an external table consisting of rows in text files stored in HDFS - it is NOT parquet, ORC or any other richer

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Tathagata Das
It's a cool blog post! Tweeted it! Broadcasting the configuration necessary for lazily instantiating the producer is a good idea. Nitpick: the first code example has an extra `}` ;) On Tue, Aug 18, 2015 at 10:49 PM, Marcin Kuthan marcin.kut...@gmail.com wrote: As long as Kafka producent is

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Cody Koeninger
The superclass method in DStream is defined as returning an Option[RDD[T]] On Tue, Aug 18, 2015 at 8:07 AM, Shushant Arora shushantaror...@gmail.com wrote: Getting compilation error while overriding compute method of DirectKafkaInputDStream. [ERROR]

Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-18 Thread Charlie Hack
Looks like Scala 2.11.6 and Java 1.7.0_79. ✔ ~ 09:17 $ scala Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79). Type in expressions to have them evaluated. Type :help for more information. scala ✔ ~ 09:26 $ echo $JAVA_HOME

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Cody Koeninger
Python version has been available since 1.4. It should be close to feature parity with the jvm version in 1.5 On Tue, Aug 18, 2015 at 9:36 AM, ayan guha guha.a...@gmail.com wrote: Hi Cody A non-related question. Any idea when Python-version of direct receiver is expected? Me personally

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
looking at source code of org.apache.spark.streaming.kafka.DirectKafkaInputDStream override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = { val untilOffsets = clamp(latestLeaderOffsets(maxRetries)) val rdd = KafkaRDD[K, V, U, T, R]( context.sparkContext,

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi spark users and developers, Did anyone have IPython Notebook (Jupyter) deployed in production that uses Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks so we

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Cody Koeninger
The solution you found is also in the docs: http://spark.apache.org/docs/latest/streaming-kafka-integration.html Java uses an atomic reference because Java doesn't allow you to close over non-final references. I'm not clear on your other question. On Tue, Aug 18, 2015 at 3:43 AM, Petr Novak

Java 8 lambdas

2015-08-18 Thread Kristoffer Sjögren
Hi Is there a way to execute spark jobs with Java 8 lambdas instead of using anonymous inner classes as seen in the examples? I think I remember seeing real lambdas in the examples before and in articles [1]? Cheers, -Kristoffer [1]

Spark works with the data in another cluster(Elasticsearch)

2015-08-18 Thread gen tang
Hi, Currently I have my data in an Elasticsearch cluster and I am trying to use Spark to analyse that data. The Elasticsearch cluster and the Spark cluster are two different clusters, and I use the Hadoop input format (es-hadoop) to read data from ES. I am wondering how this environment affects

Re: Java 8 lambdas

2015-08-18 Thread Sean Owen
Yes, it should Just Work. lambdas can be used for any method that takes an instance of an interface with one method, and that describes Function, PairFunction, etc. On Tue, Aug 18, 2015 at 3:23 PM, Kristoffer Sjögren sto...@gmail.com wrote: Hi Is there a way to execute spark jobs with Java 8

Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi all, I am trying to run a spark job, in which I receive java.math.BigDecimal objects, instead of the scala equivalents, and I am trying to convert them into Doubles. If I try to match-case this object class, I get: error: object java.math.BigDecimal is not a value How could I get around

broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Shenghua(Daniel) Wan
Hi, Did anyone see java.util.ConcurrentModificationException when using broadcast variables? I encountered this exception when wrapping a Kafka producer like this in the spark streaming driver. Here is what I did. KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Cody Koeninger
I wouldn't expect a kafka producer to be serializable at all... among other things, it has a background thread On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote: Hi, Did anyone see java.util.ConcurrentModificationException when using broadcast variables? I

Spark and ActorSystem

2015-08-18 Thread maxdml
Hi, I'd like to know where I could find more information related to the deprecation of the actor system in Spark (from 1.4.x). I'm interested in the reasons for this decision. Cheers -- View this message in context:

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Guru Medasani
Hi Jerry, Yes. I’ve seen customers using this in production for data science work. I’m currently using this for one of my projects on a cluster as well. Also, here is a blog that describes how to configure this.

Spark executor lost because of GC overhead limit exceeded even though using 20 executors using 25GB each

2015-08-18 Thread unk1102
Hi, this GC overhead limit error is making me crazy. I have 20 executors using 25 GB each; I don't understand at all how it can throw a GC overhead error. I also don't have that big a dataset. Once this GC error occurs in an executor, it gets lost, and slowly other executors get lost because of IOException,