Error running sbt package on Windows 7 for Spark 1.3.1 and SimpleApp.scala

2015-06-04 Thread Joseph Washington
Hi all, I'm trying to run the standalone application SimpleApp.scala following the instructions at http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala. I was able to create a .jar file by running sbt package. However, when I tried to do $

SparkSQL : using Hive UDF returning Map throws Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread ogoh
Hello, I tested some custom UDFs on SparkSQL's ThriftServer Beeline (Spark 1.3.1). Some UDFs work fine (accessing array parameters and returning int or string types). But my UDF returning a map type throws an error: Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)

Spark terminology

2015-06-04 Thread ๏̯͡๏
I see these in my mapper-only task: Input Size / Records: 68.0 GB / 577195178; Shuffle write: 95.1 GB / 282559291; Shuffle spill (memory): 2.8 TB; Shuffle spill (disk): 90.3 GB. I understand the first one; can someone give one or two lines each for the next three? also

RE: SparkSQL : using Hive UDF returning Map throws Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread Cheng, Hao
Which version of the Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? -Original Message- From: ogoh [mailto:oke...@gmail.com] Sent: Friday, June 5, 2015 10:10 AM To: user@spark.apache.org Subject: SparkSQL : using Hive UDF returning Map throws Error: scala.MatchError: interface

RE: TF-IDF Question

2015-06-04 Thread Somnath Pandeya
Hi, org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233]) is the sparse vector representation of the terms: the first number (1048576) is the length of the vector, [35587,884670] are the indices of the words, and [3.458767233,3.458767233] are the tf-idf values of those terms. Thanks
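
For reference, a minimal sketch of how such a sparse vector is built (the size, indices and values below are taken from the example above):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sparse vector over the hashing-TF feature space of size 1048576,
// with non-zero tf-idf weights at the two hashed term indices.
val v: Vector = Vectors.sparse(
  1048576,
  Array(35587, 884670),            // indices of the hashed terms
  Array(3.458767233, 3.458767233)  // tf-idf values at those indices
)

println(v.size)    // 1048576
println(v(35587))  // 3.458767233
```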

FetchFailed Exception

2015-06-04 Thread ๏̯͡๏
I see this. Is this a problem with my code or the cluster? Is there any way to fix it? FetchFailed(BlockManagerId(2, phxdpehdc9dn2441.stratus.phx.ebay.com, 59574), shuffleId=1, mapId=80, reduceId=20, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to

Why the default Params.copy doesn't work for Model.copy?

2015-06-04 Thread Justin Yip
Hello, I have a question about the Spark 1.4 ml library. In the copy function, it is stated that the default implementation of Params doesn't work for models. ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Model.scala#L49 ) As a result, some

Column operation on Spark RDDs.

2015-06-04 Thread Carter
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations are on columns, e.g., I need to create many intermediate variables from different columns. What is the most efficient way to do this? For example, if my dataRDD[Array[String]] is like below: 123, 523, 534, ..., 893

Re: Deduping events using Spark

2015-06-04 Thread William Briggs
Hi Lee, You should be able to create a PairRDD using the nonce as the key and the AnalyticsEvent as the value. I'm very new to Spark, but here is some uncompilable pseudo code that may or may not help: events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2) The above code
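
A compilable variant of the same idea, as a self-contained sketch that can be pasted into spark-shell (the AnalyticsEvent case class is a stand-in for the real Avro class; sc is the usual SparkContext):

```scala
// Stand-in for the real Avro-generated event type; only the nonce matters here.
case class AnalyticsEvent(nonce: String, payload: String) {
  def getNonce: String = nonce
}

val events = sc.parallelize(Seq(
  AnalyticsEvent("n1", "a"), AnalyticsEvent("n1", "b"), AnalyticsEvent("n2", "c")))

// Key by nonce, keep one event per key, then drop the key again.
val deduped = events
  .map(event => (event.getNonce, event))
  .reduceByKey((a, b) => a)
  .map(_._2)

deduped.collect().foreach(println)  // one event per nonce
```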

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Holden Karau
My current example doesn't use a Hive UDAF, but you would do something pretty similar (it calls a new user defined UDAF, and there are wrappers to make Spark SQL UDAFs from Hive UDAFs but they are private). So this is doable, but since it pokes at internals it will likely break between versions

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-06-04 Thread Tathagata Das
But compile scope is supposed to be added to the assembly. https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971 algermissen1...@icloud.com wrote: Hi Iulian, On 26 May 2015, at 13:04, Iulian

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Olivier Girardot
You can use it as a broadcast variable, but if it's too large (more than 1Gb, I guess), you may need to share it by joining it to the other RDDs using some kind of key. But this is the kind of thing broadcast variables were designed for. Regards, Olivier. Le jeu. 4 juin 2015 à 23:50, dgoldenberg
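
A minimal sketch of the broadcast-variable approach (the dictionary contents are made up; sc is the usual SparkContext):

```scala
// Load the read-only dictionary once on the driver...
val dictionary: Map[String, String] = Map("foo" -> "FOO", "bar" -> "BAR")

// ...and ship it to every executor exactly once instead of once per task.
val dictBc = sc.broadcast(dictionary)

val words = sc.parallelize(Seq("foo", "bar", "baz"))
val looked = words.map(w => (w, dictBc.value.getOrElse(w, "<missing>")))

looked.collect().foreach(println)
```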

sqlCtx.load a single big csv file from s3 in parallel

2015-06-04 Thread gy8
Hi there! I'm trying to read a large .csv file (14GB) into a dataframe from S3 via the spark-csv package. I want to load this data in parallel utilizing all 20 executors that I have, however by default only 3 executors are being used (which downloaded 5gb/5gb/4gb). Here is my script (I'm using

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Huang, Roger
Is the dictionary read-only? Did you look at http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables ? -Original Message- From: dgoldenberg [mailto:dgoldenberg...@gmail.com] Sent: Thursday, June 04, 2015 4:50 PM To: user@spark.apache.org Subject: How to share

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
For the first round, you will have 16 reducers working since you have 32 partitions: each pair of the 32 partitions will know which reducer it goes to by sharing the same key, via reduceByKey. After this step is done, you will have 16 partitions, so the next round will be 8 reducers. Sincerely, DB
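
For reference, this is what the call looks like; depth is the only knob exposed (32 partitions as in the example above, the rest is illustrative; sc is the usual SparkContext):

```scala
val data = sc.parallelize(1 to 1000000, numSlices = 32)

// Plain reduce: every partition's partial result is sent straight to the driver.
val flat = data.reduce(_ + _)

// treeReduce: partial results are combined in intermediate rounds first.
// depth = 2 is the default; a larger depth adds more intermediate levels.
val tree = data.treeReduce(_ + _, depth = 2)

assert(flat == tree)
```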

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Yiannis Gkoufas
Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think gives the functionality you are looking for. I haven't tested it though. BR On 5 June 2015 at 01:35, Olivier Girardot ssab...@gmail.com wrote: You can use it as a broadcast variable, but

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang! On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think gives the functionality you are looking for. I haven't tested it

How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread dgoldenberg
We have some pipelines defined where sometimes we need to load potentially large resources such as dictionaries. What would be the best strategy for sharing such resources among the transformations/actions within a consumer? Can they be shared somehow across the RDDs? I'm looking for a way to

Re: Deduping events using Spark

2015-06-04 Thread Richard Marscher
I think if you create a bidirectional mapping from AnalyticsEvent to another type that would wrap it and use the nonce as its equality, you could then do something like reduceByKey to group by nonce and map back to AnalyticsEvent after. On Thu, Jun 4, 2015 at 1:10 PM, lbierman
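
One possible shape for that wrapper (hypothetical classes; equality and hashing delegate to the nonce so it can serve as the grouping key):

```scala
// Stand-in for the real event type.
case class AnalyticsEvent(nonce: String, payload: String)

// Wrapper whose equality/hashCode are based on the nonce only.
class NonceKey(val event: AnalyticsEvent) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: NonceKey => that.event.nonce == event.nonce
    case _              => false
  }
  override def hashCode: Int = event.nonce.hashCode
}

// Group by the wrapper, keep one event per nonce, then map back to AnalyticsEvent.
// (events is an RDD[AnalyticsEvent] as in the original post.)
val deduped = events
  .map(e => (new NonceKey(e), e))
  .reduceByKey((a, b) => a)
  .map(_._2)
```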

Re: inlcudePackage() deprecated?

2015-06-04 Thread Shivaram Venkataraman
Yeah - We don't have support for running UDFs on DataFrames yet. There is an open issue to track this https://issues.apache.org/jira/browse/SPARK-6817 Thanks Shivaram On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit daniel.emaa...@gmail.com wrote: Hello Shivaram, Was the includePackage()

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
By default, the depth of the tree is 2. Each partition will be one node. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar raghav0110...@gmail.com wrote: Hey Reza, Thanks for your response!

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-04 Thread Mohammed Guller
Deenar, Thanks for the suggestion. That is one of the ideas that I have, but I didn't get a chance to try it out yet. One of the things that could potentially cause problems is that we use wide rows. In addition, the schema is dynamic, with new columns getting added on a regular basis. That is why

Re: TreeReduce Functionality in Spark

2015-06-04 Thread Raghav Shankar
Hey Reza, Thanks for your response! Your response clarifies some of my initial thoughts. However, what I don't understand is how the depth of the tree is used to identify how many intermediate reducers there will be, and how many partitions are sent to the intermediate reducers. Could you

Re: Spark Job always cause a node to reboot

2015-06-04 Thread Ruslan Dautkhanov
vm.swappiness=0? Some vendors recommend setting this to 0 (zero), although I've seen this cause even the kernel to fail to allocate memory. It may cause the node reboot. If that's the case, set vm.swappiness to 5-10 and decrease spark.*.memory. Your spark.driver.memory + spark.executor.memory + OS + etc

Re: Adding new Spark workers on AWS EC2 - access error

2015-06-04 Thread barmaley
The issue was that the SSH key generated on the Spark master was not transferred to this new slave. The spark-ec2 script with the `start` command omits this step. The solution is to use the `launch` command with the `--resume` option. Then the SSH key is transferred to the new slave and everything goes smoothly.

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Saiph Kappa
Thanks! It is working fine now with spark-submit. Just out of curiosity, how would you use org.apache.spark.deploy.yarn.Client? Adding that spark_yarn jar to the configuration inside the application? On Thu, Jun 4, 2015 at 6:37 PM, Vova Shelgunov vvs...@gmail.com wrote: You should run it with

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Sandy Ryza
That might work, but there might also be other steps that are required. -Sandy On Thu, Jun 4, 2015 at 11:13 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Thanks! It is working fine now with spark-submit. Just out of curiosity, how would you use org.apache.spark.deploy.yarn.Client? Adding that

Re: How to run spark streaming application on YARN?

2015-06-04 Thread Saiph Kappa
Additionally, I think this document ( https://spark.apache.org/docs/latest/building-spark.html ) should mention that the protobuf.version might need to be changed to match the one used in the chosen hadoop version. For instance, with hadoop 2.7.0 I had to change protobuf.version to 1.5.0 to be

Re: Required settings for permanent HDFS Spark on EC2

2015-06-04 Thread barmaley
Hi - I'm having a similar problem with switching from ephemeral to persistent HDFS - it always looks for port 9000 regardless of the options I set for port 9010 for persistent HDFS. Have you figured out a solution? Thanks

Deduping events using Spark

2015-06-04 Thread lbierman
I'm still a bit new to Spark and am struggling to figure out the best way to dedupe my events. I load my Avro files from HDFS and then I want to dedupe events that have the same nonce. For example, my code so far: JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)

RE: Standard Scaler taking 1.5hrs

2015-06-04 Thread Piero Cinquegrana
Hi DB, Yes, I am running count() operations on the previous steps and it appears that something is slow prior to the scaler. I thought that running take(5) and printing the results would execute the command at each step and materialize the RDD, but is that not the case? That's how I was testing

Re: importerror using external library with pyspark

2015-06-04 Thread Don Drake
I would try setting PYSPARK_DRIVER_PYTHON environment variable to the location of your python binary, especially if you are using a virtual environment. -Don On Wed, Jun 3, 2015 at 8:24 PM, AlexG swift...@gmail.com wrote: I have libskylark installed on both machines in my two node cluster in

Re: Standard Scaler taking 1.5hrs

2015-06-04 Thread Holden Karau
take(5) will only evaluate enough partitions to provide 5 elements (sometimes a few more but you get the idea), so it won't trigger a full evaluation of all partitions unlike count(). On Thursday, June 4, 2015, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Hi DB, Yes I am running
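
So to time each stage honestly, something that touches every partition is needed; a quick illustration (scaledData stands in for whatever RDD feeds the scaler):

```scala
// take(5) may compute only the first partition or two, so it is not a
// reliable way to force (or time) the preceding transformations.
val preview = scaledData.take(5)
preview.foreach(println)

// count() visits every partition, materializing the full lineage,
// which is what you want when measuring how long each step really takes.
val n = scaledData.count()
println(s"materialized $n rows")
```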

Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Richard Marscher
It is possible to start multiple concurrent drivers; Spark dynamically allocates ports per Spark application on the driver, master, and workers from a port range. When you collect results back to the driver, they do not go through the master. The master is mostly there as a coordinator between the

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-04 Thread Marcelo Vanzin
I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real issue here. On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote: As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I

TF-IDF Question

2015-06-04 Thread franco barrientos
Hi all! I have a .txt file where each row is a collection of terms of a document separated by spaces. For example: 1 "Hola spark" 2 ... I followed this example from the Spark site https://spark.apache.org/docs/latest/mllib-feature-extraction.html and I get something like this: tfidf.first()

How to speed up Spark Job?

2015-06-04 Thread ๏̯͡๏
I have a spark app that reads avro sequence file data and performs join and reduceByKey. Results: Command for all runs: ./bin/spark-submit -v --master yarn-cluster --driver-class-path

Re: Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread James Aley
Thanks for the confirmation! We're quite new to Spark, so a little reassurance is a good thing to have sometimes :-) The thing that's concerning me at the moment is that my job doesn't seem to run any faster with more compute resources added to the cluster, and this is proving a little tricky to

Re: Spark 1.3.1 On Mesos Issues.

2015-06-04 Thread John Omernik
So, a few updates. When I run locally as stated before, it works fine. When I run in YARN (via Apache Myriad on Mesos) it also runs fine. The only issue is specifically with Mesos. I wonder if there is some sort of classpath goodness I need to fix or something along those lines. Any tips would be

Re: Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread Eugen Cepoi
Hi 2015-06-04 15:29 GMT+02:00 James Aley james.a...@swiftkey.com: Hi, We have a load of Avro data coming into our data systems in the form of relatively small files, which we're merging into larger Parquet files with Spark. I've been following the docs and the approach I'm taking seemed

Big performance difference when joining 3 tables in different order

2015-06-04 Thread Hao Ren
Hi, I encountered a performance issue when joining 3 tables in SparkSQL. Here is the query: SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt FROM t_category c, t_zipcode z, click_meter_site_grouped g WHERE c.refCategoryID = g.category AND z.regionCode = g.region I need to pay a
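
One way to experiment with the order explicitly is the DataFrame API instead of a comma-separated FROM clause; a rough sketch (assuming the three tables are registered with sqlContext, and that the unqualified columns come from the grouped table):

```scala
val c = sqlContext.table("t_category")
val z = sqlContext.table("t_zipcode")
val g = sqlContext.table("click_meter_site_grouped")

// Joining the large grouped table against each small dimension in turn;
// swapping the order of the two joins changes the physical plan Spark picks.
val result = g
  .join(c, c("refCategoryID") === g("category"))
  .join(z, z("regionCode") === g("region"))
  .select(g("period"), c("categoryName"), z("regionName"),
          g("action"), g("list_id"), g("cnt"))

result.explain()  // compare plans for the two join orders
```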

Scaling spark jobs returning large amount of data

2015-06-04 Thread Giuseppe Sarno
Hello, I am relatively new to Spark and I am currently trying to understand how to scale large numbers of jobs with Spark. I understand that the Spark architecture is split into Driver, Master and Workers. The Master has a standby node in case of failure and workers can scale out. All the examples I have

Re: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-04 Thread Deenar Toraskar
Mohammed, Have you tried registering your Cassandra tables in Hive/Spark SQL using the data frames API? These should then be available to query via the Spark SQL/Thrift JDBC server. Deenar On 1 June 2015 at 19:33, Mohammed Guller moham...@glassbeam.com wrote: Nobody using Spark SQL

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-04 Thread Doug Balog
Hi Yin, I’m very surprised to hear that it's not supported in 1.3 because I’ve been using it since 1.3.0. It worked great up until SPARK-6908 was merged into master. What is the supported way to get a DF for a table that is not in the default database? IMHO, if you are not going to support

Setting S3 output file grantees for spark output files

2015-06-04 Thread Justin Steigel
Hi all, I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files using rdd.saveAsTextFile('file_dir_name'). In Hive, I would add a line at the beginning of the script, set fs.s3.canned.acl=BucketOwnerFullControl, and that would set the
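
One thing worth trying (an untested sketch, not from this thread) is putting the same Hadoop property on the configuration that saveAsTextFile uses; sc is the usual SparkContext, and the RDD and bucket names are placeholders:

```scala
// Same canned-ACL property Hive sets, placed on the SparkContext's Hadoop conf
// before any S3 writes happen.
sc.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")

rdd.saveAsTextFile("s3://my-bucket/file_dir_name")
```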

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Yin Huai
Are you using RC4? On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf nightwolf...@gmail.com wrote: Thanks Yin, that seems to work with the Shell. But on a compiled application with Spark-submit it still fails with the same exception. On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai yh...@databricks.com

Spark Job always cause a node to reboot

2015-06-04 Thread Chao Chen
Hi all, I am new to Spark. I am trying to deploy HDFS (hadoop-2.6.0) and Spark-1.3.1 with four nodes, and each node has 8 cores and 8GB memory. One is configured as the headnode running the masters, and the 3 others are workers. But when I try to run the PageRank workload from HiBench, it always causes a node to

Re: StreamingListener, anyone?

2015-06-04 Thread Akhil Das
Hi Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183 Thanks Best Regards On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-04 Thread Ji ZHANG
Hi, I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this setting can be seen in web ui's environment tab. But, it still eats memory, i.e. -Xmx set to 512M but RES grows to 1.5G in half a day. On Wed, Jun 3, 2015 at 12:02 PM, Shixiong Zhu zsxw...@gmail.com wrote: Could you

Re: StreamingListener, anyone?

2015-06-04 Thread Shixiong Zhu
You should not call `jssc.stop(true);` in a StreamingListener. It will cause a dead-lock: `jssc.stop` won't return until `listenerBus` exits. But since `jssc.stop` blocks `StreamingListener`, `listenerBus` cannot exit. Best Regards, Shixiong Zhu 2015-06-04 0:39 GMT+08:00 dgoldenberg
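
A common workaround (a sketch, not from this thread; ssc is the StreamingContext) is to have the listener only set a flag and do the actual stop from another thread:

```scala
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

val shouldStop = new AtomicBoolean(false)

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // Record the decision only; never call ssc.stop() from the listener thread.
    shouldStop.set(true)
  }
})

ssc.start()

// A separate thread watches the flag and performs the stop safely.
new Thread("streaming-stopper") {
  override def run(): Unit = {
    while (!shouldStop.get()) Thread.sleep(1000)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}.start()

ssc.awaitTermination()
```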

NullPointerException SQLConf.setConf

2015-06-04 Thread patcharee
Hi, I am using Hive 0.14 and spark 0.13. I got a java.lang.NullPointerException when inserting into Hive. Any suggestions please. hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " + "select date, hh, x, y, height, u, v,

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Night Wolf
Thanks Yin, that seems to work with the Shell. But on a compiled application with Spark-submit it still fails with the same exception. On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai yh...@databricks.com wrote: Can you put the following setting in spark-defaults.conf and try again?

SparkSQL DF.explode with Nulls

2015-06-04 Thread Tom Seddon
Hi, I've worked out how to use explode on my input avro dataset with the following structure:
root
 |-- pageViewId: string (nullable = false)
 |-- components: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |--

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-04 Thread Peter Rudenko
Hi Brandon, they are available, but private to the ml package. They are now public in 1.4. For 1.3.1 you can define your transformer in the org.apache.spark.ml package - then you could use these traits. Thanks, Peter Rudenko On 2015-06-04 20:28, Brandon Plaster wrote: Is HasInputCol and HasOutputCol

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-04 Thread Yin Huai
Hi Doug, sqlContext.table does not officially support a database name. It only supports a table name as the parameter. We will add a method to support database names in the future. Thanks, Yin On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog doug.sparku...@dugos.com wrote: Hi Yin, I’m very surprised to

Re: inlcudePackage() deprecated?

2015-06-04 Thread Daniel Emaasit
Got it. Ignore my similar question on Github comments. On Thu, Jun 4, 2015 at 11:48 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah - We don't have support for running UDFs on DataFrames yet. There is an open issue to track this

Re: TreeReduce Functionality in Spark

2015-06-04 Thread Raghav Shankar
Hey DB, Thanks for the reply! I still don't think this answers my question. For example, if I have a top() action being executed and I have 32 workers(32 partitions), and I choose a depth of 4, what does the overlay of intermediate reducers look like? How many reducers are there excluding the

Re: Python Image Library and Spark

2015-06-04 Thread Akhil Das
Replace this line: img_data = sc.parallelize( list(im.getdata()) ) With: img_data = sc.parallelize( list(im.getdata()), 3 * <number of cores you have> ) Thanks Best Regards On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur jmspar...@gmail.com wrote: Hi all, I'm playing around with

Spark ML decision list

2015-06-04 Thread Sateesh Kavuri
Hi, I have used weka machine learning library for generating a model for my training set. I have used the PART algorithm (decision lists) from weka. Now, I would like to use spark ML for the PART algo for my training set and could not seem to find a parallel. Could anyone point out the

Re: Adding new Spark workers on AWS EC2 - access error

2015-06-04 Thread Akhil Das
That's because you need to add the master's public key (~/.ssh/id_rsa.pub) to the newly added slave's ~/.ssh/authorized_keys. I add slaves this way: launch a new instance by clicking on the slave instance and choosing 'launch more like this'; once it's launched, ssh into it and add the master

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-04 Thread luke89
Hi Matei, thank you for answering. According to what you said, am I mistaken when I say that tuples with the same key might eventually be spread across more than one node in case an overloaded worker can no longer accept tuples? In other words, suppose a worker (processing key K) cannot accept

Difference bewteen library dependencies version

2015-06-04 Thread Jean-Charles RISCH
Hello, (Before everything: I use IntelliJ IDEA 14.0.1, SBT and Scala 2.11.6) This morning, I was looking to resolve the "Failed to locate the winutils binary in the hadoop binary path" error. I noticed that I can solve it by configuring my build.sbt to ... libraryDependencies +=

large shuffling = executor lost?

2015-06-04 Thread Yifan LI
Hi, I am running my graphx application with Spark 1.3.1 on a small cluster. It failed with this exception: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1 But actually I found it is indeed caused by “ExecutorLostFailure”, and someone told

Re: How to create fewer output files for Spark job ?

2015-06-04 Thread ๏̯͡๏
It worked. On Thu, Jun 4, 2015 at 5:14 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Try using coalesce Thanks Regards, Meethu M On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am running a series of spark functions with 9000 executors and its

Re: How to create fewer output files for Spark job ?

2015-06-04 Thread MEETHU MATHEW
Try using coalesce Thanks Regards, Meethu M On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am running a series of spark functions with 9000 executors and its resulting in 9000+ files that is execeeding the namespace file count qutota. How can Spark
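
i.e., something along these lines before the save (the target count of 200 and the paths are arbitrary; results is the final RDD being written):

```scala
// Shrink 9000+ partitions down to ~200 output files without a full shuffle.
val fewer = results.coalesce(200)
fewer.saveAsTextFile("hdfs:///output/path")

// repartition(200) would do the same with a shuffle, which can help if the
// data is heavily skewed across the original partitions.
```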

Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
Shixiong, Thanks, interesting point. So if we want to only process one batch then terminate the consumer, what's the best way to achieve that? Presumably the listener could set a flag on the driver notifying it that it can terminate. But the driver is not in a loop, it's basically blocked in

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-06-04 Thread algermissen1971
Hi Iulian, On 26 May 2015, at 13:04, Iulian Dragoș iulian.dra...@typesafe.com wrote: On Tue, May 26, 2015 at 10:09 AM, algermissen1971 algermissen1...@icloud.com wrote: Hi, I am setting up a project that requires Kafka support and I wonder what the roadmap is for Scala 2.11 Support

Re: Transactional guarantee while saving DataFrame into a DB

2015-06-04 Thread Deenar Toraskar
Hi Tariq You need to handle the transaction semantics yourself. You could for example save from the dataframe to a staging table and then write to the final table using a single atomic INSERT INTO finalTable from stagingTable call. Remember to clear the staging table first to recover from
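
A rough sketch of that staging pattern (table names, the connection string and the Spark 1.4 DataFrame writer API are assumptions here; df is the DataFrame being saved):

```scala
import java.sql.DriverManager
import java.util.Properties

val url = "jdbc:postgresql://host/db"   // placeholder connection string
val props = new Properties()

// Clear the staging table so a previously failed run cannot leak rows.
val prep = DriverManager.getConnection(url)
try prep.createStatement().execute("DELETE FROM stagingTable")
finally prep.close()

// Bulk-load the DataFrame into the staging table.
df.write.mode("append").jdbc(url, "stagingTable", props)

// Promote everything to the final table in a single atomic statement.
val promote = DriverManager.getConnection(url)
try promote.createStatement().execute("INSERT INTO finalTable SELECT * FROM stagingTable")
finally promote.close()
```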

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
set the storage policy for the DStream RDDs to MEMORY AND DISK - it appears the storage level can be specified in the createStream methods but not createDirectStream... On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote: You can also try Dynamic Resource Allocation

Re: Adding an indexed column

2015-06-04 Thread Deenar Toraskar
or you could: 1) convert the dataframe to an RDD, 2) use mapPartitions and zipWithIndex within each partition, 3) convert the RDD back to a dataframe. You will need to make sure you preserve partitioning. Deenar On 1 June 2015 at 02:23, ayan guha guha.a...@gmail.com wrote: If you are on spark 1.3, use
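
A sketch of that round trip using a plain zipWithIndex, which gives a global running index (df and sqlContext are the usual Spark 1.3-style handles; the new column name is arbitrary):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a running index to every row, then rebuild the DataFrame with an
// extra "row_id" column appended to the original schema.
val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}

val indexedSchema = StructType(df.schema.fields :+ StructField("row_id", LongType, nullable = false))
val indexedDf = sqlContext.createDataFrame(indexedRows, indexedSchema)

indexedDf.show()
```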

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Deenar Toraskar
Hi Holden, Olivier, So for Column you need to pass in a Java function. I have some sample code which does this, but it does terrible things to access Spark internals. I also need to call a Hive UDAF in a dataframe agg function. Are there any examples of what Column expects? Deenar On 2 June 2015

Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread James Aley
Hi, We have a load of Avro data coming into our data systems in the form of relatively small files, which we're merging into larger Parquet files with Spark. I've been following the docs and the approach I'm taking seemed fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Cody Koeninger
The direct stream isn't a receiver; it isn't required to cache data anywhere unless you want it to. If you want that, just call cache. On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: set the storage policy for the DStream RDDs to MEMORY AND DISK - it appears the
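
For example (a sketch; the broker list and topic set are placeholders, and ssc is the StreamingContext):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder
val topics = Set("events")                                       // placeholder

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// The direct stream keeps nothing around by default; opt in explicitly
// if the batch RDDs are reused, picking the storage level yourself.
stream.persist(StorageLevel.MEMORY_AND_DISK)
```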

Re: Difference bewteen library dependencies version

2015-06-04 Thread Ted Yu
For your first question, please take a look at HADOOP-9922. The fix is in hadoop-common module. Cheers On Thu, Jun 4, 2015 at 2:53 AM, Jean-Charles RISCH risch.jeanchar...@gmail.com wrote: Hello, *(Before everything : I use IntellijIdea 14.0.1, SBT and Scala 2.11.6)* This morning, I was