RE: Spark streaming on standalone cluster

2015-07-01 Thread prajod.vettiyattil
Spark Streaming needs at least two threads on the worker/slave side. I have seen this issue when (to test the behavior) I set the thread count for Spark Streaming to 1. It should be at least 2: one for the receiver adapter (Kafka, Flume etc.) and the second for processing the data. But I tested

Re: import errors with Eclipse Scala

2015-07-01 Thread Jem Tucker
in Eclipse you can just add the spark assembly jar to the build path: right click the project > Build Path > Configure Build Path > Libraries > Add External JARs. On Wed, Jul 1, 2015 at 7:15 PM Stefan Panayotov spanayo...@msn.com wrote: Hi Ted, How can I import the relevant Spark projects into

Re: import errors with Eclipse Scala

2015-07-01 Thread Ted Yu
Have you imported the relevant Spark projects into Eclipse. You can run command similar to the following to generate project files for Spark: mvn clean package -DskipTests eclipse:eclipse On Wed, Jul 1, 2015 at 9:57 AM, Stefan Panayotov spanayo...@msn.com wrote: Hi Team, Just installed

Task InputSize source code location

2015-07-01 Thread Shiyao Ma
Hi, When running tasks, I found some task has input size of zero, while others not. For example, in this picture: http://snag.gy/g6iJX.jpg I suspect it has something to do with the block manager. But where is the exact source code that monitors the task input size? Thanks.

making dataframe for different types using spark-csv

2015-07-01 Thread Hafiz Mujadid
Hi experts! I am using spark-csv to load csv data into a dataframe. By default it makes the type of each column string. Is there some way to get a dataframe with the actual types like int, double etc.? Thanks -- View this message in context:

Spark Standalone Cluster - Slave not connecting to Master

2015-07-01 Thread rshyam
I am trying to setup a Spark standalone cluster following the official documentation. My master is on a local vm running ubuntu and I also have one worker running in the same machine. It is connecting and I am able to see its status in the WebUI of the master. But when I try to connect a slave

RE: import errors with Eclipse Scala

2015-07-01 Thread Stefan Panayotov
Thanks, Jem. I added scala-compiler.jar from C:\Eclipse\eclipse\plugins\org.scala-ide.scala210.jars_4.1.0.201505250838\target\jars And looks like this resolved the issue. Thanks once again. Stefan Panayotov, PhD Home: 610-355-0919 Cell: 610-517-5586 email: spanayo...@msn.com

RE: import errors with Eclipse Scala

2015-07-01 Thread Stefan Panayotov
Hi Ted, How can I import the relevant Spark projects into Eclipse? Do I need to add anything the Java Build Path in the project properties? Also, I have installed sbt on my machine. Is there a corresponding sbt command to the maven command below? Stefan Panayotov, PhD Home: 610-355-0919

BroadCast Multiple DataFrame ( JDBC Tables )

2015-07-01 Thread Ashish Soni
Hi , I need to load 10 tables in memory and have them available to all the workers. Please let me know what is the best way to broadcast them; sc.broadcast(df) allows only one. Thanks,

Re: coalesce on dataFrame

2015-07-01 Thread Olivier Girardot
PySpark or Spark (scala) ? When you use coalesce with anything but a column you must use a literal like that in PySpark : from pyspark.sql import functions as F F.coalesce(df.a, F.lit(True)) Le mer. 1 juil. 2015 à 12:03, Ewan Leith ewan.le...@realitymine.com a écrit : It's in spark 1.4.0, or

custom RDD in java

2015-07-01 Thread Shushant Arora
Hi Is it possible to write a custom RDD in Java? Requirement is - I am having a list of SQL Server tables that need to be dumped in HDFS. So I have a List<String> tables = {dbname.tablename, dbname.tablename2..}; then JavaRDD<String> rdd = javasparkcontext.parallelize(tables); JavaRDD<String>

Re: Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Arush Kharbanda
You can try using Spark Jobserver https://github.com/spark-jobserver/spark-jobserver On Wed, Jul 1, 2015 at 4:32 PM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: Folks, My Use case is as follows: My Driver program will be aggregating a bunch of Event Streams and acting on it. The

Making Unpersist Lazy

2015-07-01 Thread Jem Tucker
Hi, The current behavior of rdd.unpersist() appears to not be lazily executed and therefore must be placed after an action. Is there any way to emulate lazy execution of this function so it is added to the task queue? Thanks, Jem

Passing name of package in sparkR.init()

2015-07-01 Thread Sourav Mazumder
Hi, What is the right way to pass package name in sparkR.init() ? I can successfully pass the package name if I'm using sparkR shell by using --package while invoking sparkR. However, if I'm trying to use sparkR from RStudio and neeed to pass a package name in sparkR.init() not sure how to do

Illegal access error when initializing SparkConf

2015-07-01 Thread Ramprakash Ramamoorthy
Team, I'm just playing around with spark and mllib. Installed scala and spark, versions mentioned below. Scala - 2.11.7 Spark - 1.4.0 (Did an mvn package with -Dscala-2.11) I'm trying to run the Java classification, clustering examples that came along with the documentation. However, I'm

Re: Spark streaming on standalone cluster

2015-07-01 Thread Wojciech Pituła
Hi, https://spark.apache.org/docs/latest/streaming-programming-guide.html Points to remember - When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If
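
A minimal sketch of that point, assuming a simple socket receiver; the master needs at least one core for the receiver and one for processing the batches:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // "local[2]" (or more): one thread runs the receiver, the other processes the batches
  val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingThreadsSketch")
  val ssc = new StreamingContext(conf, Seconds(4))

  // hypothetical receiver-based source; Kafka/Flume receivers need the extra core the same way
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.count().print()

  ssc.start()
  ssc.awaitTermination()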

Re: Difference between spark-defaults.conf and SparkConf.set

2015-07-01 Thread yana
Thanks. Without spark-submit it seems the more straightforward solution is to just pass it on the driver's classpath. I was more surprised that the same conf parameter had different behavior depending on where it's specified: program vs spark-defaults. I'm all set now - thanks for replying

Re: Spark streaming on standalone cluster

2015-07-01 Thread Borja Garrido Bear
Hi all, Thanks for the answers, yes, my problem was I was using just one worker with one core, so it was starving and then I never get the job to run, now it seems it's working properly. One question, is this information in the docs? (because maybe I misread it) On Wed, Jul 1, 2015 at 10:30 AM,

Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Spark Enthusiast
Folks, My Use case is as follows: My Driver program will be aggregating a bunch of Event Streams and acting on it. The Action on the aggregated events is configurable and can change dynamically. One way I can think of is to run the Spark Driver as a Service where a config push can be caught via

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Pooja Jain
Join is happening successfully as I am able to do count() after the join. Error is coming only while trying to write in parquet format on hdfs. Thanks, Pooja. On Wed, Jul 1, 2015 at 1:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It says: Caused by: java.net.ConnectException:

RE: coalesce on dataFrame

2015-07-01 Thread Ewan Leith
It's in spark 1.4.0, or should be at least: https://issues.apache.org/jira/browse/SPARK-6972 Ewan -Original Message- From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] Sent: 01 July 2015 08:23 To: user@spark.apache.org Subject: coalesce on dataFrame How can we use coalesce(1, true)
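
For reference, a minimal sketch against Spark 1.4, assuming a DataFrame named df; unlike RDD.coalesce, DataFrame.coalesce takes only the target partition count:

  // collapse the output to a single partition, e.g. to end up with one output file
  val single = df.coalesce(1)
  single.write.parquet("hdfs:///tmp/one-partition-output")   // hypothetical output path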

Json Dataframe formation and Querying

2015-07-01 Thread Chaudhary, Umesh
Hi, I am creating DataFrame from a json file and the schema of json as truely depicted by dataframe.printschema() is: root |-- 1-F2: struct (nullable = true) ||-- A: string (nullable = true) ||-- B: string (nullable = true) ||-- C: string (nullable = true) |-- 10-C4: struct (nullable

Re: s3 bucket access/read file

2015-07-01 Thread Steve Loughran
s3a uses amazon's own libraries; it's tested against frankfurt too. you have to view s3a support in Hadoop 2.6 as beta-release: it works, with some issues. Hadoop 2.7.0+ has it all working now, though are left with the task of getting hadoop-aws and the amazon JAR onto your classpath via the

Issues when saving dataframe in Spark 1.4 with parquet format

2015-07-01 Thread David Sabater Dinter
Hi chaps, It seems there is an issue while saving dataframes in Spark 1.4. The default file extension inside Hive warehouse folder is now part-r-X.gz.parquet but while running queries from SparkSQL Thriftserver is still looking for part-r-X.parquet. Is there any config parameter we can use as

Re: Check for null in PySpark DataFrame

2015-07-01 Thread Olivier Girardot
I must admit I've been using the same back to SQL strategy for now :p So I'd be glad to have insights into that too. Le mar. 30 juin 2015 à 23:28, pedro ski.rodrig...@gmail.com a écrit : I am trying to find what is the correct way to programmatically check for null values for rows in a

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Raghavendra Pandey
So do you want to change the behavior of persist api or write the rdd on disk... On Jul 1, 2015 9:13 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I think i want to use persist then and write my intermediate RDDs to disk+mem. On Wed, Jul 1, 2015 at 8:28 AM, Raghavendra Pandey

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Raghavendra Pandey
By any chance, are you using time field in your df. Time fields are known to be notorious in rdd conversion. On Jul 1, 2015 6:13 PM, Pooja Jain pooja.ja...@gmail.com wrote: Join is happening successfully as I am able to do count() after the join. Error is coming only while trying to write in

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread ๏̯͡๏
I think I want to use persist then and write my intermediate RDDs to disk+mem. On Wed, Jul 1, 2015 at 8:28 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I think the persist api is internal to rdd whereas the write api is for saving content on disk. Rdd persist will dump your obj bytes

Re: custom RDD in java

2015-07-01 Thread Silvio Fiorito
If all you’re doing is just dumping tables from SQLServer to HDFS, have you looked at Sqoop? Otherwise, if you need to run this in Spark could you just use the existing JdbcRDD? From: Shushant Arora Date: Wednesday, July 1, 2015 at 10:19 AM To: user Subject: custom RDD in java Hi Is it
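
A rough sketch of the JdbcRDD route, assuming a SQL Server JDBC driver on the classpath; the connection string, table, and bounds are illustrative:

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val url = "jdbc:sqlserver://dbhost;databaseName=mydb"   // hypothetical connection string
  val rows = new JdbcRDD(
    sc,
    () => DriverManager.getConnection(url, "user", "password"),
    "SELECT id, name FROM dbo.accounts WHERE id >= ? AND id <= ?",   // hypothetical table; the two '?' bind the partition bounds
    1L, 1000000L, 10,                        // lower bound, upper bound, number of partitions
    rs => (rs.getLong(1), rs.getString(2))   // map each ResultSet row to a tuple
  )
  rows.map { case (id, name) => s"$id,$name" }.saveAsTextFile("hdfs:///dumps/accounts")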

BroadcastHashJoin when RDD is not cached

2015-07-01 Thread Srikanth
Hello, I have a straight forward use case of joining a large table with a smaller table. The small table is within the limit I set for spark.sql.autoBroadcastJoinThreshold. I notice that ShuffledHashJoin is used to perform the join. BroadcastHashJoin was used only when I pre-fetched and cached

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Raghavendra Pandey
For that you need to change the serialize and deserialize behavior of your class. Preferably, you can use Kryo serializers and override the behavior. For details you can look at https://github.com/EsotericSoftware/kryo/blob/master/README.md On Jul 1, 2015 9:26 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread ๏̯͡๏
I originally assumed that persisting is similar to writing, but it is not. Hence I want to change the behavior of intermediate persists. On Wed, Jul 1, 2015 at 8:46 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: So do you want to change the behavior of persist api or write the rdd on

StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread ๏̯͡๏
How do i persist an RDD using StorageLevel.MEMORY_AND_DISK_SER ? -- Deepak

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Koert Kuipers
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) On Wed, Jul 1, 2015 at 11:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: How do i persist an RDD using StorageLevel.MEMORY_AND_DISK_SER ? -- Deepak
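
For completeness, a sketch with the import that the one-liner above relies on:

  import org.apache.spark.storage.StorageLevel

  val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
  persisted.count()   // the first action materializes the serialized copy in memory, spilling to disk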

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread ๏̯͡๏
This is my write API. how do i integrate it here. protected def writeOutputRecords(detailRecords: RDD[(AvroKey[DetailOutputRecord], NullWritable)], outputDir: String) { val writeJob = new Job() val schema = SchemaUtil.outputSchema(_detail) AvroJob.setOutputKeySchema(writeJob,

Re: Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Raghavendra Pandey
I am using spark driver as a rest service. I used spray.io to make my app rest server. I think this is a good design for applications that you want to keep in long running mode.. On Jul 1, 2015 6:28 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: You can try using Spark Jobserver

Re: s3 bucket access/read file

2015-07-01 Thread Aaron Davidson
I think 2.6 failed to abruptly close streams that weren't fully read, which we observed as a huge performance hit. We had to backport the 2.7 improvements before being able to use it.

binaryFiles() for 1 million files, too much memory required

2015-07-01 Thread Konstantinos Kougios
Once again I am trying to read a directory tree using binary files. My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e. ROOTDIR/1 ROOTDIR/2 ROOTDIR/.. ROOTDIR/100 A total of 1 mil files split into 100 sub dirs Using binaryFiles requires too much memory on

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-01 Thread Michael Armbrust
You should probably write a UDF that uses regular expression or other string munging to canonicalize the subject and then group on that derived column. On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya surajshet...@gmail.com wrote: Thanks Salih. :) The output of the groupby is as below.
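
A minimal sketch of that suggestion, assuming a column named subject; the regular expression is only illustrative:

  import org.apache.spark.sql.functions.udf

  // strip leading "Re:" / "RE:" / "Fwd:" prefixes so replies group with the original subject
  val canonical = udf((s: String) => s.replaceAll("(?i)^(re|fwd?):\\s*", "").trim)

  val grouped = df.groupBy(canonical(df("subject")).as("thread")).count()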

Re: custom RDD in java

2015-07-01 Thread Shushant Arora
The list of tables is not large; the RDD is created on the table list to parallelise the work of fetching tables in multiple mappers at the same time. Since the time taken to fetch a table is significant, I can't run that sequentially. The content of a table fetched by a map job is large, so one option is to dump

Re: custom RDD in java

2015-07-01 Thread Silvio Fiorito
Sure, you can create custom RDDs. Haven’t done so in Java, but in Scala absolutely. From: Shushant Arora Date: Wednesday, July 1, 2015 at 1:44 PM To: Silvio Fiorito Cc: user Subject: Re: custom RDD in java ok..will evaluate these options but is it possible to create RDD in java? On Wed, Jul

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Michael Armbrust
I would still look at your executor logs. A count() is rewritten by the optimizer to be much more efficient because you don't actually need any of the columns. Also, writing parquet allocates quite a few large buffers. On Wed, Jul 1, 2015 at 5:42 AM, Pooja Jain pooja.ja...@gmail.com wrote:

Re: custom RDD in java

2015-07-01 Thread Feynman Liang
AFAIK RDDs can only be created on the driver, not the executors. Also, `saveAsTextFile(...)` is an action and hence can also only be executed on the driver. As Silvio already mentioned, Sqoop may be a good option. On Wed, Jul 1, 2015 at 12:46 PM, Shushant Arora shushantaror...@gmail.com wrote:

Re: Custom order by in Spark SQL

2015-07-01 Thread Michael Armbrust
Easiest way to do this today is to define a UDF that maps from string to a number. On Wed, Jul 1, 2015 at 10:25 AM, Mick Davies michael.belldav...@gmail.com wrote: Hi, Is there a way to specify a custom order by (Ordering) on a column in Spark SQL In particular I would like to have the
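
A sketch of that UDF approach for the currency example; the column name ccy and the ranking map are assumptions:

  import org.apache.spark.sql.functions.udf

  val rank = Map("USD" -> 1, "EUR" -> 2, "JPY" -> 3, "GBP" -> 4)    // illustrative ordering
  val ccyRank = udf((c: String) => rank.getOrElse(c, Int.MaxValue)) // unknown currencies sort last

  val ordered = df.orderBy(ccyRank(df("ccy")), df("ccy"))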

Re: Check for null in PySpark DataFrame

2015-07-01 Thread Michael Armbrust
There is an isNotNull function on any column. df._1.isNotNull or from pyspark.sql.functions import * col(myColumn).isNotNull On Wed, Jul 1, 2015 at 3:07 AM, Olivier Girardot ssab...@gmail.com wrote: I must admit I've been using the same back to SQL strategy for now :p So I'd be glad to have

Re: BroadcastHashJoin when RDD is not cached

2015-07-01 Thread Michael Armbrust
We don't know that the table is small unless you cache it. In Spark 1.5 you'll be able to give us a hint though ( https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L581 ) On Wed, Jul 1, 2015 at 8:30 AM, Srikanth srikanth...@gmail.com wrote:
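
For reference, a sketch of that Spark 1.5 hint (not available in 1.4); the DataFrame and key column names are assumptions:

  import org.apache.spark.sql.functions.broadcast

  // hint the planner to ship the small side to every executor instead of shuffling both sides
  val joined = largeDf.join(broadcast(smallDf), "accountId")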

Re: sparkR could not find function textFile

2015-07-01 Thread Sourav Mazumder
Thanks Shivram. Your suggestion in stack overflow regarding this did work. Thanks again. Regards, Sourav On Wed, Jul 1, 2015 at 10:21 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: You can check my comment below the answer at http://stackoverflow.com/a/30959388/4577954. BTW we

Use of Apache Spark with R package SNOW, or perhaps Hadoop YARN with same SNOW?

2015-07-01 Thread Galkowski, Jan
Pretty much as in the subject. Snow is an R package for doing mapping of computations onto processes in one or more servers that's simple to use, and requires little configuration. Organizations sometimes use Hadoop and Spark to manage large clusters of processors. Is there a way for snow to

Re: How to disable parquet schema merging in 1.4?

2015-07-01 Thread Cheng Lian
With Spark 1.4, you may use the data source option mergeSchema to control it: sqlContext.read.option("mergeSchema", "false").parquet("some/path") or CREATE TABLE t USING parquet OPTIONS (mergeSchema "false", path "some/path") We're considering disabling schema merging by default in 1.5.0 since it

Re: Spark driver using Spark Streaming shows increasing memory/CPU usage

2015-07-01 Thread Neil Mayo
[Apologies if the end of the last email was only included as an attachment - MacMail seems to do that with the rest of the message if an attachment appears inline. I‘m sending again for clarity.] Hi Tathagata, Thanks for your quick reply! I’ll add some more detail below about what I’m doing -

Re: Retrieve hadoop conf object from Python API

2015-07-01 Thread Ted Yu
To close the loop. This should work: sc._jsc.hadoopConfiguration See this method in JavaSparkContext : def hadoopConfiguration(): Configuration = { sc.hadoopConfiguration On Tue, Jun 30, 2015 at 5:52 PM, Ted Yu yuzhih...@gmail.com wrote: Minor correction: It should be sc._jsc Cheers

Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
If you take bitmap indices out of sybase then I am guessing spark sql will be at par with sybase ? On that note are there plans of integrating indexed rdd ideas to spark sql to build indices ? Is there a JIRA tracking it ? On Jun 30, 2015 7:29 PM, Eric Pederson eric...@gmail.com wrote: Hi

Re: coalesce on dataFrame

2015-07-01 Thread Burak Yavuz
You can use df.repartition(1) in Spark 1.4. See here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1396 . Best, Burak On Wed, Jul 1, 2015 at 3:05 AM, Olivier Girardot ssab...@gmail.com wrote: PySpark or Spark (scala) ? When you use

spark.streaming.receiver.maxRate Not taking effect

2015-07-01 Thread Laeeq Ahmed
Hi, I have set spark.streaming.receiver.maxRate to 100. My batch interval is 4sec but still sometimes there are more than 400 records per batch. I am using spark 1.2.0. Regards,Laeeq

import errors with Eclipse Scala

2015-07-01 Thread Stefan Panayotov
Hi Team, Just installed Eclipse with the Scala plugin to benefit from an IDE environment and I faced the problem that any import statement gives me an error. For example: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.hive.HiveContext import

Re: breeze.linalg.DenseMatrix not found

2015-07-01 Thread Alex Gittens
I think the issue was NOT with spark. I was running a spark program that dumped output to a binary file and then calling a scala program to read it and write out Matrix Market format files. The issue seems to have been with the classpath on the scala program, and went away when I added the spark

Custom order by in Spark SQL

2015-07-01 Thread Mick Davies
Hi, Is there a way to specify a custom order by (Ordering) on a column in Spark SQL In particular I would like to have the order by applied to a currency column not to be alpha, but something like - USD, EUR, JPY, GBP etc.. I saw an earlier post on UDTs and ordering (which I can't seem to

Re: spark.streaming.receiver.maxRate Not taking effect

2015-07-01 Thread Ted Yu
This might be related: SPARK-6985 Cheers On Wed, Jul 1, 2015 at 10:27 AM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, I have set spark.streaming.receiver.maxRate to 100. My batch interval is 4sec but still sometimes there are more than 400 records per batch. I am using spark

Re: How to recover in case user errors in streaming

2015-07-01 Thread Amit Assudani
Hi TD, Why don’t we have OnBatchError or similar method in StreamingListener ? Also, is StreamingListener only for receiver based approach or does it work for Kafka Direct API / File Based Streaming as well ? Regards, Amit From: Tathagata Das t...@databricks.commailto:t...@databricks.com

Re: sparkR could not find function textFile

2015-07-01 Thread Shivaram Venkataraman
You can check my comment below the answer at http://stackoverflow.com/a/30959388/4577954. BTW we added a new option to sparkR.init to pass in packages and that should be a part of 1.5 Shivaram On Wed, Jul 1, 2015 at 10:03 AM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi,

Re: custom RDD in java

2015-07-01 Thread Shushant Arora
ok..will evaluate these options but is it possible to create RDD in java? On Wed, Jul 1, 2015 at 8:29 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: If all you’re doing is just dumping tables from SQLServer to HDFS, have you looked at Sqoop? Otherwise, if you need to run this in

Re: Spark driver using Spark Streaming shows increasing memory/CPU usage

2015-07-01 Thread Neil Mayo
Hi Tathagata, Thanks for your quick reply! I’ll add some more detail below about what I’m doing - I’ve tried a lot of variations on the code to debug this, with monitoring enabled, but I didn’t want to overwhelm the issue description to start with ;-) On 30 Jun 2015, at 19:30, Tathagata Das

Convert CSV lines to List of Objects

2015-07-01 Thread Ashish Soni
Hi , How can I use the map function in Java to convert all the lines of a csv file into a list of objects? Can someone please help... JavaRDD<List<Charge>> rdd = sc.textFile("data.csv").map(new Function<String, List<Charge>>() { @Override public List<Charge> call(String s) {

Re: sparkR could not find function textFile

2015-07-01 Thread Sourav Mazumder
Hi, Piggybacking on this discussion. I'm trying to achieve the same, reading a csv file, from RStudio. Where I'm stuck is how to supply some additional package from RStudio to sparkR.init(), as sparkR.init() does not provide an option to specify an additional package. I tried the following code from

Re: Need clarification on spark on cluster set up instruction

2015-07-01 Thread Alex Gittens
I have a similar use case, so I wrote a python script to fix the cluster configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster with enough machines that the hdfs system can hold 1Tb (so use instance types that have SSDs), then follow the instructions at

Re: output folder structure not getting commited and remains as _temporary

2015-07-01 Thread Akhil Das
Looks like a jar conflict to me. java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData.getBytesWritten() You are having multiple versions of the same jars in the classpath. Thanks Best Regards On Wed, Jul 1, 2015 at 6:58 AM, nkd kalidas.nimmaga...@gmail.com

upload to s3, UI Total Duration and Sum of Job Durations

2015-07-01 Thread igor.berman
Hi, Our job is reading files from s3, transforming/aggregating them and writing them back to s3. While investigating performance problems I've noticed that there is big difference between sum of job durations and Total duration which appears in UI After investigating it a bit the difference

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Akhil Das
It says: Caused by: java.net.ConnectException: Connection refused: slave2/...:54845 Could you look in the executor logs (stderr on slave2) and see what made it shut down? Since you are doing a join there's a high possibility of OOM etc. Thanks Best Regards On Wed, Jul 1, 2015 at 10:20 AM,

Can I do Joins across Event Streams ?

2015-07-01 Thread Spark Enthusiast
Hi, I have to build a system that reacts to a set of events. Each of these events are separate streams by themselves which are consumed from different Kafka Topics and hence will have different InputDStreams. Questions: Will I be able to do joins across multiple InputDStreams and collate the

coalesce on dataFrame

2015-07-01 Thread Hafiz Mujadid
How can we use coalesce(1, true) on dataFrame? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-on-dataFrame-tp23564.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark run errors on Raspberry Pi

2015-07-01 Thread Akhil Das
Now i'm having a strange feeling to try this on KBOX http://kevinboone.net/kbox.html :/ Thanks Best Regards On Wed, Jul 1, 2015 at 9:10 AM, Exie tfind...@prodevelop.com.au wrote: FWIW, I had some trouble getting Spark running on a Pi. My core problem was using snappy for compression as it

Re: Run multiple Spark jobs concurrently

2015-07-01 Thread Akhil Das
Have a look at https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Wed, Jul 1, 2015 at 12:01 PM, Nirmal Fernando nir...@wso2.com wrote: Hi All, Is there any additional configs that we have to do to perform $subject? -- Thanks regards, Nirmal Associate
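
One relevant knob from that page, shown as a sketch: switching the in-application scheduler to FAIR so jobs submitted from separate threads share the executors instead of queuing FIFO.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("ConcurrentJobsSketch")
    .set("spark.scheduler.mode", "FAIR")   // default is FIFO
  val sc = new SparkContext(conf)

  // jobs submitted from separate threads can then run concurrently within the application
  new Thread(new Runnable {
    def run(): Unit = sc.parallelize(1 to 1000000).count()
  }).start()
  new Thread(new Runnable {
    def run(): Unit = sc.parallelize(1 to 1000000).reduce(_ + _)
  }).start()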

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-07-01 Thread SLiZn Liu
Thanks for the enlightening solution! On Wed, Jul 1, 2015 at 12:03 AM Burak Yavuz brk...@gmail.com wrote: Hi, In your build.sbt file, all the dependencies you have (hopefully they're not too many, they only have a lot of transitive dependencies), for example: ``` libraryDependencies +=

Re: Can I do Joins across Event Streams ?

2015-07-01 Thread Akhil Das
Have a look at the window, updateStateByKey operations, if you are looking for something more sophisticated then you can actually persists these streams in an intermediate storage (say for x duration) like HBase or Cassandra or any other DB and you can do global aggregations with these. Thanks
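
For the join part of the question, a minimal sketch assuming two already-created input streams keyed by a shared event id; orderLines and paymentLines are hypothetical DStream[String]s from separate Kafka input DStreams:

  import org.apache.spark.streaming.Minutes
  import org.apache.spark.streaming.dstream.DStream

  val orders: DStream[(String, String)]   = orderLines.map(l => (l.split(",")(0), l))
  val payments: DStream[(String, String)] = paymentLines.map(l => (l.split(",")(0), l))

  val joined = orders.join(payments)   // joins records that arrive in the same batch
  val windowedJoin = orders.window(Minutes(5)).join(payments.window(Minutes(5)))   // or across a window
  joined.print()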

Spark program running infinitely

2015-07-01 Thread Ladle
for (i <- 0 until distUsers.length) { val subsetData = sqlContext.sql("SELECT bidder_id, t.auction, time from BidsTable b inner join (select distinct auction from BidsTable where bidder_id='" + distUsers(i) + "') t on t.auction=b.auction order by t.auction, time").map(x => (x(0), x(1), x(2))) val withIndex =

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Holden Karau
Collecting it as a regular (Java/Scala/Python) map. You can also broadcast the map if you're going to use it multiple times. On Wednesday, July 1, 2015, Ashish Soni asoni.le...@gmail.com wrote: Thanks , So if i load some static data from database and then i need to use than in my map function to

Error with splitting contents of a dataframe column using Spark 1.4 for nested complex json file

2015-07-01 Thread Mike Tracy
Hello, I am having issues with splitting contents of a dataframe column using Spark 1.4. The dataframe was created by reading a nested complex json file. I used df.explode but keep getting error message. The json file format looks like [ { neid:{ }, mi:{

getting WARN ReliableDeliverySupervisor

2015-07-01 Thread xiaohe lan
Hi Expert, Hadoop version: 2.4 Spark version: 1.3.1 I am running the SparkPi example application. bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 2G lib/spark-examples-1.3.1-hadoop2.4.0.jar 2 The same command sometimes gets WARN

Spark on Hadoop 2.5.2

2015-07-01 Thread Xiaoyu Ma
Hi guys, I was trying to deploy SparkSQL thrift server on Hadoop 2.5.2 with Kerberos / Hive .13. It seems I got problem as below when I tried to start thrift server. java.lang.NoSuchFieldError: SASL_PROPS at

Re: DataFrame Find/Filter Based on Input - Inside Map function

2015-07-01 Thread ayan guha
You can directly use filter on a Dataframe On 2 Jul 2015 12:15, Ashish Soni asoni.le...@gmail.com wrote: Hi All , I have a DataFrame created as below options.put("dbtable", "(select * from user) as account"); DataFrame accountRdd =

Re: BroadCast Multiple DataFrame ( JDBC Tables )

2015-07-01 Thread Raghavendra Pandey
I am not sure if you can broadcast data frame without collecting it on driver... On Jul 1, 2015 11:45 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi , I need to load 10 tables in memory and have them available to all the workers , Please let me me know what is the best way to do broadcast
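
Along those lines, a hedged sketch of the usual pattern: collect each small table on the driver, then broadcast the collected rows, one broadcast per table (jdbcUrl and the table names are assumptions):

  val tableNames = Seq("accounts", "rates", "products")   // illustrative list of small tables
  val smallTables = tableNames.map { t =>
    val df = sqlContext.read.format("jdbc")
      .options(Map("url" -> jdbcUrl, "dbtable" -> t))
      .load()
    t -> sc.broadcast(df.collect())   // one Broadcast[Array[Row]] per table
  }.toMap
  // inside a task: smallTables("rates").value gives that table's rows on every worker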

Re: making dataframe for different types using spark-csv

2015-07-01 Thread Hafiz Mujadid
hi Mohammed Guller! How can I specify schema in load method? On Thu, Jul 2, 2015 at 6:43 AM, Mohammed Guller moham...@glassbeam.com wrote: Another option is to provide the schema to the load method. One variant of the sqlContext.load takes a schema as a input parameter. You can define the

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread asoni . learn
Any example of how I can return a HashMap from a data frame? Thanks , Ashish On Jul 1, 2015, at 11:34 PM, Holden Karau hol...@pigscanfly.ca wrote: Collecting it as a regular (Java/scala/Python) map. You can also broadcast the map if you're going to use it multiple times. On Wednesday, July 1,

Re: DataFrame Find/Filter Based on Input - Inside Map function

2015-07-01 Thread Mailing List
I need to pass the value of the filter dynamically, like where id=someVal, and that someVal exists in another RDD. How can I do this across JavaRDD and DataFrame? Sent from my iPad On Jul 2, 2015, at 12:49 AM, ayan guha guha.a...@gmail.com wrote: You can directly use filter on a Dataframe

Meets class not found error in spark console with newly hive context

2015-07-01 Thread Terry Hole
All, I am using spark console 1.4.0 to do some tests. When I create a new HiveContext (line 18 in the code) in my test function, it always throws an exception like below (it works in spark console 1.3.0), but if I remove the HiveContext (line 18 in the code) from my function, it works fine. Any

RE: making dataframe for different types using spark-csv

2015-07-01 Thread Mohammed Guller
Another option is to provide the schema to the load method. One variant of the sqlContext.load takes a schema as a input parameter. You can define the schema programmatically as shown here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
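
A minimal sketch of that suggestion with the spark-csv data source; the column names and types are assumptions for an illustrative file:

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id",     IntegerType, nullable = false),
    StructField("name",   StringType,  nullable = true),
    StructField("amount", DoubleType,  nullable = true)))

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .schema(schema)          // columns come back typed instead of all strings
    .load("data.csv")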

DataFrame Find/Filter Based on Input - Inside Map function

2015-07-01 Thread Ashish Soni
Hi All , I have a DataFrame created as below options.put("dbtable", "(select * from user) as account"); DataFrame accountRdd = sqlContext.read().format("jdbc").options(options).load(); and I have another RDD which contains the login name, and I want to find the userid from the above DF RDD and return

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Raghavendra Pandey
You cannot refer to one rdd inside another rdd.map function... The Rdd object is not serializable. Whatever objects you use inside the map function should be serializable, as they get transferred to executor nodes. On Jul 2, 2015 6:13 AM, Ashish Soni asoni.le...@gmail.com wrote: Hi All , I am not sure

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Ashish Soni
Thanks. So if I load some static data from a database and then need to use that in my map function to filter records, what will be the best way to do it? Ashish On Wed, Jul 1, 2015 at 10:45 PM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: You cannot refer to one rdd inside another
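
Following the collect-and-broadcast suggestion in this thread, a hedged sketch; the column positions and the charge.login field are assumptions:

  // accountDf is the small, static DataFrame loaded from the database (assumed)
  val lookup = accountDf.collect()
    .map(r => (r.getString(0), r.getLong(1)))   // e.g. login -> userId; column positions assumed
    .toMap
  val lookupBc = sc.broadcast(lookup)

  val filtered = rdd.filter { charge =>
    // only the broadcast handle is captured, so the closure stays serializable
    lookupBc.value.contains(charge.login)
  }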

Re: making dataframe for different types using spark-csv

2015-07-01 Thread Krishna Sankar
- use .cast(...).alias('...') after the DataFrame is read. - sql.functions.udf for any domain-specific conversions. Cheers k/ On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi experts! I am using spark-csv to lead csv data into dataframe. By default it
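
In Scala, the cast-after-read route looks roughly like this; the column names and target types are assumptions:

  val typed = df.select(
    df("age").cast("int").as("age"),
    df("price").cast("double").as("price"))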

Re: Subsecond queries possible?

2015-07-01 Thread Eric Pederson
I removed all of the indices from the table in IQ and the time went up to 700ms for the query on the full dataset. The best time I've got so far with Spark for the full dataset is 4s with a cached table and 30 cores. However, every column in IQ is automatically indexed by default

Re: Calling MLLib from SparkR

2015-07-01 Thread Shivaram Venkataraman
The 1.4 release does not support calling MLLib from SparkR. We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 On Wed, Jul 1, 2015 at 4:23 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, Does Spark 1.4 support calling MLLib directly from SparkR ?

Calling MLLib from SparkR

2015-07-01 Thread Sourav Mazumder
Hi, Does Spark 1.4 support calling MLLib directly from SparkR ? If not, is there any work around, any example available somewhere ? Regards, Sourav

Re: Difference between spark-defaults.conf and SparkConf.set

2015-07-01 Thread Akhil Das
.addJar works for me when i run it as a stand-alone application (without using spark-submit) Thanks Best Regards On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath

Run multiple Spark jobs concurrently

2015-07-01 Thread Nirmal Fernando
Hi All, Is there any additional configs that we have to do to perform $subject? -- Thanks regards, Nirmal Associate Technical Lead - Data Technologies Team, WSO2 Inc. Mobile: +94715779733 Blog: http://nirmalfdo.blogspot.com/

question about resource allocation on the spark standalone cluster

2015-07-01 Thread Tomer Benyamini
Hello spark-users, I would like to use the spark standalone cluster for multi-tenants, to run multiple apps at the same time. The issue is, when submitting an app to the spark standalone cluster, you cannot pass --num-executors like on yarn, but only --total-executor-cores. *This may cause
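
Since standalone mode has no --num-executors, the per-application cap is expressed in cores; a hedged sketch of the relevant settings (values and master URL are assumptions):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("spark://master-host:7077")   // hypothetical master URL
    .setAppName("TenantA")
    .set("spark.cores.max", "8")             // cap on total cores this app takes from the cluster
    .set("spark.executor.memory", "4g")      // memory per executor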

Re: Issues in reading a CSV file from local file system using spark-shell

2015-07-01 Thread Akhil Das
Since its a windows machine, you are very likely to be hitting this one https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Jul 1, 2015 at 12:36 AM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, I'm running Spark 1.4.0 without Hadoop. I'm using the binary

DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Ashish Soni
Hi All , I am not sure what is wrong with the below code, as it gives the below error when I access it inside the map but it works outside: JavaRDD<Charge> rdd2 = rdd.map(new Function<Charge, Charge>() { @Override public Charge call(Charge ch) throws Exception { *

KMeans questions

2015-07-01 Thread Eric Friedman
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train method, is there a cleaner way to create the Vectors than this? data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6))} Second, once I train the model and call predict on my
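
One slightly tidier variant of that mapping, as a sketch that just indexes the wanted columns (the column positions are taken from the question):

  import org.apache.spark.mllib.linalg.Vectors

  val cols = Array(0, 3, 4, 5, 6)
  val features = data.map(r => Vectors.dense(cols.map(r.getDouble)))   // RDD[Vector] for KMeans.train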

Re: Calling MLLib from SparkR

2015-07-01 Thread Sourav Mazumder
Hi Shivaram, Thanks for confirmation. Wondering for doing some modeling from SparkR, is there anyway I can call a Machine Learning library of R using the bootstrapping method specified in https://amplab-extras.github.io/SparkR-pkg/. Looks like the RDD apis are now private in SparkR and no way I
