Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Thanks. It works. On Thu, Jun 16, 2016 at 5:32 PM Hyukjin Kwon wrote: > It will 'auto-detect' the compression codec by the file extension and then > will decompress and read it correctly. > > Thanks! > > 2016-06-16 20:27 GMT+09:00 Vamsi Krishna : >

Re: test - what is wrong while adding one column in the dataframe

2016-06-16 Thread Zhiliang Zhu
Just a test, since it seemed the user email system had something wrong a while ago; it is okay now. On Friday, June 17, 2016 12:18 PM, Zhiliang Zhu wrote: On Tuesday, May 17, 2016 10:44 AM, Zhiliang Zhu wrote: Hi

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Deepak Goel
Well, my only guess (it is just a guess, as I don't have access to the machines, which require a hard reset): the system is running into some kind of race condition while accessing the disk and is not able to resolve it, hence it is hanging (well, this is a pretty vague statement, but it seems it

test - what is wrong while adding one column in the dataframe

2016-06-16 Thread Zhiliang Zhu
On Tuesday, May 17, 2016 10:44 AM, Zhiliang Zhu wrote: Hi All, for a given DataFrame created by Hive SQL, it is required to add one more column based on an existing column, and the previous columns should also be kept in the result
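
For reference, the standard way to add a derived column while keeping all the existing ones is DataFrame.withColumn. A minimal Scala sketch, where `df` and the "amount" column are hypothetical stand-ins:

    import org.apache.spark.sql.functions._

    // withColumn returns a new DataFrame with every original column
    // plus the derived one appended.
    val withExtra = df.withColumn("amount_doubled", col("amount") * 2)
    withExtra.printSchema() // original columns plus amount_doubled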

Re: Spark Kafka stream processing time increasing gradually

2016-06-16 Thread Roshan Singh
Hi, According to the docs ( https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream.reduceByKeyAndWindow), filterFunc can be used to retain expiring keys. I do not want to retain any expiring key, so I do not understand how this can help me stabilize it.

sparkR.init() can not load sparkPackages.

2016-06-16 Thread Joseph
Hi all, I found an issue in sparkR, maybe it's a bug: when I read a csv file, the following way works normally: ${SPARK_HOME}/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.4.0 example.R But the following way gives an error: sc <-

Re: Reporting warnings from workers

2016-06-16 Thread Mathieu Longtin
It turns out you can easily use a Python set, so I can send back a list of failed files. Thanks. On Wed, Jun 15, 2016 at 4:28 PM Ted Yu wrote: > Have you looked at: > > https://spark.apache.org/docs/latest/programming-guide.html#accumulators > > On Wed, Jun 15, 2016 at 1:24
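
For reference, a minimal sketch of the accumulator approach from the linked guide, assuming Spark 1.x; `paths` and `parse` are hypothetical stand-ins for the real job:

    import scala.collection.mutable

    // A set-valued accumulator: workers add entries, the driver reads the union.
    val failed = sc.accumulableCollection(mutable.HashSet[String]())

    sc.parallelize(paths).foreach { path =>
      try parse(path)
      catch { case e: Exception => failed += path } // record the bad file, keep going
    }

    println(s"Failed files: ${failed.value.mkString(", ")}")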

RE: Can I control the execution of Spark jobs?

2016-06-16 Thread Haopu Wang
Jacek, For example, one ETL job saves raw events and updates a file. The other job uses that file's content to process the data set. In this case, the first job has to be done before the second one. That's what I mean by dependency. Any suggestions/comments are appreciated.
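
For reference, since DataFrame/RDD actions block the driver, plain sequential driver code already enforces this kind of ordering. A minimal sketch with placeholder names and paths:

    // Job 1: save the raw events (the write action blocks until done).
    rawEvents.write.parquet("/data/raw")
    // Job 2 only starts after job 1 has returned.
    val lookup = sqlContext.read.parquet("/data/raw")
    lookup.registerTempTable("lookup")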

Re: Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Sun Rui
I saw in the job definition an Env Var: SPARKR_MASTER. What is that for? I don’t think SparkR uses it. > On Jun 17, 2016, at 10:08, Sun Rui wrote: > > It seems that spark master URL is not correct. What is it? >> On Jun 16, 2016, at 18:57, Rodrick Brown

Re: Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Sun Rui
It seems that spark master URL is not correct. What is it? > On Jun 16, 2016, at 18:57, Rodrick Brown wrote: > > Master must start with yarn, spark, mesos, or local

RE: Spark SQL driver memory keeps rising

2016-06-16 Thread Mohammed Guller
I haven’t read the code yet, but when you invoke spark-submit, where are you specifying --master yarn --deploy-mode client? Is it in the default config file and are you sure that spark-submit is reading the right file? Mohammed Author: Big Data Analytics with

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Deepak Goel
Seems like the executor memory is not enough for your job and it is writing objects to disk On Jun 17, 2016 2:25 AM, "Cassa L" wrote: > > > On Thu, Jun 16, 2016 at 5:27 AM, Deepak Goel wrote: > >> What is your hardware configuration like which you are

Skew data

2016-06-16 Thread Selvam Raman
Hi, What is skewed data? I read that if the data is skewed while joining, the job takes a long time to finish (99 percent finishes in seconds while the remaining 1 percent of tasks takes minutes to hours). How do I handle skewed data in Spark? Thanks, Selvam R +91-97877-87724
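
For reference, one common mitigation is key salting: spread each hot key over N sub-keys so no single task owns all of its rows, then aggregate twice to undo the salt. A minimal Scala sketch with illustrative names only:

    import org.apache.spark.sql.functions._

    val n = 10
    val salted  = df.withColumn("salt", (rand() * n).cast("int"))
    val partial = salted.groupBy("key", "salt").agg(sum("value").as("partial_sum"))
    val result  = partial.groupBy("key").agg(sum("partial_sum").as("total"))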

Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Hyukjin Kwon
It will 'auto-detect' the compression codec by the file extension and then will decompress and read it correctly. Thanks! 2016-06-16 20:27 GMT+09:00 Vamsi Krishna : > Hi, > > I'm using Spark 1.4.1 (HDP 2.3.2). > As per the spark-csv documentation ( >

Spark Streaming WAL issue: File exists and there is no append support!

2016-06-16 Thread tosaigan...@gmail.com
Hello, I am using Azure Blob storage for WAL persistence. I am getting the warnings below in the driver logs. Is it something related to compatibility/throttling issues with storage? java.lang.IllegalStateException: File exists and there is no append support! at

Re: Neither previous window has value for key, nor new values found.

2016-06-16 Thread N B
That post from TD that you reference has a good explanation of the issue you are encountering. The issue in my view here is that the reduce and the inverseReduce function that you have specified are not perfect opposites of each other. Consider the following strings: "a" "b" "a" forward reduce

Re: Spark Kafka stream processing time increasing gradually

2016-06-16 Thread N B
We had this same issue with the reduceByKeyAndWindow API that you are using. To fix it, you have to use a different flavor of that API, specifically the two versions that allow you to give a 'filter function' to them. Putting in the filter functions helped stabilize our application too.
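
For reference, a sketch of that flavor in Scala, assuming a DStream[(String, Long)] of counts named `pairs`; the filter drops keys whose windowed value has decayed to zero, so the internal state (and per-batch processing time) stops growing:

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Note: window operations with an inverse function require checkpointing.
    val windowed = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,                  // values entering the window
      (a: Long, b: Long) => a - b,                  // values leaving the window
      Minutes(10), Seconds(30),
      filterFunc = { case (_, count) => count > 0 } // evict keys that reached zero
    )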

Update Batch DF with Streaming

2016-06-16 Thread Amit Assudani
Hi All, Can I update batch data frames loaded in memory with streaming data? For example, I have an employee DF registered as a temporary table; it has EmployeeID, Name, Address, etc. fields, and assuming it is very big and takes time to load in memory, I've two types of employee events (both

Re: Spark jobs without a login

2016-06-16 Thread Ted Yu
Can you describe more about the container ? Please show complete stack trace for the exception. Thanks On Thu, Jun 16, 2016 at 1:32 PM, jay vyas wrote: > Hi spark: > > Is it possible to avoid reliance on a login user when running a spark job? > > I'm running out a

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Cassa L
On Thu, Jun 16, 2016 at 5:27 AM, Deepak Goel wrote: > What is your hardware configuration like which you are running Spark on? It is 24 cores, 128GB RAM.

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Cassa L
Hi, > > What do you see under Executors and Details for Stage (for the > affected stages)? Anything weird memory-related? Under the Executors tab, the logs show these warnings: 16/06/16 20:45:40 INFO TorrentBroadcast: Reading broadcast variable 422145 took 1 ms 16/06/16 20:45:40 WARN MemoryStore:

Re: How to enable core dump in spark

2016-06-16 Thread prateek arora
Hi, I am using Spark with YARN. How can I make sure that the ulimit settings are applied to the Spark process? I set the core dump limit to unlimited on all nodes by editing the /etc/security/limits.conf file and adding the line " * soft core unlimited ". I rechecked using: $ ulimit -all core file

Spark jobs without a login

2016-06-16 Thread jay vyas
Hi spark: Is it possible to avoid reliance on a login user when running a spark job? I'm running out of a container that doesn't supply a valid user name, and thus I'm getting the following exception: org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:675) I'm

Re: choice of RDD function

2016-06-16 Thread Sivakumaran S
Dear Jacek and Cody, I receive a stream of JSON (exactly this much: 4 json objects) once every 30 seconds from Kafka as follows (I have changed my data source to include more fields) :

spark sql broadcast join ?

2016-06-16 Thread kali.tumm...@gmail.com
Hi All, I have used broadcast joins in Spark Scala applications; I used partitionBy (HashPartitioner) and then persist for wide dependencies. The present project I am working on is pretty much a Hive migration to Spark SQL, which is pretty much plain SQL to be honest, with no Scala or Python apps. My question
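
For reference, a sketch of the two usual knobs in Spark 1.5+, with illustrative names; the first is the explicit DataFrame hint, the second lets the planner broadcast any table below a size threshold even in plain SQL:

    import org.apache.spark.sql.functions.broadcast

    // Explicit hint: ship dimDF to every executor and do a map-side join.
    val joined = factDF.join(broadcast(dimDF), Seq("dim_key"))

    // Or raise the automatic threshold (here ~50MB) for SQL-only workloads.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)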

Re: Recommended way to push data into HBase through Spark streaming

2016-06-16 Thread Mohammad Tariq
Forgot to add, I'm on HBase 1.0.0-cdh5.4.5, so can't use HBaseContext. And the Spark version is 1.6.1. Tariq, Mohammad about.me/mti On Thu, Jun 16, 2016 at 10:12 PM, Mohammad Tariq wrote: > Hi group, > > I have a

Re: Spark SQL driver memory keeps rising

2016-06-16 Thread Khaled Hammouda
I'm using pyspark and running in YARN client mode. I managed to anonymize the code a bit and pasted it below. You'll notice that I don't collect any output in the driver, instead the data is written to parquet directly. Also notice that I increased spark.driver.maxResultSize to 10g because the

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
I mean only that hardware-level threads and the processor's scheduling of those threads is only one segment of the total space of threads and thread scheduling, and that saying things like cores have threads or only the core schedules threads can be more confusing than helpful. On Thu, Jun 16,

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mich Talebzadeh
Well, LOL. Given a set of parameters one can argue from any angle. It is not obvious what you are trying to state here. "It is not strictly true" yeah, OK. Dr Mich Talebzadeh

Re: cache datframe

2016-06-16 Thread Jacek Laskowski
Yes. Yes. What's the use case? Jacek On 16 Jun 2016 2:17 p.m., "pseudo oduesp" wrote: > hi, > if I cache a data frame and transform it and add columns, should I cache a > second time > > df.cache() > > transformation > add new columns > > df.cache() > ?

Re: ERROR TaskResultGetter: Exception while getting task result java.io.IOException: java.lang.ClassNotFoundException: scala.Some

2016-06-16 Thread Jacek Laskowski
Hi, Why is spark-core provided while the others are non-provided? How do you assemble the app? How do you submit it for execution? What's the deployment environment? More info... more info... Jacek On 15 Jun 2016 10:26 p.m., "S Sarkar" wrote: Hello, I built

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
> > In addition, it is the core (not the OS) that determines when the thread > is executed. That's also not strictly true. "Thread" is a concept that can exist at multiple levels -- even concurrently at multiple levels for a single running program. Different entities will be responsible for

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mark Hamstra
> > Actually, threads are a hardware implementation - hence the whole notion > of “multi-threaded cores”. No, a multi-threaded core is a core that supports multiple concurrent threads of execution, not a core that has multiple threads. The terminology and marketing around multi-core processors,

Re: Has anyone used Apache NiFi

2016-06-16 Thread u...@moosheimer.com
Hi Mich, we use NiFi and it's really great. My company made an architecture blueprint based on NiFi and Spark. https://www.mysecondway.com/en/BOSON-Architecture Best regards, Kay-Uwe Moosheimer > On 16.06.2016, at 11:10, Mich Talebzadeh

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
Hi Deepak,  Yes, that’s about the size of it. The spark job isn’t filling the disk by any stretch of the imagination; in fact the only stuff that’s writing to the disk from Spark in certain of these instances is the logging.  Thanks, —Ken On Jun 16, 2016, at 12:17 PM,

Re: difference between dataframe and dataframe.write

2016-06-16 Thread Richard Catlin
I believe it depends on your Spark application. To write to Hive, use dataframe.saveAsTable To write to S3, use dataframe.write.parquet(“s3://”) Hope this helps. Richard > On Jun 16, 2016, at 9:54 AM, Natu Lauchande wrote: > > Does

RE: difference between dataframe and dataframe.write

2016-06-16 Thread Natu Lauchande
Hi, Does anyone know which one AWS EMR uses by default? Thanks, Natu On Jun 16, 2016 5:12 PM, "David Newberger" wrote: > DataFrame is a collection of data which is organized into named columns. > > DataFrame.write is an interface for saving the contents of a

Re: How to deal with tasks running too long?

2016-06-16 Thread Utkarsh Sengar
Thanks All, I know I have a data skew, but the data is unpredictable and hard to find every time. Do you think this workaround is reasonable? ExecutorService executor = Executors.newCachedThreadPool(); Callable<Result> task = () -> simulation.run();
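
For reference, a Scala equivalent of that workaround, with simulation.run() standing in for the straggling action. Note the timeout only abandons the driver-side wait; pairing it with sc.setJobGroup/sc.cancelJobGroup is what actually cancels the running job.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.Try

    val attempt = Future { simulation.run() }
    // Failure(TimeoutException) if the skewed task overruns the budget.
    val result = Try(Await.result(attempt, 10.minutes))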

Recommended way to push data into HBase through Spark streaming

2016-06-16 Thread Mohammad Tariq
Hi group, I have a streaming job which reads data from Kafka, performs some computation and pushes the result into HBase. Actually the results are pushed into 3 different HBase tables. So I was wondering what could be the best way to achieve this. Since each executor will open its own HBase
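
For reference, a common pattern is one connection per partition via foreachPartition, rather than one per record. A hedged sketch against the HBase 1.0 client API; the table, column family, and record shape are hypothetical:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("events"))
        records.foreach { case (rowKey: String, value: String) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }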

Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread akhandeshi
Rest of the stacktrace: [WARNING] java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at

converting timestamp from UTC to many time zones

2016-06-16 Thread ericjhilton
This is using Python with Spark 1.6.1 and dataframes. I have timestamps in UTC that I want to convert to local time, but a given row could be in any of several timezones. I have an 'offset' value (or alternatively, the local timezone abbreviation). I can adjust all the timestamps to a single zone or
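
For reference, one way to handle a per-row offset in 1.6 is plain epoch arithmetic, since from_utc_timestamp there only accepts a constant zone string. A Scala sketch with hypothetical column names ("utc_ts", "offset_hours"):

    import org.apache.spark.sql.functions._

    // Shift the epoch seconds by the row's own offset, then cast back.
    val localized = df.withColumn(
      "local_ts",
      (col("utc_ts").cast("long") + col("offset_hours") * 3600).cast("timestamp"))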

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Deepak Goel
I guess what you are saying is: 1. The nodes work perfectly OK, without I/O wait, before the Spark job. 2. After you have run the Spark job and killed it, the I/O wait persists. So it seems the Spark job is altering the disk in such a way that other programs can't access the disk after the spark job is

Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread Ami Khandeshi
Spark 1.6.1; Java 7; Hadoop 2.6 On Thursday, June 16, 2016, Ted Yu wrote: > bq. Caused by: KrbException: Cannot locate default realm > > Can you show the rest of the stack trace ? > > What versions of Spark / Hadoop are you using ? > > Which version of Java are you using

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Deepak Goel
Just wondering: if threads were purely a hardware implementation, then if my application in Java had one thread and it was run on a multicore machine, that thread in Java could be split up into small parts and run in different cores simultaneously. However, this would raise synchronization

Unsubscribe

2016-06-16 Thread Sanjeev Sagar
Unsubscribe

RE: difference between dataframe and dataframe.write

2016-06-16 Thread David Newberger
DataFrame is a collection of data which is organized into named columns. DataFrame.write is an interface for saving the contents of a DataFrame to external storage. Hope this helps. David Newberger
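
To make the distinction concrete, a two-line sketch with placeholder paths:

    val df = sqlContext.read.parquet("/in/events")    // DataFrame: the data itself
    df.write.mode("overwrite").parquet("/out/events") // DataFrame.write: how and where to save it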

difference between dataframe and dataframe.write

2016-06-16 Thread pseudo oduesp
hi, what is the difference between DataFrame and DataFrame.write?

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Mich Talebzadeh
Thanks all. I think we are diverging, but IMO it is a worthwhile discussion. Actually, threads are a hardware implementation, hence the whole notion of “multi-threaded cores”. What happens is that the cores often have duplicate registers, etc. for holding execution state. While it is correct

Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread Ted Yu
bq. Caused by: KrbException: Cannot locate default realm Can you show the rest of the stack trace ? What versions of Spark / Hadoop are you using ? Which version of Java are you using (local and in cluster) ? Thanks On Thu, Jun 16, 2016 at 6:32 AM, akhandeshi wrote:

Re: cache datframe

2016-06-16 Thread Alexey Pechorin
What's the reason for your first cache call? It looks like you've used the data only once to transform it without reusing the data, so there's no reason for the first cache call, and you need only the second call (and that also depends on the rest of your code). On Thu, Jun 16, 2016 at 3:17 PM,
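
For reference, a minimal sketch of that point: cache the DataFrame you will actually reuse, since caching the parent does not cover DataFrames derived from it. The source path is a placeholder.

    import org.apache.spark.sql.functions.lit

    val df = sqlContext.read.parquet("/data/in")
    val enriched = df.withColumn("flag", lit(1))
    enriched.cache() // cache once, after the last transformation you will reuse
    enriched.count() // an action materializes the cache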

Re: advise please

2016-06-16 Thread pseudo oduesp
hi, I use pyspark 1.5.0 on a yarn cluster with 19 nodes, 200 GB and 4 cores each (including the driver) 2016-06-16 15:42 GMT+02:00 pseudo oduesp : > Hi, > how can I dummy-encode a large set of columns with StringIndexer fast? > because I tested with 89 values and each one had

advise please

2016-06-16 Thread pseudo oduesp
Hi, how can I dummy-encode a large set of columns with StringIndexer fast? Because I tested with 89 values, and each one had at most 10 distinct values, and that takes a lot of time. Thanks

Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread akhandeshi
I am trying to setup my IDE to a scala spark application. I want to access HDFS files from remote Hadoop server that has Kerberos enabled. My understanding is I should be able to do that from Spark. Here is my code so far: val sparkConf = new SparkConf().setAppName(appName).setMaster(master);
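
For reference, the "Cannot locate default realm" error seen in the follow-ups usually means the JVM cannot find krb5.conf (e.g. pass -Djava.security.krb5.conf=/etc/krb5.conf). A hedged sketch of a keytab login with the standard Hadoop API; the principal and paths are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")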

RE: streaming example has error

2016-06-16 Thread David Newberger
Try adding wordCounts.print() before ssc.start() David Newberger From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Wednesday, June 15, 2016 9:16 PM To: David Newberger Cc: user@spark.apache.org Subject: Re: streaming example has error got another error StreamingContext: Error starting the
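
For reference, a minimal skeleton showing that ordering: register at least one output operation, such as print(), before ssc.start():

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)
    wordCounts.print()     // output operations must be declared first...
    ssc.start()            // ...then the context is started
    ssc.awaitTermination()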

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Deepak Goel
What is the hardware configuration you are running Spark on? Deepak

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
1. There are 320 nodes in total, with 96 dedicated to Spark. In this particular case, 21 are in the Spark cluster. In typical Spark usage, maybe 1-3 nodes will crash in a day, with probably an average of 4-5 Spark clusters running at a given time. In THIS case, 7-12 nodes will crash

Re: [scala-user] ERROR TaskResultGetter: Exception while getting task result java.io.IOException: java.lang.ClassNotFoundException: scala.Some

2016-06-16 Thread Oliver Ruebenacker
Hello, It would be useful to see the code that throws the exception. It probably means that the Scala standard library is not being uploaded to the executors. Try adding the Scala standard library to the SBT file ("org.scala-lang" % "scala-library" % "2.10.3"), or check your configuration.

unsubscribe

2016-06-16 Thread Marco Platania
unsubscribe

cache datframe

2016-06-16 Thread pseudo oduesp
hi, if I cache the same data frame, then transform it and add columns, should I cache a second time? df.cache() transformation add new columns df.cache() ?

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Jacek Laskowski
Hi, What do you see under Executors and Details for Stage (for the affected stages)? Anything weird memory-related? How does your "I am reading data from Kafka into Spark and writing it into Cassandra after processing it" pipeline look? Regards, Jacek Laskowski

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Robin East
Mich >> A core may have one or more threads It would be more accurate to say that a core could run one or more threads scheduled for execution. Threads are a software/OS concept that represent executable code that is scheduled to run by the OS; A CPU, core or virtual core/virtual processor

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Deepak Goel
I am no expert, but some naive thoughts... 1. How many HPC nodes do you have? How many of them crash (what do you mean by multiple)? Do all of them crash? 2. What things are you running on Puppet? Can't you switch it off and test Spark? Also, you can switch off Facter. Btw, your observation that

Re: How to deal with tasks running too long?

2016-06-16 Thread Jacek Laskowski
Hi, I'd check the Details for Stage page in the web UI. Regards, Jacek Laskowski On Thu, Jun 16, 2016 at 6:45 AM, Utkarsh Sengar

Re: Error Running SparkPi.scala Example

2016-06-16 Thread Jacek Laskowski
Hi, Before you try to do it inside another environment like an IDE, could you build Spark using mvn or sbt and, only when successful, try to run SparkPi using spark-submit run-example? With that, you could try to have a complete environment inside your beloved IDE (and I'm very glad to hear it's

Re: choice of RDD function

2016-06-16 Thread Sivakumaran S
Hi Jacek and Cody, First of all, thanks for helping me out. I started with using combineByKey while testing with just one field. Of course it worked fine, but I was worried that the code would become unreadable if there were many fields. Which is why I shifted to sqlContext because the code

Re: How to enable core dump in spark

2016-06-16 Thread Jacek Laskowski
Hi, Can you make sure that the ulimit settings are applied to the Spark process? Is this Spark on YARN or Standalone? Regards, Jacek Laskowski

Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
We run Spark on a general purpose HPC cluster (using standalone mode and the HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has been testing various storage and other parameters for Spark, which involves doing multiple shuffles and shutting down and starting many

[YARN] Questions about YARN's queues and Spark's FAIR scheduler

2016-06-16 Thread Jacek Laskowski
Hi, I'm trying to get my head around the different parts of Spark on YARN architecture with YARN's schedulers and queues as well as Spark's own schedulers - FAIR and FIFO. I'd appreciate if you could read how I see things and correct me where I'm wrong. Thanks! The default scheduler in YARN is

Re: Can I control the execution of Spark jobs?

2016-06-16 Thread Alonso Isidoro Roman
Hi Wang, maybe you can consider using an integration framework like Apache Camel in order to run the different jobs... Alonso Isidoro Roman about.me/alonso.isidoro.roman

how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Hi, I'm using Spark 1.4.1 (HDP 2.3.2). As per the spark-csv documentation (https://github.com/databricks/spark-csv), I see that we can write to a csv file in compressed form using the 'codec' option. But, didn't see the support for 'codec' option to read a csv file. Is there a way to read a
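
For reference, a sketch of the read side with spark-csv 1.x: no codec option is needed on read, because the underlying Hadoop input format picks the codec from the .gz extension (the path is a placeholder):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/file.csv.gz") // gzip detected from the extension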

Re: In yarn-cluster mode, provide system prop to the client jvm

2016-06-16 Thread Jacek Laskowski
Hi, You could use --properties-file to point to a properties file with the properties, or use spark.driver.extraJavaOptions. Regards, Jacek Laskowski
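
A sketch of the first option, with placeholder paths; the file given to --properties-file would contain a line like the one below. Separately, for the JVM that runs spark-submit itself (the actual client in yarn-cluster mode), the SPARK_SUBMIT_OPTS environment variable is picked up by the launcher scripts.

    # in custom.conf, passed as: spark-submit --properties-file custom.conf app.jar
    spark.driver.extraJavaOptions  -Dlog4j.configuration=file:/path/to/log4j.properties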

Re: Can I control the execution of Spark jobs?

2016-06-16 Thread Jacek Laskowski
Hi, When you say "several ETL types of things", what is this exactly? What would an example of "dependency between these jobs" be? Regards, Jacek Laskowski

In yarn-cluster mode, provide system prop to the client jvm

2016-06-16 Thread Ellis, Tom (Financial Markets IT)
Hi, I was wondering if it was possible to submit a java system property to the JVM that does the submission of a yarn-cluster application, for instance, -Dlog4j.configuration. I believe it will default to using the SPARK_CONF_DIR's log4j.properties, is it possible to override this, as I do not

Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Rodrick Brown
We use Chronos extensively for running our batch-based Spark jobs and recently started using SparkR. However, I'm seeing an issue trying to get SparkR jobs to run successfully when launched through Chronos; they basically always fail with the generic error below Launching java with spark-submit

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Mich Talebzadeh
Hi, Your statement: "I have a system with 64 GB RAM and SSD and its performance on local cluster SPARK is way better". Is this a host with 64GB of RAM whose data is stored on local solid state disks? Can you kindly provide the parameters you pass to spark-submit:

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Gourav Sengupta
Hi, We do have a dimension table with around a few hundred columns, from which we need only a few columns to join with the main fact table, which has a few million rows. I do not know how one-off this case sounds, but since I have been working in data warehousing it sounds like a fairly general

Re: choice of RDD function

2016-06-16 Thread Jacek Laskowski
Rather: val df = sqlContext.read.json(rdd) Regards, Jacek Laskowski On Wed, Jun 15, 2016 at 11:55 PM, Sivakumaran S

Re: choice of RDD function

2016-06-16 Thread Jacek Laskowski
Hi, That's one of my concerns with the code. What concerned me the most is that the RDD(s) were converted to DataFrames only to registerTempTable and execute SQLs. I think it'd have better performance if DataFrame operators were used instead. Wish I had numbers. Regards, Jacek Laskowski

String indexer

2016-06-16 Thread pseudo oduesp
hi, what is the limit of modalities in StringIndexer: if I have columns with 1000 modalities, is it good to use StringIndexer? Or should I try another function, and if so which one, please? thanks

Can I control the execution of Spark jobs?

2016-06-16 Thread Haopu Wang
Hi, Suppose I have a spark application which is doing several ETL types of things. I understand Spark can analyze and generate several jobs to execute. The question is: is it possible to control the dependency between these jobs? Thanks!

StringIndexer

2016-06-16 Thread pseudo oduesp
Hi, I have a dataframe with 1000 columns to dummy-encode with StringIndexer. When I apply the pipeline it takes a long time. Then I want to merge the result with the other data frame, I mean: the original data frame + the columns indexed by StringIndexers. Problem: the save stage is long. Why? code indexers =
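
For reference, a sketch of what the truncated indexers code presumably builds: one StringIndexer per column wired into a single Pipeline. Here `cols` is the hypothetical list of columns to encode:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.StringIndexer

    val stages: Array[PipelineStage] = cols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }.toArray
    val indexed = new Pipeline().setStages(stages).fit(df).transform(df)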

Spark cache behaviour when the source table is modified

2016-06-16 Thread Anjali Chadha
Hi all, I am having a hard time understanding the caching concepts in Spark. I have a hive table("person"), which is cached in Spark. sqlContext.sql("create table person (name string, age int)") //Create a new table //Add some values to the table ... ... //Cache the table in Spark
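
For reference, Spark's cache is not invalidated when the underlying Hive table changes, so the usual pattern is an explicit refresh; a sketch (refreshTable is a HiveContext method in 1.x):

    sqlContext.uncacheTable("person") // drop the stale cached data
    sqlContext.refreshTable("person") // re-read table metadata and files
    sqlContext.cacheTable("person")   // cache again lazily, on next use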

Has anyone used Apache NiFi

2016-06-16 Thread Mich Talebzadeh
Hi, Has anyone used Apache NiFi for data ingestion? There was a presentation yesterday in Hortonworks' London office titled "Learn more about Data Ingest at Hortonworks". It is about HDF ( .. Data Flow) including NiFi (Niagara Files?) as its core solution. It looks

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Mich Talebzadeh
Sounds like this is a one-off case. Do you have any other use case where Hive on MR outperforms Spark? I did some tests on a 1 billion row table getting the selectivity of a column using Hive on MR, Hive on Spark engine, and Spark running in local mode (to keep it simple). Hive 2, Spark

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Jörn Franke
I agree here. However, it always depends on your use case! Best regards > On 16 Jun 2016, at 04:58, Gourav Sengupta wrote: > > Hi Mahender, > > please ensure that for dimension tables you are enabling the broadcast > method. You must be able to see surprising

Re: How to deal with tasks running too long?

2016-06-16 Thread Jeff Zhang
This may be due to data skew On Thu, Jun 16, 2016 at 12:45 PM, Utkarsh Sengar wrote: > This SO question was asked about 1yr ago. > > http://stackoverflow.com/questions/31799755/how-to-deal-with-tasks-running-too-long-comparing-to-others-in-job-in-yarn-cli > > I answered

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Takeshi Yamamuro
Hi, Have you checked the statistics of storage memory, or something? // maropu On Thu, Jun 16, 2016 at 1:37 PM, Cassa L wrote: > Hi, > I did set --driver-memory 4G. I still run into this issue after 1 hour > of data load. > > I also tried version 1.6 in test environment.