Spark SQL driver memory keeps rising

2016-06-14 Thread Khaled Hammouda
I'm having trouble with a Spark SQL job in which I run a series of SQL transformations on data loaded from HDFS. The first two stages load data from hdfs input without issues, but later stages that require shuffles cause the driver memory to keep rising until it is exhausted, and then the driver

Re: sqlcontext - not able to connect to database

2016-06-14 Thread Jeff Zhang
The JDBC driver jar is not on the classpath; please add it using --jars. On Wed, Jun 15, 2016 at 12:45 PM, Tejaswini Buche < tejaswini.buche0...@gmail.com> wrote: > hi, > > I am trying to connect to a mysql database on my machine. > But, I am getting some error > > dataframe_mysql =
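A minimal PySpark sketch of the --jars approach described above (the jar path, credentials, and table name are placeholders, not values from the thread):

    # Launch the shell with the connector on the classpath (jar path is a placeholder):
    #   pyspark --jars /path/to/mysql-connector-java-5.1.39-bin.jar
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost:3306/my_db",
        driver="com.mysql.jdbc.Driver",
        dbtable="data1",
        user="123",
        password="secret").load()   # password is a placeholder
    df.show()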

sqlcontext - not able to connect to database

2016-06-14 Thread Tejaswini Buche
Hi, I am trying to connect to a MySQL database on my machine, but I am getting an error. dataframe_mysql = sqlContext.read.format("jdbc").options( url="jdbc:mysql://localhost:3306/my_db", driver = "com.mysql.jdbc.Driver", dbtable = "data1", user="123").load() below is the

Re: can not show all data for this table

2016-06-14 Thread Mich Talebzadeh
There may be an issue with the data in your CSV file, like a blank header line etc. Sounds like you have an issue there. I normally get rid of blank lines before putting the CSV file in HDFS. Can you actually select from that temp table? Like sql("select TransactionDate, TransactionType, Description,

Re: can not show all data for this table

2016-06-14 Thread Lee Ho Yeung
filter also gives an error: 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. Spark context available as sc. SQL context available as sqlContext. scala> import org.apache.spark.sql.SQLContext import org.apache.spark.sql.SQLContext scala> val sqlContext

Re: hivecontext error

2016-06-14 Thread Ted Yu
Which release of Spark are you using? Can you show the full error trace? Thanks On Tue, Jun 14, 2016 at 6:33 PM, Tejaswini Buche < tejaswini.buche0...@gmail.com> wrote: > I am trying to use hivecontext in spark. The following statements are > running fine : > > from pyspark.sql import

streaming example has error

2016-06-14 Thread Lee Ho Yeung
When simulating streaming with nc -lk I got the error below, then I tried the example: martin@ubuntu:~/Downloads$ /home/martin/Downloads/spark-1.6.1/bin/run-example streaming.NetworkWordCount localhost Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 16/06/14 18:33:06

hivecontext error

2016-06-14 Thread Tejaswini Buche
I am trying to use hivecontext in spark. The following statements run fine: from pyspark.sql import HiveContext sqlContext = HiveContext(sc) But when I run the statement below, sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") I get the following error: Java

Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
Oops, I just saw the link. It is not actually only for Spark 2.0. To be clear, https://issues.apache.org/jira/browse/SPARK-15393 was a bit different from your case (it was about writing an empty data frame with empty partitions). This was caused by https://github.com/apache/spark/pull/12855 and

Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
Yea, I met this case before. I guess this is related to https://issues.apache.org/jira/browse/SPARK-15393. 2016-06-15 8:46 GMT+09:00 antoniosi : > I tried the following code in both Spark 1.5.1 and Spark 1.6.0: > > import org.apache.spark.sql.types.{ > StructType,

can not show all data for this table

2016-06-14 Thread Lee Ho Yeung
After trying the following commands, I cannot show the data: https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages

Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread antoniosi
I tried the following code in both Spark 1.5.1 and Spark 1.6.0: import org.apache.spark.sql.types.{ StructType, StructField, StringType, IntegerType} import org.apache.spark.sql.Row val schema = StructType( StructField("k", StringType, true) :: StructField("v", IntegerType, false) ::
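A PySpark equivalent sketch of the (truncated) Scala snippet above, for reproducing the empty-DataFrame write; the output path is a placeholder:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("k", StringType(), True),
        StructField("v", IntegerType(), False)])

    # an empty DataFrame with the same schema, written out to a placeholder path
    empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
    empty_df.write.parquet("/tmp/empty_parquet")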

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-14 Thread Cassa L
Hi, I would appreciate any clue on this. It has become a bottleneck for our spark job. On Mon, Jun 13, 2016 at 2:56 PM, Cassa L wrote: > Hi, > > I'm using spark 1.5.1 version. I am reading data from Kafka into Spark and > writing it into Cassandra after processing it. Spark

spark standalone High availability issues

2016-06-14 Thread Darshan Singh
Hi, I am using a standalone Spark cluster and a ZooKeeper cluster for high availability. I sometimes get an error when I start the master. The error is related to leader election in Curator and says noMethod found (getProcess), and the master doesn't get started. Just wondering what could

choice of RDD function

2016-06-14 Thread Sivakumaran S
Dear friends, I have set up Kafka 0.9.0.0, Spark 1.6.1 and Scala 2.10. My source is periodically sending a JSON string to a topic in Kafka. I am able to consume this topic using Spark Streaming and print it. The schema of the source JSON is as follows: { "id": 121156, "ht": 42, "rotor_rpm":
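One common pattern for querying such a stream (a sketch, assuming the JSON arrives as plain strings on a DStream named dstream and using field names from the schema above; the table name is hypothetical):

    def process(time, rdd):
        if rdd.isEmpty():
            return
        df = sqlContext.read.json(rdd)        # infer the schema for each batch
        df.registerTempTable("turbine")       # hypothetical table name
        sqlContext.sql("SELECT id, ht, rotor_rpm FROM turbine").show()

    dstream.foreachRDD(process)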

Re: spark-ec2 scripts with spark-2.0.0-preview

2016-06-14 Thread Shivaram Venkataraman
Can you open an issue on https://github.com/amplab/spark-ec2 ? I think we should be able to escape the version string and pass the 2.0.0-preview through the scripts Shivaram On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar wrote: > Hi, > > The spark-ec2 scripts are

SparkContext#cancelJobGroup : is it safe? Who got burn? Who is alive?

2016-06-14 Thread Bertrand Dechoux
Hi, I am wondering about the safety of the *SparkContext#cancelJobGroup* method, which should allow stopping specific (i.e. not all) jobs inside a Spark context. There is a big disclaimer (
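For context, a minimal sketch of the API being discussed (the group id and the job itself are placeholders; whether interruptOnCancel is safe for a given workload is exactly the question raised here):

    # tag every job submitted from this thread with a group id (placeholder name)
    sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel=True)
    result = sc.parallelize(range(1000000)).map(lambda x: x * 2).count()

    # from another thread (e.g. a monitoring thread), cancel everything in that group
    sc.cancelJobGroup("nightly-etl")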

spark-ec2 scripts with spark-2.0.0-preview

2016-06-14 Thread Sunil Kumar
Hi, The spark-ec2 scripts are missing from spark-2.0.0-preview. Is there a workaround available? I tried to change the ec2 scripts to accommodate spark-2.0.0... If I call the release spark-2.0.0-preview, then it barfs because the command line argument --spark-version=spark-2.0.0-preview gets

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Xinh Huynh
Hi Arun, This documentation may be helpful: The 2.0-preview Scala doc for Dataset class: http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Dataset Note that the Dataset API has completely changed from 1.6. In 2.0, there is no separate DataFrame class. Rather,

Re: Spark-SQL with Oozie

2016-06-14 Thread nsalian
Hi, Thanks for the question. This would be a good starting point for your Oozie workflow application with a Spark action. - Neelesh S. Salian Cloudera -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-with-Oozie-tp27167p27168.html Sent from

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Michael Armbrust
> > 1) What does this really mean to an Application developer? > It means there are fewer concepts to learn. > 2) Why was this unification needed in Spark 2.0? > To simplify the API and reduce the number of concepts that needed to be learned. We only didn't do it in 1.6 because we didn't want

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Mich Talebzadeh
Hi Swetha, Have you actually tried doing this in Hive using Hive CLI or beeline? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Mich Talebzadeh
In all probability there is no user database created in Hive. Create a database yourself: sql("create database if not exists test") It would help to grasp some concepts of Hive databases etc. HTH Dr Mich Talebzadeh LinkedIn *

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Arun Patel
Can anyone answer these questions, please? On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel wrote: > Thanks Michael. > > I went through these slides already and could not find answers for these > specific questions. > > I created a Dataset and converted it to DataFrame in 1.6

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread swetha kasireddy
Hi Bijay, This approach might not work for me as I have to do partial inserts/overwrites in a given table and data_frame.write.partitionBy will overwrite the entire table. Thanks, Swetha On Mon, Jun 13, 2016 at 9:25 PM, Bijay Pathak wrote: > Hi Swetha, > > One
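Two hedged workarounds often used for this kind of partial write in 1.6 (paths, partition column, and date value are placeholders; neither is an official "overwrite one partition" API):

    # (a) append new partition directories without touching existing ones
    df.write.mode("append").partitionBy("dt").parquet("/data/events")

    # (b) overwrite a single partition by writing directly into its directory
    one_day = df.filter(df.dt == "2016-06-14").drop("dt")
    one_day.write.mode("overwrite").parquet("/data/events/dt=2016-06-14")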

restarting of spark streaming

2016-06-14 Thread Chen, Yan I
Hi, I notice that in the process of restarting, Spark Streaming will try to recover/replay all the batches it missed. But in this process, will streams be checkpointed the way they are checkpointed during normal processing? Does anyone know? Sometimes our cluster goes down for maintenance, and our
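For reference, the checkpoint-based recovery setup being discussed looks roughly like this (a sketch; the checkpoint path, batch interval, and stream definitions are placeholders):

    from pyspark.streaming import StreamingContext

    CHECKPOINT = "hdfs:///tmp/stream-checkpoint"   # placeholder path

    def create_context():
        ssc = StreamingContext(sc, 10)             # 10-second batches
        ssc.checkpoint(CHECKPOINT)
        # ... define the Kafka streams and transformations here ...
        return ssc

    # on restart this rebuilds the context from the checkpoint and replays missed batches
    ssc = StreamingContext.getOrCreate(CHECKPOINT, create_context)
    ssc.start()
    ssc.awaitTermination()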

Spark-SQL with Oozie

2016-06-14 Thread chandana
Hello, I would like to configure a Spark Oozie action to execute spark-sql from a file on AWS EMR. spark-sql -f Has anybody tried this with an Oozie Spark action? If so, please post your spark action xml. Thanks in advance! -- View this message in context:

Spark SQL NoSuchMethodException...DriverWrapper.&lt;init&gt;()

2016-06-14 Thread Mirko Bernardoni
Hi All, I'm using Spark 1.6.1 and I'm getting the error below. This also appears with the current branch-1.6. The code that is generating the error is loading a table from MS SQL Server. I've also checked whether the Microsoft JDBC driver is loaded correctly, and it is (I'm using an uber jar with all the

Re: MatchError: STRINGTYPE

2016-06-14 Thread Ted Yu
Can you give a bit more detail? Version of Spark, complete error trace, and a code snippet which reproduces the error. On Tue, Jun 14, 2016 at 9:54 AM, pseudo oduesp wrote: > hello > > why i get this error > > when using > > assembleur = VectorAssembler( inputCols=l_CDMVT,

MatchError: STRINGTYPE

2016-06-14 Thread pseudo oduesp
Hello, why do I get this error when using assembleur = VectorAssembler( inputCols=l_CDMVT, outputCol="aev"+"CODEM") output = assembler.transform(df_aev) L_CDMTV is a list of columns. Thanks
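A cleaned-up sketch of the snippet above (note the original mixes `assembleur`/`assembler` and `l_CDMVT`/`L_CDMTV`). As an assumption given the truncated trace: a MatchError on StringType usually means one of the input columns is a string, which VectorAssembler does not accept, so string columns generally need a StringIndexer/OneHotEncoder first.

    from pyspark.ml.feature import VectorAssembler

    # inputCols must contain only numeric/boolean/vector columns; string columns
    # need to be indexed/encoded before assembling
    assembler = VectorAssembler(inputCols=l_CDMVT, outputCol="aevCODEM")
    output = assembler.transform(df_aev)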

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Chris Fregly
+1 vote, +1 watch this would be huge. On Tue, Jun 14, 2016 at 10:47 AM, andy petrella wrote: > kool, voted and watched! > tx > > On Tue, Jun 14, 2016 at 4:44 PM Cody Koeninger wrote: > >> I haven't done any significant work on using structured

Re: how to investigate skew and DataFrames and RangePartitioner

2016-06-14 Thread Takeshi Yamamuro
Hi, I'm afraid there is currently no API to define a RangePartitioner on a DataFrame. // maropu On Tue, Jun 14, 2016 at 5:04 AM, Peter Halliday wrote: > I have two questions > > First, I have a failure when I write parquet from Spark 1.6.1 on Amazon EMR > to S3. This is full batch,

Re: Is there a limit on the number of tasks in one job?

2016-06-14 Thread Khaled Hammouda
Yes, I check Spark UI to follow what’s going on. It seems to start several tasks fine (8 tasks in my case) out of ~70k tasks, and then stalls. I actually was able to get things to work by disabling dynamic allocation. Basically I set the number of executors manually, which disables dynamic
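A sketch of the configuration Khaled describes, disabling dynamic allocation and fixing the executor count manually (the count is a placeholder):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.dynamicAllocation.enabled", "false")
            .set("spark.executor.instances", "50"))   # executor count is a placeholder
    sc = SparkContext(conf=conf)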

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread andy petrella
kool, voted and watched! tx On Tue, Jun 14, 2016 at 4:44 PM Cody Koeninger wrote: > I haven't done any significant work on using structured streaming with > kafka, there's a jira ticket for tracking purposes > > https://issues.apache.org/jira/browse/SPARK-15406 > > > > On

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Cody Koeninger
I haven't done any significant work on using structured streaming with kafka, there's a jira ticket for tracking purposes https://issues.apache.org/jira/browse/SPARK-15406 On Tue, Jun 14, 2016 at 9:21 AM, andy petrella wrote: > Heya folks, > > Just wondering if there

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I managed to get remote debugging up and running and can in fact reproduce the error and get a breakpoint triggered as it happens. But it seems like the code does not go through TextInputFormat, or at least the breakpoint is not triggered from this class? Don't know what other class to look for

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Sree Eedupuganti
Hi Spark users, I am new to Spark. I am trying to connect to Hive using SparkJavaContext but am unable to connect to the database. By executing the code below I can see only the "default" database. Can anyone help me out? What I need is a sample program for querying Hive results using SparkJavaContext. Need to

Re: RBM in mllib

2016-06-14 Thread Krishna Kalyan
Hi Robert, According to the JIRA the resolution is Won't Fix. The pull request was closed as it did not merge cleanly with master. (https://github.com/apache/spark/pull/3222) On Tue, Jun 14, 2016 at 4:23 PM, Roberto Pagliari wrote: > Is RBM being developed? > >

RBM in mllib

2016-06-14 Thread Roberto Pagliari
Is RBM being developed? This one is marked as resolved, but it is not https://issues.apache.org/jira/browse/SPARK-4251

[Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread andy petrella
Heya folks, Just wondering if there is any doc regarding using Kafka directly from the reader.stream? Has it been integrated already (I mean the source)? Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^) thanks, cheers andy -- andy

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I'm pretty confident the lines are encoded correctly since I can read them both locally and on Spark (by ignoring the faulty line and proceeding to the next). I also get the correct number of lines through Spark, again by ignoring the faulty line. I get the same error by reading the original file using

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Thanks for you help. Really appreciate it! Give me some time i'll come back after I've tried your suggestions. On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren wrote: > I cannot reproduce it by running the file through Spark in local mode > on my machine. So it does indeed

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen
It takes a little setup, but you can do remote debugging: http://danosipov.com/?p=779 ... and then use similar config to connect your IDE to a running executor. Before that you might strip your program down to only a call to textFile that then checks the lines according to whatever logic would
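A minimal sketch of the stripped-down check Sean suggests: read the file with nothing but textFile and flag lines that fail to decode (the path is a placeholder, and the base64 test stands in for whatever logic identifies the bad line):

    import base64

    def is_bad(line):
        try:
            base64.b64decode(line)
            return False
        except Exception:
            return True

    bad = sc.textFile("hdfs:///logs/*.gz").filter(is_bad)   # placeholder path
    print(bad.count())
    print(bad.take(5))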

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I cannot reproduce it by running the file through Spark in local mode on my machine. So it does indeed seem to be something related to the split across partitions. On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren wrote: > Can you do remote debugging in Spark? Didn't know that.

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Can you do remote debugging in Spark? Didn't know that. Do you have a link? I also noticed isSplittable in org.apache.hadoop.mapreduce.lib.input.TextInputFormat, which checks for org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there is some way to tell it not to split? On Tue, Jun

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen
It really sounds like the line is being split across partitions. This is what TextInputFormat does but should be perfectly capable of putting together lines that break across files (partitions). If you're into debugging, that's where I would start if you can. Breakpoints around how TextInputFormat

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
That's funny. The line after is the rest of the whole line that got split in half. Every following line after that is fine. I managed to reproduce it without gzip as well, so maybe it's not gzip's fault after all... I'm clueless... On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren

Re: Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Mich Talebzadeh
it is good to be in control :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 14 June 2016 at

Re: Spark 2.0.0 : GLM problem

2016-06-14 Thread april_ZMQ
To update the post: • First problem: this can be solved by adding an epsilon (a very small value) to zero values, because the Poisson model doesn't allow the y value to be zero (in general, GLM doesn't have this requirement). But now I encounter another problem: in every GLM
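A sketch of the epsilon workaround described above (the label column name and epsilon value are placeholders):

    from pyspark.sql.functions import col, when, lit

    EPS = 1e-6   # placeholder epsilon
    df_fixed = df.withColumn(
        "label", when(col("label") == 0, lit(EPS)).otherwise(col("label")))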

Re: Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Patrick Duin
Thanks, yes I have something similar working as "the alternative solution". :) I was hoping to get away with not having to specify my schema so the sqlContext.createExternalTable seemed like a nice clean approach. 2016-06-14 13:59 GMT+02:00 Mich Talebzadeh : > Try

Re: Spark Streaming application failing with Kerboros issue while writing data to HBase

2016-06-14 Thread Kamesh
Thanks Ted. Thanks & Regards Kamesh. On Mon, Jun 13, 2016 at 10:48 PM, Ted Yu wrote: > Can you show snippet of your code, please ? > > Please refer to obtainTokenForHBase() in > yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala > > Cheers > > On

Re: Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Mich Talebzadeh
Try this, it will work: sql("use test") sql("drop table if exists test.orctype") var sqltext: String = "" sqltext = """ CREATE EXTERNAL TABLE test.orctype( prod_id bigint, cust_id bigint, time_id timestamp, channel_id bigint, promo_id bigint,
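A trimmed-down sketch of the same approach adapted to Patrick's partitioned case (table name, columns, partition column, and location are placeholders; this assumes a HiveContext is available as sqlContext):

    sqlContext.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS test.orctype_part (
            prod_id BIGINT,
            cust_id BIGINT)
        PARTITIONED BY (time_id STRING)
        STORED AS ORC
        LOCATION '/tmp/location/'
    """)

    # register an existing partition directory with the table
    sqlContext.sql(
        "ALTER TABLE test.orctype_part ADD IF NOT EXISTS "
        "PARTITION (time_id='2016-06-14') "
        "LOCATION '/tmp/location/time_id=2016-06-14'")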

Running streaming applications in Production environment

2016-06-14 Thread Mail.com
Hi All, Can you please advise on best practices for running streaming jobs in production that read from Kafka? How do we trigger them - through a start script? And what are the best ways to monitor that the application is running and send an alert when it is down, etc.? Thanks, Pradeep

Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-14 Thread Chanh Le
I am testing Spark 2.0. I load data from Alluxio and cache it, then I query. The first query is OK because it kicks off the cache action, but when I run the query again it's stuck. I ran on a 5-node cluster in spark-shell. Has anyone had this issue?

Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Patrick Duin
Hi, I'm trying to use sqlContext.createExternalTable("my_table", "/tmp/location/", "orc") to create tables. This is working fine for non-partitioned tables. I'd like to create a partitioned table though, how do I do that? Can I add some information in the options: Map[String, String] parameter?

Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang
Can you read this file using an MR job? On Tue, Jun 14, 2016 at 5:26 PM, Sean Owen wrote: > It's really the MR InputSplit code that splits files into records. > Nothing particularly interesting happens in that process, except for > breaking on newlines. > > Do you have one

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen
It's really the MR InputSplit code that splits files into records. Nothing particularly interesting happens in that process, except for breaking on newlines. Do you have one huge line in the file? Are you reading it as a text file? Can you give any more detail about exactly how you parse it? it

Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Hi, We have log files that are written as base64-encoded text files (gzipped) where each line ends with a newline character. For some reason a particular line [1] is split by Spark [2], making it unparsable by the base64 decoder. It does this consistently no matter whether I give it the

Re: Suggestions on Lambda Architecture in Spark

2016-06-14 Thread Jörn Franke
You do not describe use cases, but technologies. First be aware of your needs and then check technologies. Otherwise nobody can help you properly and you will end up with an inefficient stack for your needs. > On 14 Jun 2016, at 00:52, KhajaAsmath Mohammed > wrote: >

cluster mode for Python on standalone cluster

2016-06-14 Thread Jan Sourek
The official documentation states 'Currently only YARN supports cluster mode for Python applications.' I would like to know if work is being done or planned to support cluster mode for Python applications on standalone spark clusters? Does anyone know if this is part of the roadmap for Spark 2.0 -

Re: Limit pyspark.daemon threads

2016-06-14 Thread agateaaa
Hi, I am seeing this issue too with pyspark (using Spark 1.6.1). I have set spark.executor.cores to 1, but whenever a streaming batch starts processing data, I see python -m pyspark.daemon processes increase gradually to about 5 (increasing CPU% on a box about 4-5 times), each

Re: Suggestions on Lambda Architecture in Spark

2016-06-14 Thread Sean Owen
Our labs project oryx is intended to be pretty much a POC of the lambda architecture on Spark (for ML): http://oryx.io/ You might consider reusing bits of that. On Mon, Jun 13, 2016 at 11:52 PM, KhajaAsmath Mohammed wrote: > Hi, > > In my current project, we are planning

RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-14 Thread Ravi Aggarwal
Hi, Is there any breakthrough here? I had one more observation while debugging the issue Here are the 4 types of data I had: Da -> stored in parquet Di -> stored in parquet Dl1 -> parquet version of lookup Dl2 -> hbase version of lookup Joins performed and type of join done by spark: Da and Di