Kinesis streaming misunderstanding..?

2017-01-26 Thread Graham Clark
Hi everyone - I am building a small prototype in Spark 1.6.0 (Cloudera) to read information from Kinesis and write it to HDFS in Parquet format. The write seems very slow and, if I understand Spark's diagnostics correctly, it always seems to run on the same executor, one partition after the other,
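
A hedged sketch (not the poster's code) of the usual pattern for spreading Kinesis ingestion across executors: one receiver per shard, then union and repartition before heavier work. Application name, stream name, shard count, endpoint and intervals below are assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

    val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-to-parquet"), Seconds(30))

    // One receiver per shard (shard count assumed), so ingestion is not pinned
    // to a single executor; then union the streams and repartition.
    val numShards = 4
    val streams = (1 to numShards).map { _ =>
      KinesisUtils.createStream(
        ssc, "kinesis-to-parquet", "myStream",              // app and stream names assumed
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, Seconds(30),
        StorageLevel.MEMORY_AND_DISK_2)
    }
    val records = ssc.union(streams).repartition(8)         // spread processing across executors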

Re: kafka structured streaming source refuses to read

2017-01-26 Thread Koert Kuipers
my little program prints out query.lastProgress every 10 seconds, and this is what it shows: { "id" : "4cc09da1-c002-4fa7-86dd-ba03018e53a0", "runId" : "bf8cdde2-44d0-4bff-ad90-7cbac0432099", "name" : "wiki", "timestamp" : "2017-01-26T22:54:45.732Z", "numInputRows" : 0,

Re: spark intermediate data fills up the disk

2017-01-26 Thread kanth909
Hi! Yes, these files are for shuffle blocks, but they need to be cleaned up as well, right? I had been running a streaming application for 2 days. On the third day my disk filled up with .index and .data files, and my assumption is that these files had been there since the start of my streaming

Re: eager? in dataframe's checkpoint

2017-01-26 Thread Burak Yavuz
Hi, One of the goals of checkpointing is to cut the RDD lineage. Otherwise you run into StackOverflowExceptions. If you eagerly checkpoint, you basically cut the lineage there, and the next operations all depend on the checkpointed DataFrame. If you don't checkpoint, you continue to build the
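
A minimal sketch of the eager behaviour described above, assuming a Spark 2.1 SparkSession named spark and a hypothetical checkpoint directory:

    // Set a reliable checkpoint location first (path is hypothetical).
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

    val df = spark.range(0, 1000000).toDF("id")

    // eager = true materializes the DataFrame now and truncates its lineage;
    // downstream operations depend on the checkpointed data, not the original plan.
    val checkpointed = df.checkpoint(eager = true)
    checkpointed.filter("id % 2 = 0").count()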

Re: spark narrow vs wide dependency

2017-01-26 Thread Shushant Arora
3. Also, can mapPartitions go out of memory if I return the ArrayList of the whole partition after processing the partition? What's the alternative to this if it can fail? On Fri, Jan 27, 2017 at 9:32 AM, Shushant Arora wrote: > Hi > > I have two transformations

spark narrow vs wide dependency

2017-01-26 Thread Shushant Arora
Hi I have two transformations in series. rdd1 = sourcerdd.map(new Function(...)); //step1 rdd2 = rdd1.mapPartitions(new Function(...)); //step2 1. Are map and mapPartitions narrow dependencies? Does Spark optimise the DAG and execute step 1 and step 2 in a single stage, or will there be two stages?
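
For reference, a small sketch (placeholder data and functions, not the poster's code) of why the two steps normally collapse into one stage - both map and mapPartitions are narrow dependencies, so Spark pipelines them:

    val sourceRdd = sc.textFile("hdfs:///input")                      // hypothetical source, assumes a SparkContext `sc`
    val rdd1 = sourceRdd.map(line => line.trim)                       // step 1: narrow dependency
    val rdd2 = rdd1.mapPartitions(iter => iter.map(_.toUpperCase))    // step 2: narrow dependency
    rdd2.count()   // shows up as a single stage in the UI (no shuffle between the steps)

Returning an iterator from mapPartitions (rather than materializing a full ArrayList per partition) also keeps memory usage bounded.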

kafka structured streaming source refuses to read

2017-01-26 Thread Koert Kuipers
hey, i am just getting started with kafka + spark structured streaming. so this is probably a pretty dumb mistake. i wrote a little program in spark to read messages from a kafka topic and display them in the console, using the kafka source and console sink. i run it in spark local mode. i
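
A minimal sketch of the kind of program described (assumes the spark-sql-kafka package on the classpath; broker address and topic name are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wiki")
      .getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // hypothetical broker
      .option("subscribe", "wikipedia")                       // hypothetical topic
      .option("startingOffsets", "earliest")                  // read from the beginning while testing
      .load()

    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()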

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
Hi I will do a little more testing and will let you know. It did not work with INT and Number types, for sure. While writing, everything is fine :) On Fri, Jan 27, 2017 at 1:04 PM, Takeshi Yamamuro wrote: > How about this? >

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread Takeshi Yamamuro
How about this? https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L729 Or, how about using Double or something instead of Numeric? // maropu On Fri, Jan 27, 2017 at 10:25 AM, ayan guha wrote: > Okay, it is working

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
Okay, it is working with varchar columns only. Is there any way to work around this? On Fri, Jan 27, 2017 at 12:22 PM, ayan guha wrote: > hi > > I thought so too, so I created a table with INT and Varchar columns > > desc agtest1 > > Name Null Type > -

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
hi I thought so too, so I created a table with INT and Varchar columns:

desc agtest1

Name  Null  Type
----  ----  -------------
PID         NUMBER(38)
DES         VARCHAR2(100)

url = "jdbc:oracle:thin:@mpimpclu1-scan:1521/DEVAIM"
table = "agtest1"
user = "bal"
password = "bal"

Issue creating row with java.util.Map type

2017-01-26 Thread Ankur Srivastava
Hi, I am trying to map a Dataset whose rows have a map attribute. When I try to create a Row with the map attribute I get cast errors. I am able to reproduce the issue with the sample code below. The surprising thing is that with the same schema I am able to create a dataset from the List of rows. I
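
The sample code isn't included in this preview; below is a hedged guess at the pattern and the usual workaround (converting the java.util.Map to a Scala map before building the Row), assuming a SparkSession named spark and hypothetical fields:

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Hypothetical schema with a map attribute.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("attrs", MapType(StringType, StringType))))

    val javaMap = new java.util.HashMap[String, String]()
    javaMap.put("color", "red")

    // Converting to a Scala map avoids the cast error, since catalyst
    // expects scala.collection.Map for MapType fields in a Row.
    val row = Row("a", javaMap.asScala.toMap)
    val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)
    df.show()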

Re: spark intermediate data fills up the disk

2017-01-26 Thread Takeshi Yamamuro
Yea, I think so; they are the intermediate files for shuffling. Probably kant checked the configuration here (http://spark.apache.org/docs/latest/spark-standalone.html), though this is not related to the issue. // maropu On Fri, Jan 27, 2017 at 7:46 AM, Jacek Laskowski

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread Takeshi Yamamuro
Hi, I think you got this error because you used `NUMERIC` types in your schema ( https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L32). So, IIUC avoiding the type is a workaround. // maropu On Fri, Jan 27, 2017 at 8:18 AM, ayan
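
One hedged sketch of a workaround along these lines: register a custom dialect that maps NUMERIC columns itself, so the read no longer depends on the missing scale metadata. The object name and the chosen precision/scale are assumptions, not the fix from the thread:

    import java.sql.Types
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types._

    object OracleNumberWorkaround extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

      // Map NUMBER/NUMERIC to a fixed DecimalType instead of reading the
      // (possibly absent) scale from the JDBC metadata.
      override def getCatalystType(
          sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
        if (sqlType == Types.NUMERIC) Some(DecimalType(38, 10)) else None
      }
    }

    JdbcDialects.registerDialect(OracleNumberWorkaround)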

structured streaming 2.1.0 kafka driver --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' works on YARN but having trouble on standalone cluster mode

2017-01-26 Thread Heji Kim
Hello everyone, Currently we are testing structured streaming kafka drivers. We submit on YARN(2.7.3) with --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0', without problems. However when we try to launch on spark standalone with deploy mode=cluster, we get the

Re: How to reduce number of tasks and partitions in Spark job?

2017-01-26 Thread Jacek Laskowski
Repartition Jacek On 26 Jan 2017 6:13 p.m., "Md. Rezaul Karim" < rezaul.ka...@insight-centre.org> wrote: > Hi All, > > When I run a Spark job on my local machine (having 8 cores and 16GB of > RAM) on an input data of 6.5GB, it creates 193 parallel tasks and put > the output into 193 partitions.
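
A minimal sketch of that suggestion, with a placeholder DataFrame df and output path:

    df.repartition(1)                      // one partition -> one write task -> one output file
      .write
      .parquet("hdfs:///output/single-file")

    // coalesce(1) achieves the same without a full shuffle, at the cost of
    // running the final computation in a single task.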

Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-01-26 Thread ayan guha
Hi I am facing the exact issue with Oracle/Exadata as mentioned here. Any idea? I could not figure it out, so I am sending it to this group hoping someone has seen it (and solved it). Spark Version: 1.6 pyspark command: pyspark

Re: Cached table details

2017-01-26 Thread Jacek Laskowski
Hi, I think that the only way to get the information about a cached RDD is to use SparkListener and intercept respective events about cached blocks on BlockManagers. Jacek On 25 Jan 2017 5:54 a.m., "kumar r" wrote: Hi, I have cached some table in Spark Thrift Server. I
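
A rough sketch of that approach, assuming a SparkContext named sc; the listener and its output format are illustrative only:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

    // Watch block updates reported by the BlockManagers to see what is cached where.
    class CachedBlockListener extends SparkListener {
      override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
        val info = event.blockUpdatedInfo
        println(s"block=${info.blockId} level=${info.storageLevel} " +
          s"memSize=${info.memSize} diskSize=${info.diskSize}")
      }
    }

    sc.addSparkListener(new CachedBlockListener)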

Re: spark intermediate data fills up the disk

2017-01-26 Thread Jacek Laskowski
Hi, The files are for shuffle blocks. Where did you find the docs about them? Jacek On 25 Jan 2017 8:41 p.m., "kant kodali" wrote: oh sorry its actually in the documentation. I should just set spark.worker.cleanup.enabled = true On Wed, Jan 25, 2017 at 11:30 AM, kant

Re: json_tuple fails to parse string with emoji

2017-01-26 Thread Andrew Ehrlich
It looks like I'm hitting this bug in jackson-core 2.2.3 which is included in the version of CDH I'm on: https://github.com/FasterXML/jackson-core/issues/115 Jackson-core 2.3.0 has the fix. On Tue, Jan 24, 2017 at 5:14 PM, Andrew Ehrlich wrote: > On Spark 1.6.0, calling

RE: spark 2.02 error when writing to s3

2017-01-26 Thread VND Tremblay, Paul
This seems to have done the trick, although I am not positive. If I have time, I'll test spinning up a cluster with and without consistent view to pinpoint the error. _ Paul Tremblay Analytics

RE: Ingesting Large csv File to relational database

2017-01-26 Thread VND Tremblay, Paul
What relational db are you using? We do this at work, and the way we handle it is to unload the db into Spark (actually, we unload it to S3 and then into Spark). Redshift is very efficient at dumping tables this way.

Re: Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-26 Thread Suresh Thalamati
I notice columns are quoted with double quotes in the error message ('"user","age","state")). By chance did you override the MySQL JDBC dialect? Default MySQL identifiers are quoted with backticks: override def quoteIdentifier(colName: String): String = { s"`$colName`" } Just wondering if the error
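
For context, a sketch of how a dialect with that backtick quoting would be registered (the object name is illustrative; Spark's built-in MySQL dialect already quotes identifiers this way):

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    object BacktickMySQLDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

      // Quote identifiers with backticks, as MySQL expects, rather than double quotes.
      override def quoteIdentifier(colName: String): String = s"`$colName`"
    }

    JdbcDialects.registerDialect(BacktickMySQLDialect)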

eager? in dataframe's checkpoint

2017-01-26 Thread Jean Georges Perrin
Hey Sparkers, Trying to understand the Dataframe's checkpoint (not in the context of streaming) https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)

Re: How to tune number of tasks

2017-01-26 Thread Md. Rezaul Karim
Hi, If you require all the partitions to be saved with saveAsTextFile you can use coalesce(1,true).saveAsTextFile(). This basically means do the computation and then coalesce to only 1 partition. You can also use repartition(1), which is just a wrapper for coalesce that sets the shuffle
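
A minimal sketch of that suggestion, with placeholder paths and a placeholder parse step:

    val processed = sc.textFile("hdfs:///data/input")   // hypothetical input, assumes a SparkContext `sc`
      .map(_.trim)                                       // placeholder for the real processing

    // Shuffle-coalesce down to a single partition so the job writes one file.
    processed.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///data/output")

    // repartition(1) is equivalent to coalesce(1, shuffle = true).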

Re: Java heap error during matrix multiplication

2017-01-26 Thread Burak Yavuz
Hi, Have you tried creating more column blocks? BlockMatrix matrix = cmatrix.toBlockMatrix(100, 100); for example. Is your data randomly spread out, or do you generally have clusters of data points together? On Wed, Jan 25, 2017 at 4:23 AM, Petr Shestov wrote: > Hi
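
A sketch of that suggestion in Scala (tiny placeholder data; the real entries RDD would come from the poster's input):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    // Smaller blocks mean more, smaller multiplication tasks and less heap
    // pressure per task than the default 1024 x 1024 blocks.
    val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
    val cmatrix = new CoordinateMatrix(entries)
    val blocks  = cmatrix.toBlockMatrix(100, 100).cache()
    val product = blocks.multiply(blocks)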

How to reduce number of tasks and partitions in Spark job?

2017-01-26 Thread Md. Rezaul Karim
Hi All, When I run a Spark job on my local machine (having 8 cores and 16GB of RAM) on an input data of 6.5GB, it creates 193 parallel tasks and puts the output into 193 partitions. How can I change the number of tasks and, consequently, the number of output files - say, down to just one?

Re: Issue returning Map from UDAF

2017-01-26 Thread Ankur Srivastava
Thank you Takeshi, that was the issue. Thanks Ankur Sent from my iPhone > On Jan 25, 2017, at 9:08 PM, Takeshi Yamamuro wrote: > > Hi, > > Quickly looking around the attached, I found you wrongly passed the dataType > of your aggregator output in line70. > So, you need

Re: can we plz open up encoder on dataset

2017-01-26 Thread Koert Kuipers
oh the map in DataFrame is actually using a RowEncoder. i left it out because it wasn't important: so this doesn't compile: def f[T]: Dataset[T] => Dataset[T] = dataset => { val df = dataset.toDF df.map(row => row)(RowEncoder(df.schema)).as[T] } now this does compile. but i don't like it,

How to tune number of tasks

2017-01-26 Thread Soheila S.
Hi all, Please tell me how I can tune the number of output partitions. I run my Spark job on my local machine with 8 cores and input data of 6.5GB. It creates 193 tasks and puts the output into 193 partitions. How can I change the number of tasks and, consequently, the number of output files? Best,

Re: can we plz open up encoder on dataset

2017-01-26 Thread Jacek Laskowski
Hi Koert, map will take the value that has an implicit Encoder to any value that may or may not have an encoder in scope. That's why I'm asking about the map function to see what it does. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0

RE: [PySpark 2.1.0] - SparkContext not properly initialized by SparkConf

2017-01-26 Thread Sidney Feiner
I think I'm getting close to finding the reason: When I initialize the SparkContext, the following code is executed: def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls): self.environment = environment or {} # java

Re: can we plz open up encoder on dataset

2017-01-26 Thread Koert Kuipers
the map operation works on DataFrame so it doesn't need an encoder. It could have been any operation on DataFrame. the issue is at the end going back to Dataset[T] using as[T]. this requires an encoder for T which i know i already have since i started with a Dataset[T]. i could add an implicit
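
A hedged sketch of that idea - carrying the encoder as a context bound so as[T] can find it (not necessarily the API change being requested):

    import org.apache.spark.sql.{Dataset, Encoder}
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // The caller supplies Encoder[T] once; the intermediate DataFrame step
    // uses a RowEncoder, and as[T] resolves the implicit from the context bound.
    def f[T: Encoder]: Dataset[T] => Dataset[T] = dataset => {
      val df = dataset.toDF
      df.map(row => row)(RowEncoder(df.schema)).as[T]
    }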

Re: is it possible to read .mdb file in spark

2017-01-26 Thread Richard Siebeling
Hi, haven't used it, but Jackcess should do the trick: http://jackcess.sourceforge.net/ kind regards, Richard 2017-01-25 11:47 GMT+01:00 Selvam Raman :
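
A rough sketch of how Jackcess could feed Spark, assuming the .mdb fits on the driver and a SparkContext named sc; file, table and column names are hypothetical:

    import java.io.File
    import scala.collection.JavaConverters._
    import com.healthmarketscience.jackcess.DatabaseBuilder

    // Read the Access table on the driver with Jackcess, then parallelize it.
    val db = DatabaseBuilder.open(new File("/data/legacy.mdb"))
    val rows = db.getTable("Customers").asScala.toSeq
      .map(r => (String.valueOf(r.get("Id")), String.valueOf(r.get("Name"))))
    db.close()

    val rdd = sc.parallelize(rows)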

DML in Spark ETL

2017-01-26 Thread A Shaikh
In the past we used an ETL tool whose tasks would update, insert and delete rows in target database tables (Oracle/Netezza). Spark's Dataset (and RDD) has .save* methods which can insert rows. How do I delete or update records in a database table from Spark?
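
Spark's JDBC writer only inserts (append/overwrite), so updates and deletes are typically issued by hand, for example per partition over a plain JDBC connection. A hedged sketch with hypothetical connection details, table and columns:

    import java.sql.DriverManager

    df.foreachPartition { rows =>
      // One connection and batched statement per partition (details are placeholders).
      val conn = DriverManager.getConnection("jdbc:oracle:thin:@host:1521/service", "user", "pass")
      val stmt = conn.prepareStatement("UPDATE target_table SET status = ? WHERE id = ?")
      rows.foreach { row =>
        stmt.setString(1, row.getAs[String]("status"))
        stmt.setLong(2, row.getAs[Long]("id"))
        stmt.addBatch()
      }
      stmt.executeBatch()
      stmt.close()
      conn.close()
    }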

Re: can we plz open up encoder on dataset

2017-01-26 Thread Jacek Laskowski
Hi, Can you show the code from map to reproduce the issue? You can create encoders using Encoders object (I'm using it all over the place for schema generation). Jacek On 25 Jan 2017 10:19 p.m., "Koert Kuipers" wrote: > i often run into problems like this: > > i need to

Re: where is mapWithState executed?

2017-01-26 Thread Jacek Laskowski
Hi, Shooting in the dark... it's executed on executors (it's the older RDD-based API, so not many extra optimizations like in Spark SQL now). Can you show the code? I'm scared to hear that you're trying to broadcast inside a transformation, which I'd believe is impossible. Jacek On 26 Jan 2017 12:18

Re: Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-26 Thread Didac Gil
Are you sure that "age" is a numeric field? Even if it is numeric, you could pass the "44" between quotes: INSERT into your_table ("user","age","state") VALUES ('user3','44','CT') Are you sure there are no more fields that are specified as NOT NULL and for which you did not provide a value (besides user,