Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Change this: unionDS.repartition(numPartitions); unionDS.createOrReplaceTempView(... to: unionDS.repartition(numPartitions).createOrReplaceTempView(... On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed, wrote: > val unionDS = rawDS.union(processedDS) >
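
A minimal sketch of why the chaining matters (names are illustrative, not from the original job): repartition does not modify the Dataset it is called on, it returns a new one, so the repartitioned result has to be the object that gets registered as the view.

```
// Assumes an existing SparkSession `spark` and Dataset `unionDS`.
val numPartitions = 8

// No effect on later use of unionDS: the repartitioned Dataset is discarded.
unionDS.repartition(numPartitions)
unionDS.createOrReplaceTempView("union_view")   // still the original partitioning

// The repartitioned Dataset is the one registered as the view.
unionDS.repartition(numPartitions).createOrReplaceTempView("union_view")
```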

Re: partition by multiple columns/keys

2017-10-17 Thread ayan guha
How, or rather what, do you want to achieve? I.e. are you planning to do some aggregation on group by c1, c2? On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad wrote: > Hi, > > I have a set of rows that are a result of a > groupBy(col1,col2,col3).count(). > > Is it possible to map rows belonging to

Re: No space left on device

2017-10-17 Thread Imran Rajjad
don't think so. check out the documentation for this method On Wed, Oct 18, 2017 at 10:11 AM, Mina Aslani wrote: > I have not tried rdd.unpersist(), I thought using rdd = null is the same, > is it not? > > On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad
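
A small sketch of the difference being discussed (path and names are made up): dropping the reference only lets the driver-side handle be garbage-collected eventually, while unpersist() tells the executors to release the cached blocks.

```
// `sc` is an existing SparkContext; the path is illustrative.
var rdd = sc.textFile("hdfs:///some/path").persist()
rdd.count()                        // materialises the cached blocks on the executors

rdd = null                         // only frees the driver-side reference; executor
                                   // storage is reclaimed lazily, if at all

val other = sc.textFile("hdfs:///some/path").persist()
other.count()
other.unpersist(blocking = true)   // explicitly drops the cached blocks right away
```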

partition by multiple columns/keys

2017-10-17 Thread Imran Rajjad
Hi, I have a set of rows that are the result of a groupBy(col1,col2,col3).count(). Is it possible to map the rows belonging to a unique combination inside an iterator? E.g. col1 col2 col3: a 1 a1, a 1 a2, b 2 b1, b 2 b2. How can I separate rows with col1 and col2 =
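
One hedged way to hand each unique (col1, col2) combination to a function as an iterator, assuming a Dataset shaped like the example above (class and column names are guesses):

```
import org.apache.spark.sql.SparkSession

// Illustrative row type matching the example data.
case class Rec(col1: String, col2: Int, col3: String)

val spark = SparkSession.builder().appName("group-iterator").getOrCreate()
import spark.implicits._

val ds = Seq(
  Rec("a", 1, "a1"), Rec("a", 1, "a2"),
  Rec("b", 2, "b1"), Rec("b", 2, "b2")
).toDS()

// groupByKey routes every row of one (col1, col2) combination to a single
// mapGroups call, which receives them as an Iterator.
val perGroup = ds
  .groupByKey(r => (r.col1, r.col2))
  .mapGroups { case ((c1, c2), rows) =>
    (c1, c2, rows.map(_.col3).toSeq)   // e.g. collect the col3 values per group
  }

perGroup.show(truncate = false)
```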

Re: No space left on device

2017-10-17 Thread Mina Aslani
I have not tried rdd.unpersist(), I thought using rdd = null is the same, is it not? On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad wrote: > did you try calling rdd.unpersist() > > On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani > wrote: > >> Hi, >> >> I

Re: No space left on device

2017-10-17 Thread Imran Rajjad
did you try calling rdd.unpersist() On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani wrote: > Hi, > > I get "No space left on device" error in my spark worker: > > Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr > java.io.IOException: No space left on

Re: No space left on device

2017-10-17 Thread Chetan Khatri
Process the data in micro batches. On 18-Oct-2017 10:36 AM, "Chetan Khatri" wrote: > Your hard drive doesn't have much space > On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote: > >> Hi, >> >> I get "No space left on device" error in my spark worker: >> >>

Re: No space left on device

2017-10-17 Thread Chetan Khatri
Your hard drive doesn't have much space. On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote: > Hi, > > I get "No space left on device" error in my spark worker: > > Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr > java.io.IOException: No space left on device

No space left on device

2017-10-17 Thread Mina Aslani
Hi, I get "No space left on device" error in my spark worker: Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr java.io.IOException: No space left on device In my spark cluster, I have one worker and one master. My program consumes stream of data from kafka and publishes

Re: java.io.NotSerializableException about SparkStreaming

2017-10-17 Thread 高佳翔
Hi Shengshan, In the first code, ‘newAPIJobConfiguration’ is shared across all RDDs, so it has to be serializable. In the second code, each RDD creates a new ‘mytest_config’ object and its own ‘newAPIJobConfiguration’ instead of sharing the same object, so it can be non-serializable. If it’s
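
A hedged sketch of the pattern this describes, assuming the Spark Streaming + HBase setup from the original mail (the DStream shape and configuration keys are illustrative): keep the non-serializable Hadoop Configuration out of any closure Spark has to ship, for example by building it inside foreachPartition on the executors instead of sharing one driver-side instance.

```
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.streaming.dstream.DStream

// `stream` is a hypothetical DStream of (rowKey, value) pairs.
def writeToHBase(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Created on the executor, once per partition, so Spark never has to
      // serialize it and no NotSerializableException can arise from it.
      val hbaseConf = HBaseConfiguration.create()
      hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
      // ... open a connection/table from hbaseConf, write `records`, close ...
      records.foreach(_ => ())   // placeholder for the actual puts
    }
  }
}
```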

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
val unionDS = rawDS.union(processedDS) //unionDS.persist(StorageLevel.MEMORY_AND_DISK) val unionedDS = unionDS.dropDuplicates() //val unionedPartitionedDS=unionedDS.repartition(unionedDS("year"),unionedDS("month"),unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)

SparklyR and the Tidyverse

2017-10-17 Thread Adaryl Wakefield
I'm curious about the inner technical workings of SparklyR. Let's say you have: titanic_train = spark_read_csv(sc, name="titanic_train", path="../Data/titanic_train.csv", header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, repartition = 0, memory =
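
For intuition only, a rough Scala analogue of what that sparklyr call asks Spark to do; the exact internals of spark_read_csv are not shown here, so treat the option mapping and the caching step as assumptions.

```
// Roughly what spark_read_csv(..., memory = TRUE) amounts to on the JVM side.
val titanicTrain = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\\")
  .option("charset", "UTF-8")
  .csv("../Data/titanic_train.csv")

titanicTrain.createOrReplaceTempView("titanic_train")
spark.catalog.cacheTable("titanic_train")   // memory = TRUE keeps the table cached
```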

Appending column to a parquet

2017-10-17 Thread sk skk
Hi, I have two parquet files with different schemas. Based on unique I have to fetch one column value and append it to all rows of the parquet file. I tried a join but I guess due to the diff schemas it's not working. I can use withColumn but can we get a single value of a column and assign it to a
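
A hedged sketch of one way to do what is being asked, with made-up paths and column names: read the single value onto the driver, then stamp it onto every row with withColumn and lit, which sidesteps the schema-mismatch problem of a join.

```
import org.apache.spark.sql.functions.lit

// Illustrative paths and column names.
val lookup = spark.read.parquet("/data/lookup.parquet")
val target = spark.read.parquet("/data/target.parquet")

// Pull the single value out on the driver...
val value: String = lookup.select("some_col").head().getString(0)

// ...and append it as a constant column to every row of the other file.
val result = target.withColumn("some_col", lit(value))

// Alternative that stays distributed: cross join with the one-row selection.
// val result = target.crossJoin(lookup.select("some_col").limit(1))

result.write.mode("overwrite").parquet("/data/target_with_col.parquet")
```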

Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Can you share some code? On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, wrote: > In my case I am just writing the data frame back to hive. so when is the > best case to repartition it. I did repartition before calling insert > overwrite on table > > On Tue, Oct 17,

Serverless ETL

2017-10-17 Thread Benjamin Kim
With AWS having Glue and GCP having Dataprep, is Databricks coming out with an equivalent or better? I know that Serverless is a new offering, but will it go farther with automatic data schema discovery, profiling, metadata storage, change triggering, joining, transform suggestions, etc.? Just

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
In my case I am just writing the data frame back to Hive, so when is the best point to repartition it? I did repartition before calling insert overwrite on the table. On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu wrote: > You have to repartition/coalesce *after *the action

Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
You have to repartition/coalesce *after *the action that is causing the shuffle as that one will take the value you've set On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Yes still I see more number of part files and exactly the number I have > defined
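
A minimal sketch of the ordering being described, with made-up names: spark.sql.shuffle.partitions governs how many partitions the shuffle (the dropDuplicates/join/groupBy) produces, so a coalesce or repartition placed after that shuffle, right before the write, is what actually controls the number of output files.

```
// Illustrative; `df` is whatever DataFrame feeds the insert overwrite.
spark.conf.set("spark.sql.shuffle.partitions", "200")   // governs the shuffle itself

val deduped = df.dropDuplicates()   // shuffle -> 200 partitions

deduped
  .coalesce(10)                     // applied after the shuffle: ~10 output files
  .write
  .mode("overwrite")
  .insertInto("mydb.mytable")       // hypothetical target table
```

If coalesce ends up narrowing the dedup stage's parallelism too much, repartition(10) before the write trades an extra shuffle for keeping the dedup at full width.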

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
Yes, I still see more part files, and exactly the number I have defined in spark.sql.shuffle.partitions. Sent from my iPhone > On Oct 17, 2017, at 2:32 PM, Michael Artz wrote: > > Have you tried caching it and using a coalesce? > > > >> On Oct 17, 2017 1:47

Re: Spark - Partitions

2017-10-17 Thread Michael Artz
Have you tried caching it and using a coalesce? On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" wrote: > I tried repartitions but spark.sql.shuffle.partitions is taking up > precedence over repartitions or coalesce. how to get the lesser number of > files with same

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
I tried repartition but spark.sql.shuffle.partitions is taking precedence over repartition or coalesce. How do I get a smaller number of files with the same performance? On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara < tushar_adesh...@persistent.com> wrote: > You can also try coalesce as it

Re: Database insert happening two times

2017-10-17 Thread Harsh Choudhary
Hi @Marco, the multiple rows written are not dupes, as the current timestamp field is different in each of them. @Ayan I checked and found that my whole code is rerun twice. Although there seems to be no error, could the cluster manager be configured to re-run it? On Tue, Oct 17, 2017 at 6:45 PM, ayan

Re: Database insert happening two times

2017-10-17 Thread ayan guha
It should not be parallel exec, as the logging code is called in the driver. Have you checked if your driver was rerun by the cluster manager due to some failure or error situation? On Tue, Oct 17, 2017 at 11:52 PM, Marco Mistroni wrote: > Hi > Uh, if the problem is really with

Re: Database insert happening two times

2017-10-17 Thread Marco Mistroni
Hi. Uh, if the problem is really with parallel exec you can try to call repartition(1) before you save. Alternatively try to store the data in a csv file and see if you have the same behaviour, to exclude dynamodb issues. Also, are the multiple rows being written dupes (do they all have the same fields/values)? Hth On
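
A tiny sketch of the first suggestion, with placeholder names: forcing a single partition before the save means only one task writes, which helps separate a parallel-write explanation from the whole job simply running twice.

```
// Illustrative; `resultDF` and the output path are placeholders.
resultDF
  .repartition(1)             // one partition => a single writer task
  .write
  .mode("append")
  .parquet("/data/output")

// If the Dynamodb log entry written after this still shows up twice,
// the duplication is not coming from parallel writers.
```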

Re: Database insert happening two times

2017-10-17 Thread Harsh Choudhary
This is the code - hdfs_path= if(hdfs_path.contains(".avro")){ data_df = spark.read.format("com.databricks.spark.avro").load(hdfs_path) }else if(hdfs_path.contains(".tsv")){ data_df = spark.read.option("delimiter","\t").option("header","true").csv(hdfs_path) }else

Re: Database insert happening two times

2017-10-17 Thread ayan guha
Can you share your code? On Tue, 17 Oct 2017 at 10:22 pm, Harsh Choudhary wrote: > Hi > > I'm running a Spark job in which I am appending new data into Parquet > file. At last, I make a log entry in my Dynamodb table stating the number > of records appended, time etc.

Database insert happening two times

2017-10-17 Thread Harsh Choudhary
Hi, I'm running a Spark job in which I am appending new data into a Parquet file. At last, I make a log entry in my Dynamodb table stating the number of records appended, time etc. Instead of one single entry in the database, multiple entries are being made to it. Is it because of parallel execution

java.io.NotSerializableException about SparkStreaming

2017-10-17 Thread Shengshan Zhang
Hello guys! java.io.NotSerializableException troubles me a lot when I process data with Spark. ``` val hbase_conf = HBaseConfiguration.create() hbase_conf.set("hbase.zookeeper.property.clientPort", "2181") hbase_conf.set("hbase.zookeeper.quorum", "hadoop-zk0.s.qima-inc.com