Change this:

unionDS.repartition(numPartitions);
unionDS.createOrReplaceTempView(...

to:

unionDS.repartition(numPartitions).createOrReplaceTempView(...

repartition returns a new Dataset rather than modifying the one it is called on, so the temp view has to be created on the Dataset that repartition returns.
On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed wrote:
> val unionDS = rawDS.union(processedDS)
>
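For illustration, a minimal sketch of the difference, assuming unionDS and numPartitions are defined as in the thread; the view name is hypothetical:

```
// Wrong: repartition returns a new Dataset; the result is discarded here,
// so the view below is registered on the original, un-repartitioned data
unionDS.repartition(numPartitions)
unionDS.createOrReplaceTempView("union_view") // "union_view" is a hypothetical name

// Right: chain the calls so the view is created on the repartitioned Dataset
unionDS.repartition(numPartitions).createOrReplaceTempView("union_view")
```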
What do you want to achieve? I.e., are you planning to do some aggregation on a group by of c1, c2?
On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad wrote:
> Hi,
>
> I have a set of rows that are the result of a
> groupBy(col1,col2,col3).count().
>
> Is it possible to map rows belonging to
I don't think so; check out the documentation for this method. Setting rdd = null only drops your reference to the RDD, whereas unpersist() actually removes the cached blocks.
On Wed, Oct 18, 2017 at 10:11 AM, Mina Aslani wrote:
> I have not tried rdd.unpersist(), I thought using rdd = null is the same,
> is it not?
>
> On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad
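A minimal sketch of the difference (the RDD contents here are illustrative, and `spark` is assumed to be an existing SparkSession as in the shell):

```
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count() // an action materializes the cached blocks

// unpersist() asks Spark to drop the cached blocks on the executors
rdd.unpersist()

// By contrast, `rdd = null` (with a var) only drops the local reference;
// the cached blocks stay on the executors until the RDD object is garbage
// collected and Spark's ContextCleaner eventually removes them.
```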
Hi,

I have a set of rows that are the result of a groupBy(col1,col2,col3).count().

Is it possible to map the rows belonging to each unique combination inside an iterator?

e.g.

col1  col2  col3
a     1     a1
a     1     a2
b     2     b1
b     2     b2

How can I separate the rows with col1 and col2 =
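One possible approach, as a sketch rather than a definitive answer: instead of working on the groupBy(...).count() result, use groupByKey and mapGroups on the original Dataset, so each unique (col1, col2) combination arrives as an iterator. Column names and sample data are taken from the example above; `spark` is assumed to be an existing SparkSession:

```
import spark.implicits._

// Sample data matching the example above
val ds = Seq(("a", 1, "a1"), ("a", 1, "a2"), ("b", 2, "b1"), ("b", 2, "b2"))
  .toDS()

// Each unique (col1, col2) key is handed to the function together with an
// Iterator over all rows belonging to that combination
val perGroup = ds
  .groupByKey { case (c1, c2, _) => (c1, c2) }
  .mapGroups { (key, rows) =>
    (key._1, key._2, rows.map(_._3).mkString(","))
  }

perGroup.show() // one output row per unique (col1, col2) combination
```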
I have not tried rdd.unpersist(), I thought using rdd = null is the same,
is it not?
On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad wrote:
> did you try calling rdd.unpersist()
>
> On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani wrote:
>
>> Hi,
>>
>> I
did you try calling rdd.unpersist()
On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani wrote:
> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on
Process the data in micro-batches.
On 18-Oct-2017 10:36 AM, "Chetan Khatri" wrote:
> Your hard drive doesn't have much space
> On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote:
>
>> Hi,
>>
>> I get "No space left on device" error in my spark worker:
>>
>>
Your hard drive doesn't have much space
On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote:
> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
Hi,

I get a "No space left on device" error in my Spark worker:

Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
java.io.IOException: No space left on device

In my Spark cluster, I have one worker and one master. My program consumes a stream of data from Kafka and publishes
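If the disk that is filling up holds the standalone worker's work directory (which the /usr/spark-2.2.0/work/... path suggests), one thing worth checking is the worker's cleanup settings. A sketch for conf/spark-env.sh on the worker, with illustrative values:

```
# Enable periodic cleanup of finished applications' work directories
# (interval and TTL are in seconds; the values here are illustrative)
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"
```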
Hi Shengshan,

In the first code, ‘newAPIJobConfiguration’ is shared across all RDDs, so it must be serializable.

In the second code, each RDD creates a new ‘mytest_config’ object and its own ‘newAPIJobConfiguration’ instead of sharing the same object, so it can be non-serializable.

If it’s
val unionDS = rawDS.union(processedDS)
//unionDS.persist(StorageLevel.MEMORY_AND_DISK)
val unionedDS = unionDS.dropDuplicates()
//val unionedPartitionedDS = unionedDS
//  .repartition(unionedDS("year"), unionedDS("month"), unionedDS("day"))
//  .persist(StorageLevel.MEMORY_AND_DISK)
I'm curious about the inner technical workings of SparklyR. Let's say you have:

titanic_train = spark_read_csv(sc, name = "titanic_train",
  path = "../Data/titanic_train.csv", header = TRUE, delimiter = ",",
  quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL,
  repartition = 0, memory =
Hi,

I have two parquet files with different schemas. Based on a unique key, I have to fetch one column value and append it to all rows of the parquet file.

I tried a join, but I guess due to the differing schemas it's not working. I can use withColumn, but can we get the single value of a column and assign it to a
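One hedged way around the join, assuming the value really is a single scalar in the second file: pull it to the driver and attach it with lit(). Paths and column names below are illustrative:

```
import org.apache.spark.sql.functions.lit

val target = spark.read.parquet("/path/to/first.parquet")  // illustrative path
val lookup = spark.read.parquet("/path/to/second.parquet") // illustrative path

// Fetch the single value of the column from the second file...
val value = lookup.select("some_col").first().get(0)

// ...and append it as a constant column to every row of the first file
val result = target.withColumn("some_col", lit(value))
```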
Can you share some code?
On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed wrote:
> In my case I am just writing the data frame back to Hive, so when is the
> best time to repartition it? I did repartition before calling insert
> overwrite on the table
>
> On Tue, Oct 17,
With AWS having Glue and GCE having Dataprep, is Databricks coming out with
an equivalent or better? I know that Serverless is a new offering, but will
it go farther with automatic data schema discovery, profiling, metadata
storage, change triggering, joining, transform suggestions, etc.?
Just
In my case I am just writing the data frame back to Hive, so when is the best time to repartition it? I did repartition before calling insert overwrite on the table.
On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu wrote:

> You have to repartition/coalesce *after* the action
You have to repartition/coalesce *after* the action that is causing the shuffle, as that one will take the value you've set.

On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
> Yes, I still see a larger number of part files, exactly the number I have
> defined
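A minimal sketch of that ordering (df, the column names, and the table name are illustrative): the shuffle introduced by the aggregation runs with spark.sql.shuffle.partitions tasks, while the coalesce afterwards controls the number of files written:

```
val aggregated = df            // df: the DataFrame being written (illustrative)
  .groupBy("c1", "c2")
  .count()                     // this shuffle uses spark.sql.shuffle.partitions

aggregated
  .coalesce(10)                // applied after the shuffle; controls output file count
  .write
  .mode("overwrite")
  .insertInto("target_table")  // hypothetical Hive table
```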
Yes, I still see a larger number of part files, exactly the number I have defined for spark.sql.shuffle.partitions.
> On Oct 17, 2017, at 2:32 PM, Michael Artz wrote:
>
> Have you tried caching it and using a coalesce?
>
>> On Oct 17, 2017 1:47
Have you tried caching it and using a coalesce?
On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" wrote:
> I tried repartition, but spark.sql.shuffle.partitions is taking
> precedence over repartition or coalesce. How do I get a smaller number of
> files with the same
I tried repartition, but spark.sql.shuffle.partitions is taking precedence over repartition or coalesce. How do I get a smaller number of files with the same performance?
On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
> You can also try coalesce as it
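For completeness: if the shuffle itself should produce fewer partitions, spark.sql.shuffle.partitions can also be lowered for the session, at the cost of less parallelism during the shuffle (the value here is illustrative):

```
spark.conf.set("spark.sql.shuffle.partitions", "50")
```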
Hi,

@Marco, the multiple rows written are not dupes, as the current-timestamp field is different in each of them.

@Ayan, I checked and found that my whole code is rerun twice. Although there seems to be no error, could the cluster manager be configured to re-run it?
On Tue, Oct 17, 2017 at 6:45 PM, ayan
It should not be parallel execution, as the logging code is called in the driver. Have you checked whether your driver is rerun by the cluster manager due to some failure or error situation?
On Tue, Oct 17, 2017 at 11:52 PM, Marco Mistroni wrote:
> Hi
> Uh if the problem is really with
Hi,

Uh, if the problem is really with parallel execution, you can try to call repartition(1) before you save.

Alternatively, try to store the data in a CSV file and see if you have the same behaviour, to exclude DynamoDB issues.

Also, are the multiple rows being written dupes (do they all have the same fields/values)?

Hth
On
This is the code -

hdfs_path=
if (hdfs_path.contains(".avro")) {
  data_df = spark.read.format("com.databricks.spark.avro").load(hdfs_path)
} else if (hdfs_path.contains(".tsv")) {
  data_df = spark.read.option("delimiter", "\t").option("header", "true").csv(hdfs_path)
} else
Can you share your code?
On Tue, 17 Oct 2017 at 10:22 pm, Harsh Choudhary wrote:
> Hi
>
> I'm running a Spark job in which I am appending new data into a Parquet
> file. At the end, I make a log entry in my DynamoDB table stating the number
> of records appended, the time, etc.
Hi,

I'm running a Spark job in which I am appending new data into a Parquet file. At the end, I make a log entry in my DynamoDB table stating the number of records appended, the time, etc. Instead of one single entry in the database, multiple entries are being made to it. Is it because of parallel execution
Hello guys!

java.io.NotSerializableException troubles me a lot when I process data with Spark.
```
val hbase_conf = HBaseConfiguration.create()
hbase_conf.set("hbase.zookeeper.property.clientPort", "2181")
hbase_conf.set("hbase.zookeeper.quorum", "hadoop-zk0.s.qima-inc.com