Use ORC, Parquet or Avro as the format, because they support any compression type with parallel processing. Alternatively, split your file into several smaller ones. Another alternative would be bzip2 (but slower in general) or LZO (usually it is not included by default in many distributions).
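A minimal sketch of that conversion, assuming Spark 2.x and hypothetical paths (not from the thread): a single snappy-compressed CSV is not splittable, so it is read as one partition, whereas the rewritten Parquet output is.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// The whole .snappy CSV arrives as one partition because the codec is not
// splittable for plain text input.
val df = spark.read.option("header", "true").csv("hdfs:///data/raw/big_file.csv.snappy")

// Parquet with snappy is splittable, so later scans get one task per row
// group / HDFS block instead of one task for the entire file.
df.write.option("compression", "snappy").parquet("hdfs:///data/parquet/big_table")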
On 21 Nov 2016, at 23:17, yeshwanth kumar <yeshwant...@gmail.com> wrote:
Hi,
we are running Hive on Spark. We have an external table over a snappy-compressed CSV file of size 917.4 MB.
HDFS block size is set to 256 MB.
As per my understanding, if I run a query over that external table it should launch 4 tasks, one for each block, but I am seeing one executor and one task processing the whole file.
Trying to understand the reason behind it.
Hi all,
I have an RDD that I wish to write to HDFS.
data.saveAsTextFile("hdfs://path/vertices")
This returns: WARN RetryInvocationHandler: Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over null. Not retrying because try once and fail.
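One common cause of that warning is the HDFS client failing to resolve a namenode, for example when the hdfs:// URI carries no host and no HDFS configuration is on the classpath. A minimal sketch, assuming placeholder namenode host/port and path:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("save-vertices"))
val data = sc.parallelize(Seq("a", "b", "c"))

// Fully qualified URI (hdfs://<namenode-host>:<port>/<path>); alternatively put
// core-site.xml / hdfs-site.xml on the classpath and use a plain /path/vertices.
data.saveAsTextFile("hdfs://namenode.example.com:8020/path/vertices")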
…get the data from the source as-is and create a file-system format (csv, etc.) as the target; those files can then land on a directory. A cron job can then put those files into HDFS directories, and the rest is our choice of how to treat those files. Hive external tables can be used, plus Hive or Spark to ingest the data.
On Nov 9, 2016, at 4:16 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Thanks guys,
Sounds like we let Informatica get the data out of the RDBMS and create mappings to flat files that will be delivered to a directory visible to the HDFS host. Then push the csv files into HDFS. Then there are a number of options to work on:
1. run cron or oozie to get data out of HDFS
Yes, it can be done and it is standard practice. I would suggest a mixed approach: use Informatica to create files in HDFS and have Hive staging tables as external tables on those directories. From that point onwards use Spark.
Hth
Ayan
On 10 Nov 2016 04:00, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
Hi,
I am exploring the idea of flexibility with importing multiple RDBMS tables that the customer has into HDFS using Informatica.
I don't want to use connectivity tools from Informatica to Hive etc.
So this is what I have in mind:
1. If possible, get the tables' data out using Informatica…
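A minimal sketch of the mixed approach suggested above (table name, columns and paths are assumptions, not from the thread): land the Informatica-produced CSV files in an HDFS directory, expose them as a Hive external staging table, and continue with Spark from that point on.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("staging-example").enableHiveSupport().getOrCreate()

// External table over the landing directory: dropping the table never deletes the files.
spark.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS staging_customers (id INT, name STRING, updated STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs:///landing/customers'""")

// From here on, use Spark for the actual transformation / ingest work.
spark.table("staging_customers")
  .write.mode("overwrite").parquet("hdfs:///warehouse/customers_parquet")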
I am using Spark to read from a database and write to HDFS as a parquet file. Here is a code snippet.
private long etlFunction(SparkSession spark) {
  spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
  Properties properties = new Properties();
  properties.put("user", dbUser);        // dbUser/dbPassword/jdbcUrl/table/path are placeholders
  properties.put("password", dbPassword);
  Dataset<Row> df = spark.read().jdbc(jdbcUrl, "source_table", properties);
  df.write().mode(SaveMode.Overwrite).parquet("hdfs:///output/etl_table");
  return df.count();
}
I am using Spark 1.6.1 and writing to HDFS. In some cases it seems like all the work is being done by one thread. Why is that?
Also, I need parquet.enable.summary-metadata to register the parquet files with Impala.
Df.write().partitionBy("COLUMN").parquet(outputFileLocation);
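A hedged sketch of how those two pieces are often wired up, assuming sc and df are the SparkContext and DataFrame in question and that the column name, partition count and path are placeholders: parquet.enable.summary-metadata is a Hadoop configuration key, and a single busy thread frequently means the DataFrame has only one partition upstream of partitionBy, which a repartition can spread out.

import org.apache.spark.sql.functions.col

// Parquet summary files (_metadata / _common_metadata) are controlled via a Hadoop conf key.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")

// Repartition by the same column before partitionBy so several tasks share the write.
df.repartition(32, col("COLUMN"))
  .write
  .partitionBy("COLUMN")
  .parquet(outputFileLocation)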
I am trying to prototype using a single SqlContext instance and use it to append DataFrames, partitioned by a field, to the same HDFS folder from multiple threads. (Each thread will work with a DataFrame having a different partition column value.)
I get the exception: 16/09/24 16:45:12 ERROR…
On 18 September 2016 at 11:08, Jörn Franke <jornfra...@gmail.com> wrote:
Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs etc. So you are right, it is in this sense very different.
Besides caching, what I see from data scientists is that for interactive queries and model evaluation they do not browse the complete data anyway. Even with a…
…certain refresh of these tables, and let the customer decide. I stand corrected otherwise.
BTW I did these simple tests using Zeppelin (running on Spark Standalone mode):
1) Read data using Spark SQL from Flume text files on HDFS (real time)
2) Read data using Spark SQL from an ORC table
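A rough equivalent of that Zeppelin test, assuming Spark 2.x with Hive support and placeholder paths/table names (spark is the SparkSession):

// 1) query the near-real-time text files landed by Flume on HDFS
val raw = spark.read.text("hdfs:///flume/landing/*")
raw.createOrReplaceTempView("raw_events")
spark.sql("SELECT COUNT(*) FROM raw_events").show()

// 2) query the same data from the periodically refreshed ORC table
spark.sql("SELECT COUNT(*) FROM marketdata_orc").show()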
Alluxio isn't a database though; it's storage. I may still be harping on the wrong solution for you, but as we discussed offline, that's also what Impala, Drill et al are for.
Sorry if this was mentioned before, but Ignite is what GridGain became, if that helps.
…If the data is processed in parallel then IO will be done in parallel, thanks to the architecture of HDFS. Oracle Exadata exploits similar concepts. The advantage of Ignite compared to e.g. Exadata would be that you also have the indexes of ORC and Parquet in-memory, which avoids reading unneeded data into memory.
HTH.
-Todd
On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi,
I am seeing similar issues when I was working on Oracle with Tableau as the dashboard.
Currently I have a batch layer that gets streaming data from
source -> Kafka -> Flume -> HDFS
It is stored on HDFS as text files, and a cron process sinks a Hive table with the external table built…
…copy/rename the file once complete.
(Things are slightly complicated by the fact that HDFS doesn't update modtimes until (a) the file is closed or (b) enough data has been written that the write spans a block boundary. That means that incremental writes to HDFS may appear to work, but once you…
You can listen for files in a specific directory using streamingContext.fileStream.
Take a look at:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke wrote:
Hi,
I recommend that the third party application puts an empty file with the same
filename as the original file, but the extension ".uploaded". This is an
indicator that the file has been fully (!) written to the fs. Otherwise you
risk only reading parts of the file.
Then, you can have a file
Hello,
I am a newbie to Spark and I have the below requirement.
Problem statement: a third-party application is dumping files continuously on a server. Typically the count is 100 files per hour and each file is less than 50 MB in size. My application has to process those files.
Here…
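A minimal sketch of the directory-watching suggestion, assuming a placeholder landing path and a 60-second batch interval: textFileStream only picks up files that appear after the stream starts, so the third party (or a cron job) should move/rename finished files into the directory atomically, as discussed above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("watch-landing-dir"), Seconds(60))
val lines = ssc.textFileStream("hdfs:///landing/incoming")

lines.foreachRDD { rdd =>
  // placeholder processing: just count the lines of each new batch of files
  println(s"received ${rdd.count()} lines")
}

ssc.start()
ssc.awaitTermination()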
On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
1.) What needs to be parallelized is the work for each of those 6M rows, not the 80K files. Let me elaborate this with a simple for loop, if we were to write this serially:
For each line L out of 6M in the first file {
  process the file corresponding to L out of those 80K files.
}
The 80K files are in HDFS, and reading all that content into each worker is not possible due to size.
2. No. Multiple rows may point to the same file, but they operate on different records within the file.
3. The end goal is to write back the 6M processed records.
This is a simple map-only type scenario. One…
Questions:
1. Why can you not read all 80K files together? i.e., why do you have a dependency on the first text file?
2. Your first text file has 6M rows, but the total number of files is ~80K. Is there a scenario where there may not be a file in HDFS corresponding to a row in the first text file?
3. Maybe a follow…
Just wonder if this is possible with Spark?
On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
Hi,
I've got a text file where each line is a record. For each record, I need to process a file in HDFS.
So if I represent these records as an RDD and invoke a map() operation on them, how can I access HDFS within that map()? Do I have to create a Spark context within the map()?
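A sketch of one way to do this, with assumed paths and record format: no extra SparkContext is needed (or possible) on the executors; instead, open the referenced HDFS file from inside mapPartitions with the Hadoop FileSystem API.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val index = sc.textFile("hdfs:///data/index.txt")          // 6M rows, one file reference each

val results = index.mapPartitions { rows =>
  val fs = FileSystem.get(new Configuration())              // one handle per partition, created on the executor
  rows.map { row =>
    val in = fs.open(new Path(s"hdfs:///data/files/$row"))  // assumed: the row holds the file name
    try {
      val content = Source.fromInputStream(in).getLines().mkString("\n")
      s"$row\t${content.length}"                            // placeholder "processing"
    } finally {
      in.close()
    }
  }
}
results.saveAsTextFile("hdfs:///data/output")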
I understand now that I cannot use the Spark Streaming window operation without checkpointing to HDFS, as pointed out by @Ofir, but without window operations I don't think we can do much with Spark Streaming. So, since it is very essential, can I use Cassandra as a distributed storage? If so, can I see…
…"partition with both groups thinking they are in charge", which is more dangerous. Then there's "partitioning event not detected", which may be bad.
So: consider the failure modes, and then consider not so much whether the tech you are using is vulnerable to it, but "if…
We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3.
—Ken
On Aug 25, 2016, at 8:02 PM
…Zookeeper can do many things, but for our use case all we need is high availability, and given the devops people's frustrations here in our company (who have extensive experience managing large clusters), we would be very happy to avoid Zookeeper. I also heard that Mesos can provide high availability through etcd and consul, and if that is true I will be left with the following stack:
Spark + Mesos scheduler + distributed file system, or to be precise I should say distributed storage, since S3 is an object store, so I guess this will be HDFS for us + etcd & consul. Now the big question for me is how do I set all this up.
…will use the map outputs, and will do so with good data locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha <guha.a...@gmail.com> wrote:
> You would lose the ability to process data closest to where it resides if
you do not use hdfs.
This isn't true. Many other data sources (e.g. Cassandra) support locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha <guha.a...@gmail.com> wrote:
At the core of it, map reduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use HDFS.
S3 or NFS will not be able to provide that.
On 26 Aug 2016 07:49, "kant kodali" <kanth...@gmail.com> wrote:
On Thu, Aug 25, 2016 1:35 PM, Ofir Manor <ofir.ma...@equalum.io> wrote:
Just to add one concrete example regarding the HDFS dependency.
Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a cluster without checkpointing to HDFS (or S3).
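A minimal sketch of that dependency, with assumed durations and paths: the inverse-function form of reduceByKeyAndWindow refuses to run unless ssc.checkpoint() points at fault-tolerant storage such as HDFS or S3.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("windowed-count"), Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/windowed-count")        // required for this windowed operation

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))  // placeholder source
val counts = words.map((_, 1L))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))            // 60s window, sliding every 10s

counts.print()
ssc.start()
ssc.awaitTermination()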
Hi Kant,
I trust the following would be of use.
Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.
At the heart of it, and with reference to the points you raised about HDFS, one needs to have a working knowledge of the Hadoop core system including HDFS, the Map-reduce algorithm…
@Mich I understand why I would need Zookeeper. It is there for fault tolerance, given that Spark is a master-slave architecture, and when a master goes down Zookeeper will run a leader election algorithm to elect a new leader. However, DevOps hate Zookeeper; they would be much happier to go with etcd & consul, and it looks like with the Mesos scheduler we should be able to drop Zookeeper.
HDFS I am still trying to understand why I would need for Spark. I understand the purpose of distributed file systems in general, but I don't understand it in the context of Spark, since many people say you can run a…
Spark is a parallel computing framework. There are many ways to give it
data to chomp down on. If you don't know why you would need HDFS, then you
don't need it. Same goes for Zookeeper. Spark works fine without either.
Much of what we read online comes from people with specialized problems
You can use Spark on Oracle as a query tool.
It all depends on the mode of operation.
If you are running Spark in yarn-client/cluster mode then you will need YARN. It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
I have not gone and installed Yarn without installing Hadoop.
What…
What do I lose if I run Spark without using HDFS or Zookeeper? Which of them is almost a must in practice?
From: Sun Rui
Date: 2016-08-24 22:17
To: Saisai Shao
CC: tony@tendcloud.com; user
Subject: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?
Yes, I also tried FUSE before; it is not stable and I don't recommend it.
On Aug 24, 2016, at 22:15, Saisai Shao wrote:
Also fuse is another candidate (https://wiki.apache.org/hadoop/MountableHDFS), but not so stable as I tried before.
On Wed, Aug 24, 2016 at 10:09 PM, Sun Rui <sunrise_...@163.com> wrote:
For HDFS, maybe you can try mounting HDFS as NFS. But I am not sure about the stability, and there is also the additional overhead of network I/O and the replication of HDFS files.
> On Aug 24, 2016, at 21:02, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> Spark Shuffle uses Java File related API
…data, Spark will do a shuffle and the shuffle data will be written to local disk. Because we have limited capacity on local disk, the shuffled data will occupy all of the local disk and the job will then fail. So is there a way we can write the shuffle spill data to HDFS? Or if we introduce Alluxio in our system, can the shuffled data be written to Alluxio?
Thanks and Regards,
阎志涛 (Tony)
北京腾云天下科技有限公司
Email: tony@tendcloud.com
Phone: 13911815695
WeChat: …
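For context, a hedged sketch of the usual mitigation (the mount points are assumptions): shuffle spill goes to the directories listed in spark.local.dir (or yarn.nodemanager.local-dirs on YARN), so pointing that setting at several larger local volumes is the standard way to relieve a small local disk, rather than redirecting spill to HDFS.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-local-dirs")
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")  // assumed mount points; comma-separated list

val sc = new SparkContext(conf)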
Maybe this problem is not so easy to understand, so I attached my full code. Hope this could help in solving the problem.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: <luohui20...@sina.com>
To: "user" <user@spark.apache.org>
Hi there: I have a problem in saving a DF to HDFS as parquet format: it is very slow. I attached a pic which shows that a lot of time is spent in getting the result. The code is:
streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData")
I don't quite understand why my app…
…or nothing inside (the Spark job had failed).
On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
Hello,
the use case is as follows:
say I am inserting 200K rows with dataframe.write.format("parquet") etc. (like a basic write-to-HDFS command), but due to some reason or rhyme my job got killed when the run was in the middle of it, meaning let's say I was only able to insert…
I'm wondering if there is a way to do processing on each of these (the processing in this case is just getting the byte length, but it will be something else in the real world) and then write the contents to separate HDFS files.
If this doesn't make sense, would it make more sense to have all contents in a single HDFS…
This sounds like a bad idea, given that HDFS does not work well with small files.
On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaara...@gmail.com> wrote:
I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name.
Example:
if directory /myimages has one.jpg, two.jpg, and three.jpg, then I want three files: one-success.jpg, two…
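A sketch of the flavour suggested in the reply (in Scala rather than PySpark, with assumed paths): compute the (file, byte count) pairs with binaryFiles but write them to a single output directory instead of one tiny HDFS file per image, avoiding the small-files problem.

val sizes = sc.binaryFiles("hdfs:///myimages")                   // RDD[(path, PortableDataStream)]
  .map { case (path, stream) => s"$path\t${stream.toArray().length}" }

sizes.saveAsTextFile("hdfs:///myimages_sizes")                   // a handful of part files, not thousands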
<a...@santacruzintegration.com>
Date: Wednesday, July 27, 2016 at 4:25 PM
To: "user @spark" <user@spark.apache.org>
Subject: how to copy local files to hdfs quickly?
…vidson <a...@santacruzintegration.com>, Pedro Rodriguez <ski.rodrig...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming
Hi Pedro,
I did some experiments using one of our relatively small data sets. The data set is loaded into 3 or 4 data frames. I then call count().
Looks like using bigger files and reading from HDFS is a good solution for reading data. I guess I'll need to do something similar to this to deal…
Hey,
the very first run:
glossary:
delta_df := current run / execution changes dataframe.
def deduplicate: apply windowing function and group by.
def partitionDataframe(delta_df): get the unique keys of that data frame and then return an array of data frames, each containing just that very same key as the column.
first run:
deduplicate(delta_df) [with all unique primary / deduplicating key columns]
1. partitionDataframe(delta_df): Array[delta_df(i to # partitions)]
2. write the dataframe to the corresponding parent hdfs path + partition dir_
subsequent runs:
for each partition:
0. partitionDataframe(delta_df): Array[delta_df(i to # partitions)]
1. load df from the previous hdfs location of that partition
2. filter the above df(p) where p is the partition…
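A hedged sketch of the partitionDataframe step described above (the key column name is an assumption): collect the distinct partition keys, then return one filtered DataFrame per key, each of which can then be written to its own parent HDFS path + partition dir.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def partitionDataframe(deltaDf: DataFrame, keyCol: String = "partition_key"): Array[DataFrame] = {
  // one DataFrame per distinct key value in the delta
  val keys = deltaDf.select(keyCol).distinct().collect().map(_.get(0))
  keys.map(k => deltaDf.filter(col(keyCol) === k))
}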
I have a Spark streaming app that saves JSON files to s3://. It works fine.
Now I need to calculate some basic summary stats and am running into horrible performance problems.
I want to run a test to see if reading from HDFS instead of S3 makes a difference. I am able to quickly copy the data…
…messages from one topic of Kafka, did some transformation, and wrote the result back to another topic, but only found 988 messages in the second topic. I checked the log info and confirmed all messages were received by the receivers. But I found an HDFS writing timeout message printed from class BatchedWriteAheadLog.
I checked out the source code and found code like this:
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo…