Hi,
I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2.
I am submitting my Spark Scala script through a shell script using an Oozie
workflow.
I am submitting the job as the hdfs user, but it is running as user = "yarn", so all
the output gets stored under the user/yarn directory on
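One possible direction (an assumption on my part, not something stated in this
thread): on a non-secure cluster the HDFS client honours the HADOOP_USER_NAME
environment variable, so exporting it in the Oozie shell action before
spark-submit may make the output land under /user/hdfs instead of /user/yarn.
A rough, untested sketch with placeholder class and jar names:
# inside the script launched by the Oozie shell action
export HADOOP_USER_NAME=hdfs
spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar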
Hi,
I have a Spark (Spark 1.5.2) application that streams data from Kafka to
HDFS. My application contains two Typesafe config files to configure
certain things like Kafka topic etc.
Now I want to run my application with spark-submit (cluster mode) in a
cluster.
The jar file with all dependencies
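A commonly used pattern for this (a sketch, not the poster's actual command;
the file, class and jar names are placeholders) is to ship the Typesafe config
file with --files and point both JVMs at it via -Dconfig.file:
spark-submit --master yarn-cluster \
  --files application.conf \
  --conf "spark.driver.extraJavaOptions=-Dconfig.file=application.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
  --class com.example.StreamingJob my-app-assembly.jar
In cluster mode --files copies the file into each container's working
directory, which is why the plain relative name works in -Dconfig.file.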
My code looks like
>
>> import org.apache.spark._
>> import org.apache.spark.sql._
>> val hadoopConf = new org.apache.hadoop.conf.Configuration()
>> val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new
>>   java.net.URI("hdfs://xxx.xx.xx.xxx:8020"), hadoopConf)
> hdfsConn.listStatus(new
>   org.apache.hadoop.fs.Path("/TestDivya/Spark/ParentDir/")).foreach{ fileStatus =>
>     val filePathName = fileStatus.getPath().toString()
>     val fileName = fileStatus.getPath().getName()
Hi,
HDFS is append-only, so to modify data you need to read it and write it back
out to another location.
On Wed, Feb 17, 2016 at 2:45 AM, SRK <swethakasire...@gmail.com> wrote:
> Hi,
>
> How do I update data saved as Parquet in hdfs using dataframes? If I use
> SaveMode.Append, it just seems
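A minimal sketch of that read-modify-rewrite approach (the paths and column are
made up for illustration, and sqlContext is assumed to be a SQLContext):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
val df = sqlContext.read.parquet("/data/events")            // existing data
val updated = df.withColumn("status", lit("processed"))     // the "update"
updated.write.mode(SaveMode.Overwrite).parquet("/data/events_new")
// then swap /data/events_new into place, e.g. with FileSystem.rename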
Hi,
Thanks for the question.
1) The core-site.xml holds the parameter for the defaultFS:
fs.defaultFS
hdfs://:8020
This will be appended to your value in spark.eventLog.dir. So depending on
which location you intend to write it to, you can point it to either HDFS or
local.
As far
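For illustration, a minimal spark-defaults.conf sketch (the namenode host and
paths are placeholders) that makes the choice explicit instead of relying on
fs.defaultFS:
spark.eventLog.enabled true
# write the event log to HDFS:
spark.eventLog.dir hdfs://namenode:8020/spark-logs
# or write it to the local filesystem instead:
# spark.eventLog.dir file:///var/log/spark-events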
Hi,
We have a requirement wherein we need to store the documents in hdfs. The
documents are nothing but Json Strings. We should be able to query them by
Id using Spark SQL/Hive Context as and when needed. What would be the
correct approach to do this?
Thanks!
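One simple approach, sketched here with made-up paths and field names (assuming
one JSON document per line and an id field): let Spark SQL infer the schema from
the JSON and query the registered table by id.
val docs = sqlContext.read.json("/data/documents")   // one JSON object per line
docs.registerTempTable("documents")
val doc = sqlContext.sql("SELECT * FROM documents WHERE id = '42'")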
> Date: Thu, 11 Feb 2016 17:29:00 -0800
> Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS
> ClassNotFoundException
> From: yuzhih...@gmail.com
> To: charliewri...@live.ca
> CC: d...@spark.apache.org
>
> Hdfs class is in ha
>
> I am using the 1.6.0 release.
>
>
> Charles.
>
> --
> Date: Thu, 11 Feb 2016 17:41:54 -0800
> Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS
> ClassNotFoundException
> From: yuzhih...@gmail.com
> To: char
Hi,
How to do a lookup by id from a set of records stored in hdfs from inside a
transformation/action of an RDD.
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-a-look-up-by-id-from-files-in-hdfs-inside-a-transformation-action-ina
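A sketch of one common pattern (paths and file layout are made up for
illustration): load the lookup data once on the driver, broadcast it, and
reference the broadcast value inside the transformation.
// build a small lookup map on the driver and broadcast it
val lookup = sc.textFile("/data/lookup")
  .map(_.split(","))
  .map(a => (a(0), a(1)))
  .collectAsMap()
val lookupBc = sc.broadcast(lookup)
// use the broadcast inside a transformation on another RDD
val ids = sc.textFile("/data/ids")
val enriched = ids.map(id => (id, lookupBc.value.getOrElse(id, "unknown")))
This only works when the lookup data fits in memory; otherwise a join is the
usual alternative.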
Is there any way to get data from HDFS (e.g. with sc.textFile) with two
separate usernames in the same Spark job? For instance, if I have a file on
hdfs-server-1.com and the alice user has permission to view it, and I have a
file on hdfs-server-2.com and the bob user has permission to view
I configured HDFS to cache file in HDFS's cache, like following:
hdfs cacheadmin -addPool hibench
hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench
But I didn't see much performance impact, no matter how I configure
dfs.datanode.max.locked.memory.
Is it possible
Have you read this thread ?
http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+
Cheers
On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou <jacqueline...@gmail.com> wrote:
> I configured HDFS to cache file in HDFS's cache, like
Please see also:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
According to Chris Nauroth, an hdfs committer, it's extremely difficult to
use the feature correctly.
The feature also brings operational complexity. Since off-heap memory
Hi all,
I have calculated a covariance; it's a Matrix type. Now I want to save
the result to HDFS. How can I do it?
thx
A Matrix can be saved as a column of type MatrixUDT.
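A rough Scala sketch of that (it assumes the mllib Matrix UDT is available in
your Spark version; the output path is a placeholder): wrap the Matrix in a
one-row DataFrame and write it to HDFS as Parquet.
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val cov: Matrix = new RowMatrix(rows).computeCovariance()
val df = sqlContext.createDataFrame(Seq(Tuple1(cov))).toDF("cov")
df.write.parquet("hdfs:///tmp/covariance")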
Hi Yanbo,
I'm using the Java language and the environment is Spark 1.4.1.
Can you tell me how to do it in more detail? The following is my code. How can
I save the cov to an HDFS file?
"
RowMatrix mat = new RowMatrix(rows.rdd());
Matrix cov = mat.computeCovariance();
Is 'hadoop' / 'hdfs' command accessible to your python script ?
If so, you can call 'hdfs dfs -ls' from python.
Cheers
On Sat, Jan 23, 2016 at 4:08 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:
> Hello,
>
> I would like to make a list of files (parquet or json
Hi
I have data in HDFS partitioned by a logical key and would like to preserve
the partitioning when creating a dataframe for the same. Is it possible to
create a dataframe that preserves partitioning from HDFS or the underlying
RDD?
Regards
Deenar
For DataFrame operations which shuffle, you will end
up implicitly re-partitioning to spark.sql.shuffle.partitions (default 200).
Simon
> On 13 Jan 2016, at 10:09, Deenar Toraskar <deenar.toras...@gmail.com> wrote:
>
> Hi
>
> I have data in HDFS partitioned by a logic
Hi,
Is there a way to read a text file from inside a Spark executor? I need to
do this for a streaming application where we need to read a file (whose
contents would change) from a closure.
I cannot use the "sc.textFile" method since the Spark context is not
serializable. I also cannot read a file
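A hedged sketch of one workaround (the path is illustrative): use the Hadoop
FileSystem API directly inside the closure, so the connection is created on the
executor at call time rather than serialized from the driver.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source
val rdd = sc.parallelize(1 to 100)
val result = rdd.mapPartitions { iter =>
  // created here, on the executor, each time the partition is processed
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path("/config/rules.txt"))
  val rules = Source.fromInputStream(in).getLines().toList
  in.close()
  iter.map(x => (x, rules.size))   // use the freshly read contents here
}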
Hi Ewan,
Thank you for your answer.
I have already tried what you suggest.
If I use:
"hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC"
I get the AssertionError exception:
Exception in thread "main" java.lang.AssertionError: assertion
failed: No pre
File coreSite = new File("/etc/hadoop/conf/core-site.xml");
File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
Configuration hConf = sc.hadoopConfiguration();
hConf.addResource(new Path(coreSite.getAbsolutePath()));
hConf.addResource(new Path(hdfsSite.getAbsolutePath()));
all function
correct:
# val sqlCont = new org.apache.spark.sql.SQLContext(sc)
# val reader = sqlCont.read
# val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
# dataFrame.registerTempTable("BICC")
# val recSet = sqlCont.sql("SELECT
# protocolCode,beginTime,endTime,called,calling F
You will need to use the HDFS API to do that.
Try something like:
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
fs.rename(new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"), new
org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt"))
For some file on HDFS, it is necessary to copy/move it to another specific
HDFS directory, with the directory name kept unchanged. I need to do this in a
Spark program, not with HDFS commands. Is there any code for this? It does not
seem to be covered in the Spark docs ...
Thanks in advance!
My guess is no, unless you are okay with reading the data and writing it back
again.
On Tue, Jan 5, 2016 at 2:07 PM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid>
wrote:
>
> For some file on hdfs, it is necessary to copy/move it to some another
> specific hdfs directory, and the directory
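For a copy (rather than the rename/move shown elsewhere in this thread), a
driver-side sketch with FileUtil.copy; the paths are placeholders:
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
FileUtil.copy(fs, new Path("/src/dir/file.txt"),
              fs, new Path("/dest/dir/file.txt"),
              false, conf)   // false = do not delete the source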
from table where id = ")
> //filtered data frame
> df.count
>
> On Sat, Jan 2, 2016 at 11:56 AM, SRK <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How to load partial data from hdfs using Spark SQL? Suppose I want to load
>> data based on a filter like
Hi,
How to load partial data from hdfs using Spark SQL? Suppose I want to load
data based on a filter like
"Select * from table where id = " using Spark SQL with DataFrames,
how can that be done? The
idea here is that I do not want to load the whole data into memory when I
use the
OK, so what's wrong with using:
var df=HiveContext.sql("Select * from table where id = ")
//filtered data frame
df.count
On Sat, Jan 2, 2016 at 11:56 AM, SRK <swethakasire...@gmail.com> wrote:
> Hi,
>
> How to load partial data from hdfs using Spark SQL? Suppose I
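If the data is stored as Parquet, a DataFrame filter is a simple way to get
this; Spark can push the predicate down so only matching row groups need to be
read. A sketch with a made-up path and id value:
val df = sqlContext.read.parquet("/data/table")
val filtered = df.filter(df("id") === 12345)
filtered.count()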
nsive a set of applications are. Closest thing I have seen is
> the HDFS DataNode Logs in YARN but they don't seem to have Spark
> applications specific reads and writes.
>
> 2015-12-21 18:29:15,347 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /127.0.0.1:53
Hello,
Spark collects HDFS read/write metrics per application/job; see details at
http://spark.apache.org/docs/latest/monitoring.html.
I have connected the Spark metrics to Graphite and then display nice graphs
in Grafana.
BR,
Arek
On Thu, Dec 31, 2015 at 2:00 PM, Steve Loughran <
Hello:
Is there any way of monitoring the number of bytes or blocks read and written
by a Spark application? I'm running Spark with YARN and I want to measure
how I/O-intensive a set of applications are. The closest thing I have seen is
the HDFS DataNode logs in YARN but they don't seem to have
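For reference, a sketch of the conf/metrics.properties entries for the Graphite
sink mentioned above (the host and port are placeholders):
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds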
ow to use sparkR or spark MLlib load csv file on hdfs
> thencalculate covariance
>
>
>
>
>
> Now i have huge columns about 5k -20k, so if i want to Calculate
> covariance matrix ,which is the best method or common method ?
>
>
>
> -- Original Message --
hi all,
I want to use SparkR or Spark MLlib to load a csv file on HDFS and then
calculate covariance. How do I do it?
thks.
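A Scala/MLlib sketch of the same thing (the path and parsing are illustrative):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows = sc.textFile("hdfs:///data/matrix.csv")
  .map(_.split(",").map(_.toDouble))
  .map(arr => Vectors.dense(arr))
val mat = new RowMatrix(rows)
val cov = mat.computeCovariance()   // a local d x d Matrix on the driver
Note that with 5k-20k columns the result is a 5k x 5k to 20k x 20k local matrix
(a 20k x 20k matrix of doubles is roughly 3.2 GB), so it must fit in driver
memory.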
> hi all,
> I want to use sparkR or spark MLlib load csv file on hdfs then
> calculate covariance, how to do it .
> thks.
>
"user @spark" <user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then
calculate covariance
> Load csv file:
> df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv",
> header = "true")
ber 28, 2015 10:24 AM
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then
calculate covariance
To: zhangjp <592426...@qq.com>, Yanbo Liang <yblia...@gmail.com>
Cc: user <user@spark.apache.org>
Hi Yanbo
I use spark.csv to load my data se
To: "Andy Davidson"<a...@santacruzintegration.com>;
"zhangjp"<592426...@qq.com>; "Yanbo Liang"<yblia...@gmail.com>;
Cc: "user"<user@spark.apache.org>;
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs thencal
Subject: how to use sparkR or spark MLlib load csv file on hdfs then
calculate covariance
Now I have a huge number of columns, about 5k-20k, so if I want to calculate
the covariance matrix, which is the best method or common method?
-- Original Message --
From: "Felix Cheung";
"Batman")).toDF("year","title")
df.write.partitionBy("year").avro("/tmp/data")
val df2 = Seq((2013, "Batman")).toDF("year","title")
df2.write.partitionBy("year").avro("/tmp/data")
As you can see,
e.partitionBy("year").avro("/tmp/data")
>>
>> val df2 = Seq((2013, "Batman")).toDF("year","title")
>>
>> df2.write.partitionBy("year").avro("/tmp/data")
>>
>>
>> As yo
Hi,
I'm stuck with writing partitioned data to hdfs. The example below ends up with
an 'already exists' error.
I'm wondering how to handle streaming use case.
What is the intended way to write streaming data to hdfs? What am I missing?
cheers,
-jan
import com.databricks.spark.avro._
import
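One hedged workaround (not necessarily what the list recommended): give every
micro-batch its own target directory, e.g. keyed by the batch time, so no batch
ever writes into an existing path. 'stream' (a DStream[Row]) and 'schema' are
placeholders here.
import com.databricks.spark.avro._
stream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    val df = sqlContext.createDataFrame(rdd, schema)
    df.write.partitionBy("year").avro(s"/tmp/data/batch=${time.milliseconds}")
  }
}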
mobile, excuse brevity.
On Dec 22, 2015 2:31 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:
> Hi,
> I'm stuck with writing partitioned data to hdfs. Example below ends up
> with 'already exists' -error.
>
> I'm wondering how to handle streaming use case.
>
ng example where each batch would
create a new distinct directory.
Granularity has no impact. No matter how data is partitioned, second 'batch'
always fails with existing base dir.
scala> df2.write.partitionBy("year").avro("/tmp/data")
org.apache.spark.sql.AnalysisException:
e a new distinct directory.
>
> Granularity has no impact. No matter how data is partitioned, second
> 'batch' always fails with existing base dir.
>
> scala> df2.write.partitionBy("year").avro("/tmp/data")
> or
")
>
> df.write.partitionBy("year").avro("/tmp/data")
>
> val df2 = Seq((2013, "Batman")).toDF("year","title")
>
> df2.write.partitionBy("year").avro("/tmp/data")
>
>
> As you can see, it complains abou
"/tmp/data")
val df2 = Seq((2013, "Batman")).toDF("year","title")
df2.write.partitionBy("year").avro("/tmp/data")
As you can see, it complains about the target directory (/tmp/data) and not
about the partitioni
com> wrote:
> hi Folks
>
> I am using standalone cluster of 50 servers on aws. i loaded data on hdfs,
> why i am getting Locality Level as ANY for data on hdfs, i have 900+
> partitions.
>
>
> --
> with Regards
> Shahid Ashraf
>
hi Folks
I am using a standalone cluster of 50 servers on AWS. I loaded data on HDFS;
why am I getting Locality Level as ANY for data on HDFS? I have 900+
partitions.
--
with Regards
Shahid Ashraf
Hi Prateek,
You mean writing Spark output to any storage system? Yes, you can.
Thanks
Sri
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637p25651.html
Sent from the Apache
Hi!
I configured log4j.properties file in conf folder of spark with following
values...
log4j.appender.file.File=hdfs://
I expected all log files to log output to the file in HDFS.
Instead files are created locally.
Has anybody tried logging to HDFS by configuring log4j.properties?
Warm
This would require a special HDFS log4j appender. Alternatively try the flume
log4j appender
> On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote:
>
> Hi!
> I configured log4j.properties file in conf folder of spark with following
> values...
>
> lo
at 6:26 AM
To: Andrew Davidson <a...@santacruzintegration.com>
Subject: epoch date format to normal date format while loading the files to
HDFS
> Hi Andy,
>
> How are you? i need your help again.
>
> I have written a spark streaming program in Java to access twitter tweets a
Can you clarify your use case ?
Apart from hdfs, S3 (and possibly others) can be used.
Cheers
On Tue, Dec 8, 2015 at 9:40 AM, prateek arora <prateek.arora...@gmail.com>
wrote:
> Hi
>
> Is it possible into spark to write only RDD transformation into hdfs or any
> ot
Hi
Is it possible in Spark to write only an RDD transformation to HDFS or any
other storage system?
Regards
Prateek
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637.html
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-2-getting-stuck-when-reading-from-HDFS-in-YARN-client-mode-tp25527p25589.html
Sent from the Apache Spark User List mailing list archive at
/hive on HDFS should be writable
error
Hi,
Actually I went back to 1.3, 1.3.1 and 1.4 and built spark from source code
with no luck.
So I am not sure if any good result is going to come from back tracking?
Cheers,
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running
I actually don't have the folder /tmp/hive created in my master node, is that a
problem?
From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:40 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable
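A possible fix, assuming the usual cause of this error (the scratch directory
on HDFS is missing or has restrictive permissions):
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 777 /tmp/hive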
HDFS has a default replication factor of 3
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471p25497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
0 actually gives smaller parts.
I'd like to be able to specify the size of the parts- directly rather than
guess and check what coalesce value to use.
Why I care: my data is ~3Tb in Parquet form, with about 16 thousand files of
around 200MB each. Transferring this from HDFS on EC2 to S3 based on th
Thanks, the issue was indeed the dfs replication factor. To fix it without
entirely clearing out HDFS and rebooting, I first ran
hdfs dfs -setrep -R -w 1 /
to reduce all the current files' replication factor to 1 recursively from
the root, then I changed the dfs.replication factor in
ephemeral
> entirely clearing out HDFS and rebooting, I first ran
> hdfs dfs -setrep -R -w 1 /
> to reduce all the current files' replication factor to 1 recursively from
> the root, then I changed the dfs.replication factor in
> ephemeral-hdfs/conf/hdfs-site.xml and ran ephemeral-hdfs/sbin/stop-a
and be able to set the output filenames.
that's how everything else does it: relies on rename() being atomic and O(1) on
HDFS. Just create the temp dir with the same parent dir as the destination, so
in encrypted HDFS they are both in the same encryption zone.
And know that renames in S3n/s3a
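A small Scala sketch of that write-to-temp-then-rename pattern (the paths are
placeholders and 'df' is assumed to be the DataFrame being written):
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
val tmp = new Path("/data/output/_tmp_batch42")
val dst = new Path("/data/output/batch42")
df.write.parquet(tmp.toString)   // write the new data to the temp dir
fs.rename(tmp, dst)              // rename is atomic and O(1) on HDFS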
I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 Tb storage, using
distcp. The dataset is a collection of tar files of about 1.7 Tb each.
Nothing else was stored in the HDFS, but after completing the download, the
namenode page says that 11.59 Tb are in use
what is your hdfs replication set to?
On Wed, Nov 25, 2015 at 1:31 AM, AlexG <swift...@gmail.com> wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
> cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of
Hi AlexG:
Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 =
11.4 TB.
--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched sp
to process log
>> files from S3 and save them on hadoop to later analyze them with
>> sparkSQL.
>> Everything works well except when I kill the spark application and
>> restart it: it picks up from the latest processed batch and reprocesses
>> it which results in dupl
sands of empty files being created on
HDFS?
>
> Hi Andy
>
> You can try sc.wholeTextFiles() instead of sc.textFile()
>
> Regards
> Sab
>
> On 24-Nov-2015 4:01 am, "Andy Davidson" <a...@santacruzintegration.com> wrote:
>> Hi Xiao and Sabarish
>, "user @spark" <user@spark.apache.org>
Subject: Re: newbie : why are thousands of empty files being created on
HDFS?
> I'm seeing similar slowness in saveAsTextFile(), but only in Python.
>
> I'm sorting data in a dataframe, then transform it and get a RDD, and then
> coale
calls sample(0.01).filter(not null).saveAsTextFile(). This takes
about 35 min to scan 500,000 JSON strings and write 5000 to disk. The total
data written is 38M.
The data is read from HDFS. My understanding is Spark cannot know in
advance how HDFS partitioned the data. Spark knows I have
0,000 JSON strings and write 5000 to disk.
> The total data writing in 38M.
>
> The data is read from HDFS. My understanding is Spark can not know in
> advance how HDFS partitioned the data. Spark knows I have a master and 3
> slaves machines. It knows how many works/executors are
')
The data was originally collected using spark streaming. I noticed that the
number of default partitions == the number of files created on hdfs. I bet
each file is one spark streaming mini-batch. I suspect if I concatenate
these into a small number of files things will run much faster. I suspect
I
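A sketch of that concatenation step with coalesce (the target file count and
paths are made up):
val data = sc.textFile("hdfs:///tweets/raw")       // many small part files
data.coalesce(16).saveAsTextFile("hdfs:///tweets/compacted")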
it which results in duplicate data on hdfs.
How can I make the writing step on hdfs idempotent? I couldn't find any
way to control, for example, the filenames of the parquet files being
written; the idea being to include the batch time so that the same batch
always gets written to the same path.
I've
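A hedged sketch of one way to get idempotent writes (the names are
placeholders): derive the output directory from the batch time and overwrite
it, so re-processing a batch replaces its earlier output instead of
duplicating it.
import org.apache.spark.sql.SaveMode
stream.foreachRDD { (rdd, time) =>
  val df = sqlContext.createDataFrame(rdd, schema)   // 'schema' is assumed
  df.write.mode(SaveMode.Overwrite)
    .parquet(s"/data/output/batch=${time.milliseconds}")
}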
analyze them with
> sparkSQL.
> Everything works well except when I kill the spark application and
> restart it: it picks up from the latest processed batch and reprocesses
> it which results in duplicate data on hdfs.
>
> How can I make the writing step on hdfs idempotent ? I
>
> The data was originally collected using spark stream. I noticed that the
> number of default partitions == the number of files create on hdfs. I bet
> each file is one spark streaming mini-batchI suspect if I concatenate
> these into a small number of files things w
idson" <a...@santacruzintegration.com>
wrote:
> I start working on a very simple ETL pipeline for a POC. It reads a in a
> data set of tweets stored as JSON strings on in HDFS and randomly selects
> 1% of the observations and writes them to HDFS. It seems to run very
> slowly. E.G. To write 4720 observat
Hi All,
If write-ahead logs are enabled in Spark Streaming, does all the received
data get written to the HDFS path, or does it only write the metadata?
How does cleanup work? Does the HDFS path keep getting bigger every day, or
do I need to write a cleanup job to delete data from the write-ahead logs
From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 20 November 2015 21:14
To: u...@hive.apache.org
Subject: starting spark-shell throws /tmp/hive on HDFS should be writable
error
Hi,
Has this been resolved? I don't think this has anything to do with the /tmp/hive
directory permissions
expect,
i.e. in valid JSON format. The key names are double-quoted. Boolean values
are the words true or false in lower case.
When I run in my cluster the only difference is I call
data.saveAsTextFiles() using an hdfs: URI instead of using file:/// . When
the files are written to HDFS the JSON
Hi
I'm looking for some benchmarks on joining data frames where most of the
data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is
still in RDBMS. I am only looking at the very first join before any caching
happens, and I assume there will be loss of par
I have verified that this error exists on my system as well, and the suggested
workaround also works.
Spark version: 1.5.1; 1.5.2
Mesos version: 0.21.1
CDH version: 4.7
I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the
correct place, and I have also linked in the hdfs
Cool, thanks. I have a CDH 5.4.8 (Cloudera Starving Developers Version) cluster
with 1 NN and 4 DN, and Spark is running but it's 1.3.x. I want to leverage this
HDFS/Hive cluster for SparkR because we do all data munging here and produce
datasets for ML.
I am thinking of the following idea:
1. Add 2 datanodes
make sure
"/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv" is
accessible on your slave node.
--
Ali
On Nov 9, 2015, at 6:06 PM, Sanjay Subramanian
wrote:
> hey guys
>
> I have a 2 node SparkR (1 master 1 slave)cluster on AWS
hey guys
I have a 2 node SparkR (1 master 1 slave)cluster on AWS using
spark-1.5.1-bin-without-hadoop.tgz
Running the SparkR job on the master node
/opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master
spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0
--executor-cores
rquet file from the driver. I could use the HDFS API
> but I am worried that it won't work on a secure cluster. I assume that the
> method the executors use to write to HDFS takes care of managing Hadoop
> security. However, I can't find the place where HDFS write happens in the
> spark source.
I am not looking for Spark Sql specifically. My usecase is that I need to
save an RDD as a parquet file in hdfs at the end of a batch and load it
back and convert it into an RDD in the next batch. The RDD has a String and
a Long as the key/value pairs.
On Wed, Nov 4, 2015 at 11:52 PM, Stefano
How to convert a parquet file that is saved in hdfs to an RDD after reading
the file from hdfs?
On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> wrote:
> Hi,
> we are using avro with compression(snappy). As soon as you have enough
> partitions, the saving won
Hi,
we are using avro with compression (snappy). As soon as you have enough
partitions, the saving won't be a problem imho.
In general hdfs is pretty fast; s3 is less so.
The issue with storing data is that you will lose your partitioner (even
though the rdd has it) at load time. There is a PR
>> *e.g. if u have dataframe and working from java - toJavaRDD
>> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>*
>> ()
>>
>> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com>
>> wr
gmail.com>
wrote:
> How to convert a parquet file that is saved in hdfs to an RDD after
> reading the file from hdfs?
>
> On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com>
> wrote:
>
>> Hi,
>> we are using avro with compression(snappy). As s
ng from java - toJavaRDD
> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>*
> ()
>
> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
>> How to convert a parquet file that is saved
Hi,
What is the efficient approach to save an RDD as a file in HDFS and retrieve
it back? I was thinking between Avro, Parquet and SequenceFileFormat. We
currently use SequenceFileFormat for one of our use cases.
Any example on how to store and retrieve an RDD in an Avro and Parquet file
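For what it's worth, a sketch of the Parquet round trip for a (String, Long)
pair RDD (the path is a placeholder):
import sqlContext.implicits._
val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L)))
pairs.toDF("key", "value").write.parquet("/data/pairs")
val restored = sqlContext.read.parquet("/data/pairs")
  .map(r => (r.getString(0), r.getLong(1)))          // back to an RDD of pairs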
<swethakasire...@gmail.com> wrote:
> Hi,
>
> What is the efficient approach to save an RDD as a file in HDFS and
> retrieve
> it back? I was thinking between Avro, Parquet and SequenceFileFormart. We
> currently use SequenceFileFormart for one of our use cases.
>
> Any e
Hi,
I'd like to write a parquet file from the driver. I could use the HDFS API
but I am worried that it won't work on a secure cluster. I assume that the
method the executors use to write to HDFS takes care of managing Hadoop
security. However, I can't find the place where HDFS write happens
I am a bit curious: why is the synchronization on finalLock needed?
Thanks
> On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal <anubha...@gmail.com> wrote:
>
> I have a spark job that creates 6 million rows in RDDs. I convert the RDD
> into Data-frame and write it to HDFS. Cu
creates 6 million rows in RDDs. I convert the RDD
> into Data-frame and write it to HDFS. Currently it takes 3 minutes to write
> it to HDFS.
>
> Here is the snippet:-
> RDDList.parallelStream().forEach(mapJavaRDD -> {
> if (mapJavaR