[Error] Run Spark job as hdfs user from Oozie workflow

2016-03-09 Thread Divya Gehlot
Hi, I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2. I am submitting my Spark Scala script through a shell script using an Oozie workflow. I submit the job as the hdfs user, but it runs as user = "yarn", so all the output gets stored under the user/yarn directory on

How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)?

2016-02-22 Thread Johannes Haaß
Hi, I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc. Now I want to run my application with spark-submit (cluster mode) in a cluster. The jar file with all dependencies
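
A hedged sketch of one common workaround for this kind of setup (Spark on YARN; the file names, class name and config key below are illustrative, not from the thread): ship the config with --files so it lands in the container's working directory, then point Typesafe Config at it with -Dconfig.file.

    // spark-submit --master yarn --deploy-mode cluster \
    //   --files hdfs:///apps/myapp/application.conf \
    //   --conf "spark.driver.extraJavaOptions=-Dconfig.file=application.conf" \
    //   --conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
    //   --class com.example.StreamingApp myapp-assembly.jar
    import com.typesafe.config.{Config, ConfigFactory}

    object StreamingApp {
      def main(args: Array[String]): Unit = {
        // -Dconfig.file makes load() read the shipped file instead of the classpath default
        val conf: Config = ConfigFactory.load()
        val topic = conf.getString("kafka.topic") // illustrative key name
        println(s"Consuming from Kafka topic $topic")
      }
    }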

How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)?

2016-02-22 Thread Jobs
Hi, I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc. Now I want to run my application with spark-submit (cluster mode) in a cluster. The jar file with all

Re: Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Jonathan Kelly
My code looks like > >> import org.apache.spark._ >> import org.apache.spark.sql._ >> val hadoopConf = new org.apache.hadoop.conf.Configuration() >> val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new >> java.net.URI("hdfs://xxx.xx.xx.xxx:8020"), hadoopConf)

Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Divya Gehlot
stem.get(new > java.net.URI("hdfs://xxx.xx.xx.xxx:8020"), hadoopConf) > hdfsConn.listStatus(new > org.apache.hadoop.fs.Path("/TestDivya/Spark/ParentDir/")).foreach{ > fileStatus => >val filePathName = fileStatus.getPath().toString() >val fileName = fil
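
For context, a minimal sketch of the pattern under discussion (Spark 1.x with the com.databricks:spark-csv package, in a spark-shell where sqlContext exists; the namenode address is a placeholder). The snippet above already converts the Hadoop Path to a String before handing it to load(), which expects a String:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val hadoopConf = new Configuration()
    val fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), hadoopConf)

    fs.listStatus(new Path("/TestDivya/Spark/ParentDir/")).foreach { status =>
      val filePathName: String = status.getPath.toString   // String, not Path
      val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(filePathName)
      println(s"$filePathName -> ${df.count()} rows")
    }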

Re: How to update data saved as parquet in hdfs using Dataframes

2016-02-17 Thread Arkadiusz Bicz
Hi, HDFS is append-only, so to modify data you need to read it and write it back to another location. On Wed, Feb 17, 2016 at 2:45 AM, SRK <swethakasire...@gmail.com> wrote: > Hi, > > How do I update data saved as Parquet in hdfs using dataframes? If I use > SaveMode.Append, it just seems
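
As a concrete illustration of the read-and-write-elsewhere approach (a sketch only; the path and column names are made up, and sqlContext from the shell is assumed):

    import org.apache.spark.sql.functions.lit

    val df = sqlContext.read.parquet("/data/events/current")

    // the "update" is expressed as a transformation...
    val updated = df.withColumn("status", lit("processed"))

    // ...and written to a new location; the old directory can then be swapped out
    // (e.g. with FileSystem.rename) once the write succeeds
    updated.write.parquet("/data/events/current_new")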

Re: Write spark eventLog to both HDFS and local FileSystem

2016-02-13 Thread nsalian
Hi, Thanks for the question. 1) The core-site.xml holds the parameter for the defaultFS: fs.defaultFS hdfs://:8020 This will be appended to your value in spark.eventLog.dir. So depending on which location you intend to write it to, you can point it to either HDFS or local. As far

How to store documents in hdfs and query them by id using Hive/Spark SQL

2016-02-13 Thread SRK
Hi, We have a requirement wherein we need to store documents in HDFS. The documents are nothing but JSON strings. We should be able to query them by id using Spark SQL/Hive context as and when needed. What would be the correct approach to do this? Thanks!
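
A minimal sketch of one way to do this with Spark 1.x (assuming one JSON document per line and an "id" field; the path and field names are illustrative):

    val docs = sqlContext.read.json("hdfs:///data/documents/")
    docs.registerTempTable("documents")

    // query by id as and when needed
    sqlContext.sql("SELECT * FROM documents WHERE id = '12345'").show()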

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
-- > Date: Thu, 11 Feb 2016 17:29:00 -0800 > Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS > ClassNotFoundException > From: yuzhih...@gmail.com > To: charliewri...@live.ca > CC: d...@spark.apache.org > > Hdfs class is in ha

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
10 > > I am using the 1.6.0 release. > > > Charles. > > -- > Date: Thu, 11 Feb 2016 17:41:54 -0800 > Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS > ClassNotFoundException > From: yuzhih...@gmail.com > To: char

How to do a look up by id from files in hdfs inside a transformation/action ina RDD

2016-02-09 Thread SRK
Hi, How do I do a lookup by id against a set of records stored in HDFS from inside a transformation/action of an RDD? Thanks, Swetha
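
One common pattern for this (a sketch, not necessarily what the thread settled on): load the lookup data once on the driver, broadcast it, and reference the broadcast inside the closure. The CSV layout and the events RDD are assumptions.

    // lookup file assumed to be lines of "id,value"
    val lookup: Map[String, String] = sc.textFile("hdfs:///data/lookup/")
      .map { line => val Array(id, value) = line.split(","); (id, value) }
      .collectAsMap()
      .toMap

    val lookupBc = sc.broadcast(lookup)

    // events is assumed to be an RDD[(String, String)] of (id, payload)
    val enriched = events.map { case (id, payload) =>
      (id, payload, lookupBc.value.getOrElse(id, "unknown"))
    }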

Connect to two different HDFS servers with different usernames

2016-02-03 Thread Wayne Song
Is there any way to get data from HDFS (e.g. with sc.textFile) with two separate usernames in the same Spark job? For instance, if I have a file on hdfs-server-1.com and the alice user has permission to view it, and I have a file on hdfs-server-2.com and the bob user has permission to view

Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
I configured HDFS to cache a file in HDFS's centralized cache, as follows: hdfs cacheadmin -addPool hibench hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench But I didn't see much performance impact, no matter how I configured dfs.datanode.max.locked.memory. Is it possible

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Have you read this thread ? http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+ Cheers On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou <jacqueline...@gmail.com> wrote: > I configured HDFS to cache file in HDFS's cache, like

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Please see also: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html According to Chris Nauroth, an hdfs committer, it's extremely difficult to use the feature correctly. The feature also brings operational complexity. Since off-heap memory

how to save Matrix type result to hdfs file using java

2016-01-24 Thread zhangjp
Hi all, I have calculated a covariance; it is a Matrix type. Now I want to save the result to HDFS. How can I do it? Thanks

how to save Matrix type result to hdfs file using java

2016-01-24 Thread zhangjp
Hi all, I have calculated a covariance; it is a Matrix type. Now I want to save the result to HDFS. How can I do it? Thanks

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread zhangjp
Hi Yanbo, I'm using the Java language and the environment is Spark 1.4.1. Could you tell me how to do it in more detail? The following is my code; how can I save the cov to an HDFS file? " RowMatrix mat = new RowMatrix(rows.rdd()); Matrix cov = mat.computeCovar
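
Since the covariance is a local Matrix, one hedged option (shown in Scala for brevity; the Java API exposes the same methods) is to flatten it into lines of text and let Spark write them to HDFS. MatrixUDT, mentioned above, is the DataFrame-based alternative. The path is a placeholder.

    import org.apache.spark.mllib.linalg.Matrix

    def saveMatrixAsText(sc: org.apache.spark.SparkContext, m: Matrix, path: String): Unit = {
      // one comma-separated line per matrix row
      val lines = (0 until m.numRows).map { i =>
        (0 until m.numCols).map(j => m(i, j)).mkString(",")
      }
      sc.parallelize(lines, 1).saveAsTextFile(path)   // e.g. "hdfs:///tmp/covariance"
    }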

Re: python - list objects in HDFS directory

2016-01-23 Thread Ted Yu
Is 'hadoop' / 'hdfs' command accessible to your python script ? If so, you can call 'hdfs dfs -ls' from python. Cheers On Sat, Jan 23, 2016 at 4:08 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > Hello, > > I would like to make a list of files (parquet or json

distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Deenar Toraskar
Hi I have data in HDFS partitioned by a logical key and would like to preserve the partitioning when creating a dataframe for the same. Is it possible to create a dataframe that preserves partitioning from HDFS or the underlying RDD? Regards Deenar

Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball
DataFrame operations which shuffle, you will end up implicitly re-partitioning to spark.sql.shuffle.partitions (default 200). Simon > On 13 Jan 2016, at 10:09, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > > Hi > > I have data in HDFS partitioned by a logic

Read HDFS file from an executor(closure)

2016-01-12 Thread Udit Mehta
Hi, Is there a way to read a text file from inside a Spark executor? I need to do this for a streaming application where we need to read a file (whose contents would change) from a closure. I cannot use the "sc.textFile" method since the Spark context is not serializable. I also cannot read a file
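
A sketch of the usual workaround (the path and the RDD are placeholders): build a Hadoop FileSystem client inside the closure, so nothing non-serializable is captured from the driver.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // someRdd is assumed to be an RDD[String]
    val filtered = someRdd.mapPartitions { iter =>
      // created on the executor, once per partition
      val fs = FileSystem.get(new Configuration())
      val in = fs.open(new Path("/config/rules.txt"))
      val rules = Source.fromInputStream(in).getLines().toSet
      in.close()
      iter.filter(rules.contains)
    }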

Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-08 Thread Henrik Baastrup
Hi Ewan, Thank you for your answer. I have already tried what you suggest. If I use: "hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC" I get the AssertionError exception: Exception in thread "main" java.lang.AssertionError: assertion failed: No pre

Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-08 Thread Henrik Baastrup
File coreSite = new File("/etc/hadoop/conf/core-site.xml"); File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml"); Configuration hConf = sc.hadoopConfiguration(); hConf.addResource(new Path(coreSite.getAbsolutePath())); hConf.addResource(new P

Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Prem Sure
rg.apache.spark.sql.SQLContext(sc) > # val reader = sqlCont.read > # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC") > # dataFrame.registerTempTable("BICC") > # val recSet = sqlCont.sql("SELECT > protocolCode,beginTime,endTime,called,calling F

Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Henrik Baastrup
all function correct: # val sqlCont = new org.apache.spark.sql.SQLContext(sc) # val reader = sqlCont.read # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC") # dataFrame.registerTempTable("BICC") # val recSet = sqlCont.sql("SELECT protocolCode,beginTime,en
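
Pulling the fragments of this thread together, a hedged end-to-end sketch (the namenode address is a placeholder; the config file locations and column names come from the snippets above):

    import org.apache.hadoop.fs.Path

    // make the client-side Hadoop configuration visible to Spark
    val hConf = sc.hadoopConfiguration
    hConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
    hConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))

    val sqlCont = new org.apache.spark.sql.SQLContext(sc)
    // 8020 is the usual NameNode RPC port (7077 is the Spark master port)
    val dataFrame = sqlCont.read.parquet("hdfs://namenode:8020/user/hdfs/parquet-multi/BICC")
    dataFrame.registerTempTable("BICC")
    sqlCont.sql("SELECT protocolCode,beginTime,endTime,called,calling FROM BICC LIMIT 10").show()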

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Don Drake
You will need to use the HDFS API to do that. Try something like: val conf = sc.hadoopConfiguration val fs = org.apache.hadoop.fs.FileSystem.get(conf) fs.rename(new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"), new org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.t
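
A completed version of the snippet above, for reference (same placeholder paths; sc from the shell is assumed):

    val conf = sc.hadoopConfiguration
    val fs = org.apache.hadoop.fs.FileSystem.get(conf)

    // a move: rename the file into the target directory
    fs.rename(
      new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"),
      new org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt"))

    // for a copy instead of a move, FileUtil can be used, e.g.
    // org.apache.hadoop.fs.FileUtil.copy(fs, srcPath, fs, dstPath, false /* deleteSource */, conf)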

copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Zhiliang Zhu
For some file on HDFS, it is necessary to copy/move it to another specific HDFS directory, with the name kept unchanged. I need to do this in a Spark program, not with HDFS commands. Is there any code for this? It does not seem to be covered by searching the Spark docs... Thanks in advance!

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread ayan guha
My guess is no, unless you are okay with reading the data and writing it back again. On Tue, Jan 5, 2016 at 2:07 PM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid> wrote: > > For some file on hdfs, it is necessary to copy/move it to some another > specific hdfs directory, and the directory

Re: How to load partial data from HDFS using Spark SQL

2016-01-02 Thread swetha kasireddy
from table where id = ") > //filtered data frame > df.count > > On Sat, Jan 2, 2016 at 11:56 AM, SRK <swethakasire...@gmail.com> wrote: > >> Hi, >> >> How to load partial data from hdfs using Spark SQL? Suppose I want to load >> data based on a filter like &

How to load partial data from HDFS using Spark SQL

2016-01-01 Thread SRK
Hi, How to load partial data from hdfs using Spark SQL? Suppose I want to load data based on a filter like "Select * from table where id = " using Spark SQL with DataFrames, how can that be done? The idea here is that I do not want to load the whole data into memory when I use the

Re: How to load partial data from HDFS using Spark SQL

2016-01-01 Thread UMESH CHAUDHARY
OK, so what's wrong with using: var df=HiveContext.sql("Select * from table where id = ") //filtered data frame df.count On Sat, Jan 2, 2016 at 11:56 AM, SRK <swethakasire...@gmail.com> wrote: > Hi, > > How to load partial data from hdfs using Spark SQL? Suppose I
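
For what it's worth, a sketch of the same filter-at-read-time idea with the DataFrame API (the path and predicate are placeholders). With Parquet the predicate can be pushed down, so only matching row groups are read rather than the whole dataset:

    val df = sqlContext.read.parquet("hdfs:///data/table/")
    val subset = df.filter(df("id") === "12345")
    subset.count()

    // the equivalent through SQL over a registered table
    df.registerTempTable("t")
    sqlContext.sql("SELECT * FROM t WHERE id = '12345'").show()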

Re: Monitoring Spark HDFS Reads and Writes

2015-12-31 Thread Steve Loughran
nsive a set of applications are. Closest thing I have seen is > the HDFS DataNode Logs in YARN but they don't seem to have Spark > applications specific reads and writes. > > 2015-12-21 18:29:15,347 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: > /127.0.0.1:53

Re: Monitoring Spark HDFS Reads and Writes

2015-12-31 Thread Arkadiusz Bicz
Hello, Spark collects HDFS read/write metrics per application/job; see details at http://spark.apache.org/docs/latest/monitoring.html. I have connected the Spark metrics to Graphite and display nice graphs in Grafana. BR, Arek On Thu, Dec 31, 2015 at 2:00 PM, Steve Loughran <

Monitoring Spark HDFS Reads and Writes

2015-12-30 Thread alvarobrandon
Hello: Is there any way of monitoring the number of bytes or blocks read and written by a Spark application? I'm running Spark with YARN and I want to measure how I/O-intensive a set of applications are. The closest thing I have seen is the HDFS DataNode logs in YARN but they don't seem to have

Re: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-29 Thread Sourav Mazumder
ow to use sparkR or spark MLlib load csv file on hdfs > then calculate covariance > > Now I have huge columns, about 5k-20k, so if I want to calculate the > covariance matrix, which is the best or most common method? > > -- Original Message

how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread zhangjp
Hi all, I want to use SparkR or Spark MLlib to load a CSV file on HDFS and then calculate covariance. How do I do it? Thanks.

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Yanbo Liang
> hi all, > I want to use sparkR or spark MLlib load csv file on hdfs then > calculate covariance, how to do it . > thks. >

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Andy Davidson
"user @spark" <user@spark.apache.org> Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance > Load csv file: > df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", > header = "true")

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Felix Cheung
ber 28, 2015 10:24 AM Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance To: zhangjp <592426...@qq.com>, Yanbo Liang <yblia...@gmail.com> Cc: user <user@spark.apache.org> Hi Yanbo I use spark.csv to load my data se

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread zhangjp
To: "Andy Davidson"<a...@santacruzintegration.com>; "zhangjp"<592426...@qq.com>; "Yanbo Liang"<yblia...@gmail.com>; Cc: "user"<user@spark.apache.org>; Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then cal

RE: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Sun, Rui
: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance Now I have huge columns, about 5k-20k, so if I want to calculate the covariance matrix, which is the best or most common method? -- Original Message -- From: "Felix Cheung";&
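
For the Scala/MLlib side of this thread, a hedged sketch of the whole pipeline (the spark-csv package is assumed on the classpath; the file path and the assumption that every column is numeric are illustrative). Note that computeCovariance() returns a local nCols x nCols matrix, so memory use grows quadratically at 5k-20k columns.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///data/wide_table.csv")

    // every column is assumed numeric; each row becomes a dense vector
    val rows = df.map(row => Vectors.dense(row.toSeq.map(_.toString.toDouble).toArray))

    val mat = new RowMatrix(rows)
    val cov = mat.computeCovariance()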

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
"Batman")).toDF("year","title") df.write.partitionBy("year").avro("/tmp/data") val df2 = Seq((2013, "Batman")).toDF("year","title") df2.write.partitionBy("year").avro("/tmp/data") As you can see,

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Michael Armbrust
e.partitionBy("year").avro("/tmp/data") >> >> val df2 = Seq((2013, "Batman")).toDF("year","title") >> >> df2.write.partitionBy("year").avro("/tmp/data") >> >> >> As yo

Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Hi, I'm stuck with writing partitioned data to HDFS. The example below ends up with an 'already exists' error. I'm wondering how to handle the streaming use case. What is the intended way to write streaming data to HDFS? What am I missing? cheers, -jan import com.databricks.spark.avro._ import

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
mobile, excuse brevity. On Dec 22, 2015 2:31 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote: > Hi, > I'm stuck with writing partitioned data to hdfs. Example below ends up > with 'already exists' -error. > > I'm wondering how to handle streaming use case. > &

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
ng example where each batch would create a new distinct directory. Granularity has no impact. No matter how data is partitioned, second 'batch' always fails with existing base dir. scala> df2.write.partitionBy("year").avro("/tmp/data") org.apache.spark.sql.AnalysisException:

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
e a new distinct directory. > > Granularity has no impact. No matter how data is partitioned, second > 'batch' always fails with existing base dir. > > scala> df2.write.partitionBy("year").avro("/tmp/data") > or

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
") > > df.write.partitionBy("year").avro("/tmp/data") > > val df2 = Seq((2013, "Batman")).toDF("year","title") > > df2.write.partitionBy("year").avro("/tmp/data") > > > As you can see, it complains abou

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
"/tmp/data") val df2 = Seq((2013, "Batman")).toDF("year","title") df2.write.partitionBy("year").avro("/tmp/data") As you can see, it complains about the target directory (/tmp/data) and not about the partitioni
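
For reference, a sketch of the append variant that this error points toward (spark-avro for Spark 1.x, in a spark-shell so sqlContext.implicits._ is available; whether appending fits the streaming semantics discussed above is a separate question):

    import com.databricks.spark.avro._
    import org.apache.spark.sql.SaveMode
    import sqlContext.implicits._

    val df = Seq((2012, "Batman")).toDF("year", "title")
    df.write.partitionBy("year").avro("/tmp/data")

    val df2 = Seq((2013, "Batman")).toDF("year", "title")
    // SaveMode.Append adds the new year=2013 partition instead of failing on the existing base dir
    df2.write.mode(SaveMode.Append).partitionBy("year").avro("/tmp/data")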

Re: HDFS

2015-12-14 Thread Akhil Das
com> wrote: > hi Folks > > I am using standalone cluster of 50 servers on aws. i loaded data on hdfs, > why i am getting Locality Level as ANY for data on hdfs, i have 900+ > partitions. > > > -- > with Regards > Shahid Ashraf >

HDFS

2015-12-11 Thread shahid ashraf
Hi folks, I am using a standalone cluster of 50 servers on AWS. I loaded data onto HDFS; why am I getting Locality Level ANY for data on HDFS? I have 900+ partitions. -- with Regards Shahid Ashraf

Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-09 Thread kali.tumm...@gmail.com
Hi Prateek, you mean writing Spark output to any storage system? Yes, you can. Thanks Sri

Logging spark output to hdfs file

2015-12-08 Thread sunil m
Hi! I configured the log4j.properties file in Spark's conf folder with the following values... log4j.appender.file.File=hdfs:// I expected all log output to go to the file in HDFS; instead the files are created locally. Has anybody tried logging to HDFS by configuring log4j.properties? Warm

Re: Logging spark output to hdfs file

2015-12-08 Thread Jörn Franke
This would require a special HDFS log4j appender. Alternatively try the flume log4j appender > On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote: > > Hi! > I configured log4j.properties file in conf folder of spark with following > values... > > lo

Re: epoch date format to normal date format while loading the files to HDFS

2015-12-08 Thread Andy Davidson
at 6:26 AM To: Andrew Davidson <a...@santacruzintegration.com> Subject: epoch date format to normal date format while loading the files to HDFS > Hi Andy, > > How are you? i need your help again. > > I have written a spark streaming program in Java to access twitter tweets a
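
Just for the epoch-to-readable-date piece of the question, a minimal sketch (shown in Scala for brevity, though the thread concerns a Java program; the tuple layout and format string are assumptions, and epoch seconds would need a * 1000L first):

    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    // tweets is assumed to be an RDD[(Long, String)] of (epochMillis, text)
    val withDates = tweets.map { case (epochMillis, text) =>
      (fmt.format(new java.util.Date(epochMillis)), text)
    }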

Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread Ted Yu
Can you clarify your use case ? Apart from hdfs, S3 (and possibly others) can be used. Cheers On Tue, Dec 8, 2015 at 9:40 AM, prateek arora <prateek.arora...@gmail.com> wrote: > Hi > > Is it possible into spark to write only RDD transformation into hdfs or any > ot

can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread prateek arora
Hi, Is it possible in Spark to write only an RDD transformation to HDFS or any other storage system? Regards Prateek

Re: Spark 1.5.2 getting stuck when reading from HDFS in YARN client mode

2015-12-05 Thread manasdebashiskar
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-5-2-getting-stuck-when-reading-from-HDFS-in-YARN-client-mode-tp25527p25589.html Sent from the Apache Spark User List mailing list archive at

RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-12-02 Thread Lin, Hao
/hive on HDFS should be writable error Hi, Actually I went back to 1.3, 1.3.1 and 1.4 and built spark from source code with no luck. So I am not sure if any good result is going to come from back tracking? Cheers, Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running

RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-12-02 Thread Lin, Hao
I actually don't have the folder /tmp/hive created in my master node, is that a problem? From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: Wednesday, December 02, 2015 5:40 PM To: Lin, Hao; user@spark.apache.org Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-26 Thread Gylfi
HDFS has a default replication factor of 3.

controlling parquet file sizes for faster transfer to S3 from HDFS

2015-11-26 Thread AlexG
0 actually gives smaller parts. I'd like to be able to specify the size of the parts- directly rather than guess and check what coalesce value to use. Why I care: my data is ~3Tb in Parquet form, with about 16 thousand files of around 200MB each. Transferring this from HDFS on EC2 to S3 based on th
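
As far as I know there is no direct output-file-size setting here, but the part count can be fixed explicitly before the write, which amounts to the same thing once the total size is known (a sketch; the paths and target part count are illustrative):

    val df = sqlContext.read.parquet("hdfs:///data/in")

    // coalesce avoids a shuffle; repartition costs a shuffle but evens out part sizes
    df.coalesce(3000).write.parquet("hdfs:///data/out")
    // df.repartition(3000).write.parquet("hdfs:///data/out")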

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Alex Gittens
Thanks, the issue was indeed the dfs replication factor. To fix it without entirely clearing out HDFS and rebooting, I first ran hdfs dfs -setrep -R -w 1 / to reduce all the current files' replication factor to 1 recursively from the root, then I changed the dfs.replication factor in ephemeral

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
> entirely clearing out HDFS and rebooting, I first ran > hdfs dfs -setrep -R -w 1 / > to reduce all the current files' replication factor to 1 recursively from > the root, then I changed the dfs.replication factor in > ephemeral-hdfs/conf/hdfs-site.xml and ran ephemeral-hdfs/sbin/stop-a

Re: Spark Streaming idempotent writes to HDFS

2015-11-25 Thread Steve Loughran
and be able to set the output filenames. that's how everything else does it: relies on rename() being atomic and O(1) on HDFS. Just create the temp dir with the same parent dir as the destination, so in encrypted HDFS they are both in the same encryption zone. And know that renames in S3n/s3a

Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread AlexG
I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster with 16.73 Tb storage, using distcp. The dataset is a collection of tar files of about 1.7 Tb each. Nothing else was stored in the HDFS, but after completing the download, the namenode page says that 11.59 Tb are in use

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Koert Kuipers
what is your hdfs replication set to? On Wed, Nov 25, 2015 at 1:31 AM, AlexG <swift...@gmail.com> wrote: > I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 > cluster > with 16.73 Tb storage, using > distcp. The dataset is a collection of tar files of

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG: Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 = 11.4 TB. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote: > I downloaded a 3.8 T dataset from S3 to a freshly launched sp

Re: Spark Streaming idempotent writes to HDFS

2015-11-24 Thread Michael
to process log >> files from S3 and save them on hadoop to later analyze them with >> sparkSQL. >> Everything works well except when I kill the spark application and >> restart it: it picks up from the latest processed batch and reprocesses >> it which results in dupl

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
sands of empty files being created on HDFS? > > Hi Andy > > You can try sc.wholeTextFiles() instead of sc.textFile() > > Regards > Sab > > On 24-Nov-2015 4:01 am, "Andy Davidson" <a...@santacruzintegration.com> wrote: >> Hi Xiao and Sabarish &g

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
gt;, "user @spark" <user@spark.apache.org> Subject: Re: newbie : why are thousands of empty files being created on HDFS? > I'm seeing similar slowness in saveAsTextFile(), but only in Python. > > I'm sorting data in a dataframe, then transform it and get a RDD, and then > coale

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
calls sample(0.01).filter(not null).saveAsTextFile(). This takes about 35 min to scan 500,000 JSON strings and write 5,000 to disk. The total data written is 38 MB. The data is read from HDFS. My understanding is that Spark cannot know in advance how HDFS partitioned the data. Spark knows I have

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Xiao Li
0,000 JSON strings and write 5000 to disk. > The total data writing in 38M. > > The data is read from HDFS. My understanding is Spark can not know in > advance how HDFS partitioned the data. Spark knows I have a master and 3 > slaves machines. It knows how many works/executors are

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
') The data was originally collected using Spark Streaming. I noticed that the number of default partitions == the number of files created on HDFS. I bet each file is one Spark Streaming mini-batch. I suspect if I concatenate these into a small number of files things will run much faster. I suspect I

Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Michael
it, which results in duplicate data on HDFS. How can I make the writing step to HDFS idempotent? I couldn't find any way to control, for example, the filenames of the parquet files being written; the idea is to include the batch time so that the same batch always gets written to the same path. I've

Re: Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Burak Yavuz
analyze them with > sparkSQL. > Everything works well except when I kill the spark application and > restart it: it picks up from the latest processed batch and reprocesses > it which results in duplicate data on hdfs. > > How can I make the writing step on hdfs idempotent ? I
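
One hedged approach that comes out of this discussion: derive the output path purely from the batch time, so a reprocessed batch overwrites the same directory instead of adding duplicates. The base path is a placeholder, and the DStream is assumed to carry a case class so toDF() works with sqlContext.implicits._ in scope.

    import org.apache.spark.streaming.Time

    dstream.foreachRDD { (rdd, time: Time) =>
      val out = s"hdfs:///data/output/batch=${time.milliseconds}"
      // same batch time => same path, so a restart rewrites rather than duplicates
      rdd.toDF().write.mode("overwrite").parquet(out)
    }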

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Don Drake
; > > The data was originally collected using spark stream. I noticed that the > number of default partitions == the number of files create on hdfs. I bet > each file is one spark streaming mini-batchI suspect if I concatenate > these into a small number of files things w

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-21 Thread Sabarish Sasidharan
idson" <a...@santacruzintegration.com> wrote: > I start working on a very simple ETL pipeline for a POC. It reads a in a > data set of tweets stored as JSON strings on in HDFS and randomly selects > 1% of the observations and writes them to HDFS. It seems to run very > slowly. E.G. To write 4720 observat
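
A sketch of the coalesce-before-save idea raised in this thread: a 1% sample of a heavily partitioned RDD leaves most partitions empty, and saveAsTextFile() writes one (possibly empty) part file per partition, hence the thousands of empty files. Paths and numbers are illustrative.

    val tweets = sc.textFile("hdfs:///data/tweets/")            // many small input files
    val sampled = tweets.sample(withReplacement = false, 0.01)
      .filter(_ != null)
      .coalesce(8)                                              // collapse to a few output parts
    sampled.saveAsTextFile("hdfs:///data/tweets_sample/")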

Does spark streaming write ahead log writes all received data to HDFS ?

2015-11-20 Thread kali.tumm...@gmail.com
Hi All, If write-ahead logs are enabled in Spark Streaming, does all the received data get written to the HDFS path, or only the metadata? How does clean-up work? Does the HDFS path get bigger and bigger every day, or do I need to write a clean-up job to delete data from the write-ahead logs

FW: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-11-20 Thread Mich Talebzadeh
From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 20 November 2015 21:14 To: u...@hive.apache.org Subject: starting spark-shell throws /tmp/hive on HDFS should be writable error Hi, Has this been resolved. I don't think this has anything to do with /tmp/hive directory permission

spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
expect, i.e. in valid JSON format. The key names are double quoted. Boolean values are the words true or false in lower case. When I run in my cluster the only difference is I call data.saveAsTextFiles() using an hdfs: URI instead of using file:/// . When the files are written to HDFS the JSON

Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Eran Medan
Hi I'm looking for some benchmarks on joining data frames where most of the data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is still in RDBMS. I am only looking at the very first join before any caching happens, and I assume there will be loss of par
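
Not a benchmark, but for concreteness the shape of the join in question looks roughly like this (the JDBC URL, table and key names are placeholders; the JDBC side is re-read on each action unless cached):

    val facts = sqlContext.read.parquet("hdfs:///warehouse/facts/")

    val reference = sqlContext.read.jdbc(
      "jdbc:postgresql://dbhost:5432/meta", "reference_table", new java.util.Properties())

    val joined = facts.join(reference, facts("ref_id") === reference("id"))
    joined.explain()   // shows whether a broadcast or shuffle join was chosen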

RE: hdfs-ha on mesos - odd bug

2015-11-11 Thread Buttler, David
I have verified that this error exists on my system as well, and the suggested workaround also works. Spark version: 1.5.1; 1.5.2 Mesos version: 0.21.1 CDH version: 4.7 I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the correct place, and I have also linked in the hdfs

Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Sanjay Subramanian
Cool, thanks. I have a CDH 5.4.8 (Cloudera Starving Developers Version) cluster with 1 NN and 4 DN, and Spark is running but it's 1.3.x. I want to leverage this HDFS/Hive cluster for SparkR because we do all data munging here and produce datasets for ML. I am thinking of the following idea: 1. Add 2 datanodes

Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Ali Tajeldin EDU
make sure "/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv" is accessible on your slave node. -- Ali On Nov 9, 2015, at 6:06 PM, Sanjay Subramanian wrote: > hey guys > > I have a 2 node SparkR (1 master 1 slave)cluster on AWS

Is it possible Running SparkR on 2 nodes without HDFS

2015-11-09 Thread Sanjay Subramanian
hey guys, I have a 2 node SparkR (1 master, 1 slave) cluster on AWS using spark-1.5.1-bin-without-hadoop.tgz. Running the SparkR job on the master node: /opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0 --executor-cores

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
rquet file from the driver. I could use the HDFS API > but I am worried that it won't work on a secure cluster. I assume that the > method the executors use to write to HDFS takes care of managing Hadoop > security. However, I can't find the place where HDFS write happens in the > spark source.

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
I am not looking for Spark SQL specifically. My use case is that I need to save an RDD as a parquet file in HDFS at the end of a batch, then load it back and convert it into an RDD in the next batch. The RDD has a String and a Long as the key/value pairs. On Wed, Nov 4, 2015 at 11:52 PM, Stefano

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
How to convert a parquet file that is saved in hdfs to an RDD after reading the file from hdfs? On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> wrote: > Hi, > we are using avro with compression(snappy). As soon as you have enough > partitions, the saving won

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Hi, we are using Avro with compression (snappy). As soon as you have enough partitions, the saving won't be a problem IMHO. In general HDFS is pretty fast, S3 less so. The issue with storing data is that you will lose your partitioner (even though the RDD has it) at load time. There is a PR

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
gt; *e.g. if u have dataframe and working from java - toJavaRDD >> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>* >> () >> >> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> >> wr

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
gmail.com> wrote: > How to convert a parquet file that is saved in hdfs to an RDD after > reading the file from hdfs? > > On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> > wrote: > >> Hi, >> we are using avro with compression(snappy). As s

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
ng from java - toJavaRDD > <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>* > () > > On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> > wrote: > >> How to convert a parquet file that is saved

Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread swetha
Hi, What is an efficient approach to save an RDD as a file in HDFS and retrieve it back? I was deciding between Avro, Parquet and SequenceFileFormat. We currently use SequenceFileFormat for one of our use cases. Any example on how to store and retrieve an RDD in an Avro and Parquet file

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread Stefano Baghino
<swethakasire...@gmail.com> wrote: > Hi, > > What is the efficient approach to save an RDD as a file in HDFS and > retrieve > it back? I was thinking between Avro, Parquet and SequenceFileFormart. We > currently use SequenceFileFormart for one of our use cases. > > Any e
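
A sketch of the round trip being discussed, for an RDD[(String, Long)] (spark-shell assumed so sqlContext.implicits._ can be imported; the checkpoint path is a placeholder):

    import sqlContext.implicits._

    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L)))

    // end of batch N: persist as Parquet
    pairs.toDF("key", "value").write.parquet("hdfs:///checkpoints/batch-001")

    // start of batch N+1: read back and return to the same RDD shape
    val restored = sqlContext.read.parquet("hdfs:///checkpoints/batch-001")
      .map(row => (row.getString(0), row.getLong(1)))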

Looking for the method executors uses to write to HDFS

2015-11-04 Thread Tóth Zoltán
Hi, I'd like to write a parquet file from the driver. I could use the HDFS API but I am worried that it won't work on a secure cluster. I assume that the method the executors use to write to HDFS takes care of managing Hadoop security. However, I can't find the place where HDFS write happens

Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Ted Yu
I am a bit curious: why is the synchronization on finalLock needed? Thanks > On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal <anubha...@gmail.com> wrote: > > I have a spark job that creates 6 million rows in RDDs. I convert the RDD > into Data-frame and write it to HDFS. Cu

Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Anubhav Agarwal
creates 6 million rows in RDDs. I convert the RDD > into Data-frame and write it to HDFS. Currently it takes 3 minutes to write > it to HDFS. > > Here is the snippet:- > RDDList.parallelStream().forEach(mapJavaRDD -> { > if (mapJavaR
