Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread yeshwanth kumar
> Use orc, parquet or avro as a format because they support any >>>> compression type with parallel processing. Alternatively split your file in >>>> several smaller ones. Another alternative would be bzip2 (but slower in >>>> general) or Lzo (usually it is not incl

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
>> wrote: >>> >>>> Use orc, parquet or avro as a format because they support any >>>> compression type with parallel processing. Alternatively split your file in >>>> several smaller ones. Another alternative would be bzip2 (but slower in >>>> g

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread yeshwanth kumar
>>> compression type with parallel processing. Alternatively split your file in >>> several smaller ones. Another alternative would be bzip2 (but slower in >>> general) or Lzo (usually it is not included by default in many >>> distributions). >>> >

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
<yeshwant...@gmail.com> wrote: >> >> Hi, >> >> we are running Hive on Spark, we have an external table over snappy >> compressed csv file of size 917.4 M >> HDFS block size is set to 256 MB >> >> as per my Understanding, if i run a query over that exte

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread yeshwanth kumar
e on Spark, we have an external table over snappy > compressed csv file of size 917.4 M > HDFS block size is set to 256 MB > > as per my Understanding, if i run a query over that external table , it > should launch 4 tasks. one for each block. > but i am seeing one executor an

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Jörn Franke
). > On 21 Nov 2016, at 23:17, yeshwanth kumar <yeshwant...@gmail.com> wrote: > > Hi, > > we are running Hive on Spark, we have an external table over snappy > compressed csv file of size 917.4 M > HDFS block size is set to 256 MB > > as per my Understanding, if i run

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Aniket Bhatnagar
le of size 917.4 M > HDFS block size is set to 256 MB > > as per my Understanding, if i run a query over that external table , it > should launch 4 tasks. one for each block. > but i am seeing one executor and one task processing all the file. > > trying to understand the rea

RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread yeshwanth kumar
Hi, we are running Hive on Spark. We have an external table over a snappy compressed csv file of size 917.4 M, and the HDFS block size is set to 256 MB. As per my understanding, if I run a query over that external table it should launch 4 tasks, one for each block, but I am seeing one executor and one
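A minimal sketch (Scala, assuming Spark 2.x with Hive support; the table name and output path are invented) that confirms the snappy-compressed text arrives as a single input partition and rewrites it into a splittable format, as the replies in this thread suggest:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("snappy-csv-split-check")
      .enableHiveSupport()
      .getOrCreate()

    // .snappy text is not splittable, so the whole 917 MB file becomes one
    // input split -> one task, regardless of the 256 MB HDFS block size.
    val df = spark.table("ext_snappy_csv")                    // hypothetical external table
    println(s"input partitions = ${df.rdd.getNumPartitions}") // expect 1 here

    // Rewriting into Parquet (still snappy-compressed, but splittable by
    // row group) lets later queries run several tasks in parallel.
    df.write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("/warehouse/ext_table_parquet")                // hypothetical target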

RDD to HDFS - Kerberos - authentication error - RetryInvocationHandler

2016-11-11 Thread Gerard Casey
Hi all, I have an RDD that I wish to write to HDFS. data.saveAsTextFile("hdfs://path/vertices") This returns: WARN RetryInvocationHandler: Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over null. Not retrying because try onc
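A hedged sketch (Scala) of logging in from a keytab on the driver before touching HDFS; the principal, keytab path and namenode address are placeholders, and `data` is the RDD from the post. When submitting to YARN, the usual alternative is spark-submit's --principal and --keytab options.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    // Explicit Kerberos login before any HDFS call (values are illustrative).
    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab(
      "gcasey@EXAMPLE.COM",                      // hypothetical principal
      "/etc/security/keytabs/gcasey.keytab")     // hypothetical keytab

    // With valid credentials the save should get past the
    // RetryInvocationHandler / getFileInfo authentication warning.
    data.saveAsTextFile("hdfs://namenode:8020/path/vertices")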

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-10 Thread Mich Talebzadeh
to get data from source as is and create file system format (csv, etc) as target, then those files can land on a directory. A cron job can then put those files in HDFS directories, and the rest is our choice of how to treat those files. Hive external tables can be used, plus using Hive or Spark to ingest data

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Michael Segel
that I have to do… -Mike On Nov 9, 2016, at 4:16 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Thanks guys, Sounds like let Informatica get the data out of RDBMS and create mapping to flat files that will be delivered to a directory vis

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
Thanks guys, Sounds like we let Informatica get the data out of the RDBMS and create a mapping to flat files that will be delivered to a directory visible to the HDFS host. Then push the csv files into HDFS. Then there are a number of options to work on: 1. run cron or oozie to get data out of HDFS

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Jörn Franke
alebza...@gmail.com> wrote: > > Hi, > > I am exploring the idea of flexibility with importing multiple RDBMS tables > using Informatica that customer has into HDFS. > > I don't want to use connectivity tools from Informatica to Hive etc. > > So this is what

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread ayan guha
Yes, it can be done and a standard practice. I would suggest a mixed approach: use Informatica to create files in hdfs and have hive staging tables as external tables on those directories. Then that point onwards use spark. Hth Ayan On 10 Nov 2016 04:00, "Mich Talebzadeh" <
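A rough sketch of that mixed approach (Scala, assuming Spark 2.x with Hive support; directory, table names and schema are invented for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("informatica-landing-ingest")
      .enableHiveSupport()
      .getOrCreate()

    // External staging table over the directory where the Informatica-produced
    // CSV files land; Spark only reads, it never owns or moves these files.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS customers_staging (
        id INT, name STRING, updated_ts STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'hdfs:///landing/informatica/customers'
    """)

    // From this point onwards everything is plain Spark SQL / DataFrames.
    spark.table("customers_staging")
      .write.mode("overwrite").format("orc")
      .saveAsTable("customers_curated")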

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
ich Talebzadeh <mich.talebza...@gmail.com> > wrote: > > Hi, > > I am exploring the idea of flexibility with importing multiple RDBMS > tables using Informatica that customer has into HDFS. > > I don't want to use connectivity tools from Informatica to Hive etc. > > So

importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
Hi, I am exploring the idea of flexibly importing multiple RDBMS tables into HDFS using the Informatica installation that the customer has. I don't want to use connectivity tools from Informatica to Hive etc. So this is what I have in mind: 1. If possible get the tables data out using Informatica

Re: Optimized way to use spark as db to hdfs etl

2016-11-06 Thread Sabarish Sasidharan
<rohit.ve...@rokittech.com> wrote: > >> I am using spark to read from database and write in hdfs as parquet file. >> Here is code snippet. >> >> private long etlFunction(SparkSession spark){ >> spark.sqlContext().setConf("spark.sql.parquet.compres

Re: Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Deepak Sharma
read from database and write in hdfs as parquet file. > Here is code snippet. > > private long etlFunction(SparkSession spark){ > spark.sqlContext().setConf("spark.sql.parquet.compression.codec", > “SNAPPY"); > Properties properties = new Properties(); > properti

Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Rohit Verma
I am using Spark to read from a database and write to HDFS as a parquet file. Here is a code snippet. private long etlFunction(SparkSession spark){ spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY"); Properties properties = new Properties(); propert
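The snippet above is Java; a hedged Scala sketch of the same database-to-Parquet ETL, with the JDBC URL, credentials, table and partitioning bounds invented for illustration. Partitioning the JDBC read is usually what keeps the extract from running on a single connection.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("db-to-parquet-etl").getOrCreate()
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    val props = new Properties()
    props.setProperty("user", "etl_user")      // illustrative credentials
    props.setProperty("password", "secret")

    // Read the table split into 16 ranges of a numeric column, so the pull
    // from the database runs as 16 parallel tasks instead of one.
    val df = spark.read.jdbc(
      "jdbc:postgresql://dbhost:5432/sales",   // hypothetical source
      "public.orders",                         // source table
      "order_id",                              // numeric partition column
      1L, 10000000L,                           // lower/upper bound of that column
      16,                                      // number of partitions
      props)

    df.write.mode("overwrite").parquet("hdfs:///warehouse/orders_parquet")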

Slow Parquet write to HDFS using Spark

2016-11-03 Thread morfious902002
I am using Spark 1.6.1 and writing to HDFS. In some cases it seems like all the work is being done by one thread. Why is that? Also, I need parquet.enable.summary-metadata to register the parquet files to Impala. Df.write().partitionBy("COLUMN").parquet(outputFileLocation); It a
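A small sketch against Spark 1.6.x of one commonly suggested shape for this write (`df`, `sc` and the column/path stand in for the ones in the post): cluster the data by the partition column first so each task writes to only a few Hive-style partition directories. Note that the summary `_metadata` files requested via parquet.enable.summary-metadata are produced in the single job-commit step, which can itself look like one thread doing all the remaining work.

    // Keep the Impala-required summary metadata.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")

    // Repartition by the partition column before the partitioned write.
    df.repartition(df("COLUMN"))
      .write
      .partitionBy("COLUMN")
      .parquet(outputFileLocation)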

Spark 1.6.2 Concurrent append to a HDFS folder with different partition key

2016-09-24 Thread Shing Hing Man
I am trying to prototype using a single SqlContext instance and use it to append DataFrames, partitioned by a field, to the same HDFS folder from multiple threads. (Each thread will work with a DataFrame having a different partition column value.) I get the exception: 16/09/24 16:45:12 ERROR

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-19 Thread Gene Pang
> I am seeing similar issues when I was working on Oracle with Tableau as > the dashboard. > > Currently I have a batch layer that gets streaming data from > > source -> Kafka -> Flume -> HDFS > > It stored on HDFS as text files and a cron process sinks H

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Mich Talebzadeh
On 18 September 2016 at 11:08, Jörn Franke <jornfra...@gmail.com> wrote: > Ignite has a special cache for HDFS data (which is not a Java cache), for > rdds etc. So you are right it is in this sense very different. > > Besides caching, from what I see from data scientists is that f

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Jörn Franke
Ignite has a special cache for HDFS data (which is not a Java cache), for rdds etc. So you are right it is in this sense very different. Besides caching, from what I see from data scientists is that for interactive queries and models evaluation they anyway do not browse the complete data. Even

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Mich Talebzadeh
with a certain refresh of these tables and let the customer decide. I stand corrected otherwise. BTW I did this simple test using Zeppelin (running on Spark standalone mode): 1) Read data using Spark SQL from Flume text files on HDFS (real time) 2) Read data using Spark SQL from an ORC table
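A minimal Spark SQL sketch of the periodic-refresh idea (table name invented; sqlContext as predefined in a Zeppelin Spark interpreter):

    // After each batch refresh, pin the curated table in executor memory so
    // the dashboard queries read from cache instead of HDFS.
    sqlContext.sql("CACHE TABLE sales_orc")

    // ... interactive / dashboard queries run here ...

    // Drop the cached copy before the next refresh cycle re-ingests data.
    sqlContext.sql("UNCACHE TABLE sales_orc")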

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Sean Owen
Alluxio isn't a database though; it's storage. I may be still harping on the wrong solution for you, but as we discussed offline, that's also what Impala, Drill et al are for. Sorry if this was mentioned before but Ignite is what GridGain became, if that helps. On Sat, Sep 17, 2016 at 11:00 PM,

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Jörn Franke
). If the data is processed in parallel then IO will be done in parallel thanks to the architecture of HDFS. Oracle Exadata exploits similar concepts. The advantage of Ignite compared to e.g. Exadata would be that you also have the indexes of ORC and Parquet in-memory, which avoids reading data in-memory

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
>> HTH. >> >> -Todd >> >> >> On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> Hi, >>> >>> I am seeing similar issues when I was working on Oracle with Tableau as >>

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
h Tableau as >> the dashboard. >> >> Currently I have a batch layer that gets streaming data from >> >> source -> Kafka -> Flume -> HDFS >> >> It stored on HDFS as text files and a cron process sinks Hive table with >> the external table buil

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Todd Nist
<mich.talebza...@gmail.com > wrote: > Hi, > > I am seeing similar issues when I was working on Oracle with Tableau as > the dashboard. > > Currently I have a batch layer that gets streaming data from > > source -> Kafka -> Flume -> HDFS > > It stored on HDFS as t

Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Hi, I am seeing similar issues when I was working on Oracle with Tableau as the dashboard. Currently I have a batch layer that gets streaming data from source -> Kafka -> Flume -> HDFS It is stored on HDFS as text files and a cron process sinks a Hive table with the external ta

Re: Spark Streaming-- for each new file in HDFS

2016-09-16 Thread Steve Loughran
opy/rename the file once complete. (things are slightly complicated by the fact that HDFS doesn't update modtimes until (a) the file is closed or (b) enough data has been written that the write spans a block boundary. That means that incremental writes to HDFS may appear to work, but once yo

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Peyman Mohajerian
You can listen to files in a specific directory using: Take a look at: http://spark.apache.org/docs/latest/streaming-programming-guide.html streamingContext.fileStream On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke wrote: > Hi, > I recommend that the third party
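A short sketch of the directory-watching approach (Scala, Spark Streaming; path and batch interval are illustrative). As the other replies note, files must be moved or renamed into the watched directory atomically once they are fully written.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("hdfs-dir-watcher")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Each batch picks up only the files that appeared since the last batch.
    val lines = ssc.textFileStream("hdfs:///incoming/thirdparty")
    lines.foreachRDD { rdd =>
      println(s"new records this batch: ${rdd.count()}")  // stand-in processing
    }

    ssc.start()
    ssc.awaitTermination()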

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Jörn Franke
Hi, I recommend that the third party application puts an empty file with the same filename as the original file, but the extension ".uploaded". This is an indicator that the file has been fully (!) written to the fs. Otherwise you risk only reading parts of the file. Then, you can have a file

Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Kappaganthu, Sivaram (ES)
Hello, I am a newbie to spark and I have below requirement. Problem statement : A third party application is dumping files continuously in a server. Typically the count of files is 100 files per hour and each file is of size less than 50MB. My application has to process those files. Here

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
some other key). >>>> >>>> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com> >>>> wrote: >>>> >>>>> 1.) What needs to be parallelized is the work for each of those 6M >>>>> rows, not the 80K files.

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
> rows, not the 80K files. Let me elaborate this with a simple for loop if we >>>> were to write this serially. >>>> >>>> For each line L out of 6M in the first file{ >>>> process the file corresponding to L out of those 80K files. >>>> }

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
t;>> 1.) What needs to be parallelized is the work for each of those 6M rows, >>> not the 80K files. Let me elaborate this with a simple for loop if we were >>> to write this serially. >>> >>> For each line L out of 6M in the first file{ >>> p

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
6M rows, >> not the 80K files. Let me elaborate this with a simple for loop if we were >> to write this serially. >> >> For each line L out of 6M in the first file{ >> process the file corresponding to L out of those 80K files. >> } >> >> The 80K files a

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
onding to L out of those 80K files. > } > > The 80K files are in HDFS and to read all that content into each worker is > not possible due to size. > > 2. No. multiple rows may point to the same file but they operate on > different records within the file. > > 3.

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
files are in HDFS and to read all that content into each worker is not possible due to size. 2. No. multiple rows may point to the same file but they operate on different records within the file. 3. End goal is to write back 6M processed information. This is simple map only type scenario. One

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Question: 1. Why you can not read all 80K files together? ie, why you have a dependency on first text file? 2. Your first text file has 6M rows, but total number of files~80K. is there a scenario where there may not be a file in HDFS corresponding to the row in first text file? 3. May be a follow

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote: > >> Just wonder if this is possible with Spark? >> >> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com> >> wrote: >> >>> Hi, >>> >

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Raghavendra Pandey
: > Just wonder if this is possible with Spark? > > On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com> > wrote: > >> Hi, >> >> I've got a text file where each line is a record. For each record, I need >> to process a file in HDFS. >

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Just wonder if this is possible with Spark? On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com> wrote: > Hi, > > I've got a text file where each line is a record. For each record, I need > to process a file in HDFS. > > So if I represent these reco

Access HDFS within Spark Map Operation

2016-09-11 Thread Saliya Ekanayake
Hi, I've got a text file where each line is a record. For each record, I need to process a file in HDFS. So if I represent these records as an RDD and invoke a map() operation on them how can I access the HDFS within that map()? Do I have to create a Spark context within map
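One pattern that is often suggested for this kind of job is to open a Hadoop FileSystem handle inside mapPartitions rather than creating any Spark context in map(). A hedged sketch (Scala; the index file layout, file paths and per-record processing are invented):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // One record per line of the 6M-line driver file; each names one of the ~80K HDFS files.
    val indexRdd = sc.textFile("hdfs:///data/index.txt")

    val results = indexRdd.mapPartitions { lines =>
      // Build the Configuration/FileSystem on the executor, once per partition,
      // so nothing non-serializable is captured from the driver.
      val fs = FileSystem.get(new Configuration())
      lines.map { name =>
        val in = fs.open(new Path(s"/data/files/$name"))   // hypothetical layout
        try {
          val recordCount = Source.fromInputStream(in).getLines().size  // stand-in work
          s"$name\t$recordCount"
        } finally in.close()
      }
    }

    results.saveAsTextFile("hdfs:///data/processed")   // one output line per input row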

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-27 Thread kant kodali
I understand now that I cannot use the Spark Streaming window operation without checkpointing to HDFS, as pointed out by @Ofir, but without window operations I don't think we can do much with Spark Streaming. So since it is very essential, can I use Cassandra as a distributed storage? If so, can I see

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran
"partition with both groups thinking they are in charge", which is more dangerous. then there's "partitioning event not detected", which may be bad. so: consider the failure modes and then consider not so much whether the tech you are using is vulnerable to it, but "if

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
rue I will be left with the following stack Spark + Mesos scheduler + Distributed File System or to be precise I should say Distributed Storage since S3 is an object store so I guess this will be HDFS for us + etcd & consul. Now the big question for me is how do I set all this up

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Carlile, Ken
We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3. —Ken On Aug 25, 2016, at 8:02 PM

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Mich Talebzadeh
eper can do >> many things but for our use case all we need is for high availability and >> given the devops people frustrations here in our company who had extensive >> experience managing large clusters in the past we would be very happy to >> avoid Zookeeper. I also heard that Mesos can provide High Availability >> through etcd and consul and if that is true I will be left with the >> following stack >> >> >> >> >> >> Spark + Mesos scheduler + Distributed File System or to be precise I >> should say Distributed Storage since S3 is an object store so I guess this >> will be HDFS for us + etcd & consul. Now the big question for me is how do >> I set all this up >> >> >> >> >> >>

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
torage since S3 is an object store so I guess this will be HDFS for us + etcd & consul. Now the big question for me is how do I set all this up

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran
if that is true I will be left with the following stack Spark + Mesos scheduler + Distributed File System or to be precise I should say Distributed Storage since S3 is an object store so I guess this will be HDFS for us + etcd & consul. Now the big question for me is how do I set all this up [https

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
outputs, and will do so with good data locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan guha < guha.a...@gmail.com > wrote: At the core of it map reduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use hdfs. S3 or NFS will no

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
) will use the map outputs, and will do so with good data locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan guha < guha.a...@gmail.com > wrote: At the core of it map reduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use hdfs. S3

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
guha <guha.a...@gmail.com> wrote: > At the core of it map reduce relies heavily on data locality. You would > lose the ability to process data closest to where it resides if you do not > use hdfs. > S3 or NFS will not able to provide that. > On 26 Aug 2016 07:49, "kant k

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
> You would lose the ability to process data closest to where it resides if you do not use hdfs. This isn't true. Many other data sources (e.g. Cassandra) support locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan guha <guha.a...@gmail.com> wrote: > At the core of it map reduce re

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread ayan guha
At the core of it map reduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use hdfs. S3 or NFS will not be able to provide that. On 26 Aug 2016 07:49, "kant kodali" <kanth...@gmail.com> wrote: > yeah so its

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
os can provide High Availability through etcd and consul and if that is true I will be left with the following stack Spark + Mesos scheduler + Distributed File System or to be precise I should say Distributed Storage since S3 is an object store so I guess this will be HDFS for us + etcd & consul. N

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
> avoid Zookeeper. I also heard that Mesos can provide High Availability > through etcd and consul and if that is true I will be left with the > following stack > > Spark + Mesos scheduler + Distributed File System or to be precise I > should say Distributed Storage since S3 is an o

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
is an object store so I guess this will be HDFS for us + etcd & consul. Now the big question for me is how do I set all this up On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote: Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing h

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Ofir Manor
Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing For example, for Spark Streaming, you can not do any window operation in a cluster without checkpointing to HDFS (or S3
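A minimal illustration of that point (Scala; checkpoint directory, socket source and window sizes are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("windowed-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Window/stateful operations require a checkpoint directory on a
    // fault-tolerant store such as HDFS (or S3).
    ssc.checkpoint("hdfs:///spark/checkpoints/windowed-stream")

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(30))

    counts.print()
    ssc.start()
    ssc.awaitTermination()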

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
Hi Kant, I trust the following would be of use. Big Data depends on Hadoop Ecosystem from whichever angle one looks at it. In the heart of it and with reference to points you raised about HDFS, one needs to have a working knowledge of Hadoop Core System including HDFS, Map-reduce algorithm

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
te: >> >>> @Mich I understand why I would need Zookeeper. It is there for fault >>> tolerance given that spark is a master-slave architecture and when a master >>> goes down zookeeper will run a leader election algorithm to elect a new >>> leader ho

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
a new >> leader however DevOps hate Zookeeper they would be much happier to go with >> etcd & consul and looks like if we mesos scheduler we should be able to >> drop Zookeeper. >> >> HDFS I am still trying to understand why I would need for spark. I >> unde

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
ure and when a master > goes down zookeeper will run a leader election algorithm to elect a new > leader however DevOps hate Zookeeper they would be much happier to go with > etcd & consul and looks like if we mesos scheduler we should be able to > drop Zookeeper. > > HDFS I am stil

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
& consul, and it looks like if we use the Mesos scheduler we should be able to drop Zookeeper. HDFS I am still trying to understand why I would need it for Spark. I understand the purpose of distributed file systems in general, but I don't understand it in the context of Spark since many people say you can run a s

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
You can use Spark on Oracle as a query tool. It all depends on the mode of the operation. If you are running Spark with yarn-client/cluster then you will need YARN. It comes as part of Hadoop core (HDFS, Map-reduce and Yarn). I have not gone and installed Yarn without installing Hadoop. What

What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-24 Thread kant kodali
What do I lose if I run Spark without using HDFS or Zookeeper? Which of them is almost a must in practice?

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
d.com > > From: Sun Rui <sunrise_...@163.com> > Date: 2016-08-24 22:17 > To: Saisai Shao <sai.sai.s...@gmail.com> > CC: tony@tendcloud.com; user > <user@spark.apa

Re: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
oud.com > > > From: Sun Rui <sunrise_...@163.com> > Date: 2016-08-24 22:17 > To: Saisai Shao <sai.sai.s...@gmail.com> > CC: tony@tendcloud.com; user <user@spark.apache.org> > Subject: Re: Can we redirect Spark shuffle spill data to HDFS or >

Re: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread tony....@tendcloud.com
@tendcloud.com From: Sun Rui Date: 2016-08-24 22:17 To: Saisai Shao CC: tony@tendcloud.com; user Subject: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio? Yes, I also tried FUSE before, it is not stable and I don’t recommend it On Aug 24, 2016, at 22:15, Saisai Shao

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
>), but not so stable as I tried > before. > > On Wed, Aug 24, 2016 at 10:09 PM, Sun Rui <sunrise_...@163.com> > wrote: > For HDFS, maybe you can try mounting HDFS as NFS. But not sure about the > stability, and also there is additiona

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
Also fuse is another candidate (https://wiki.apache.org/hadoop/MountableHDFS), but not so stable as I tried before. On Wed, Aug 24, 2016 at 10:09 PM, Sun Rui <sunrise_...@163.com> wrote: > For HDFS, maybe you can try mount HDFS as NFS. But not sure about the > stabili

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
For HDFS, maybe you can try mounting HDFS as NFS. But I am not sure about the stability, and there is also the additional overhead of network I/O and replication of HDFS files. > On Aug 24, 2016, at 21:02, Saisai Shao <sai.sai.s...@gmail.com> wrote: > > Spark Shuffle uses Java File related API
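If HDFS (or another shared filesystem) is mounted locally, e.g. through an NFS gateway, redirecting shuffle spill there is only a configuration change; a hedged sketch with the mount point invented. On YARN the node manager's local directories take precedence, so the equivalent change would go into yarn.nodemanager.local-dirs instead.

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.local.dir controls where shuffle and spill files land.
    // /mnt/hdfs is a hypothetical NFS mount of HDFS; expect noticeable
    // network and replication overhead compared with real local disks.
    val conf = new SparkConf()
      .setAppName("shuffle-on-mounted-fs")
      .set("spark.local.dir", "/mnt/hdfs/tmp/spark-shuffle")

    val sc = new SparkContext(conf)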

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
data, Spark will do a shuffle and the > shuffle data will be written to local disk. Because we have limited capacity on > local disk, the shuffled data will occupy all of the local disk and then > the job will fail. So is there a way we can write the shuffle spill data to > HDFS? Or if we

Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread tony....@tendcloud.com
data to HDFS? Or if we introduce Alluxio in our system, can the shuffled data be written to Alluxio? Thanks and Regards, 阎志涛 (Tony), 北京腾云天下科技有限公司. Email: tony@tendcloud.com, Phone: 13911815695, WeChat:

回复:saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Maybe this problem is not so easy to understand, so I attached my full code. Hope this could help in solving the problem. Thanks and best regards! San.Luo - Original Message - From: <luohui20...@sina.com> To: "user" <user@spark.apache.org>

saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Hi there: I have a problem with saving a DF to HDFS in parquet format being very slow. I attached a pic which shows that a lot of time is spent in getting the result. The code is: streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData") I don't quite understand why my app

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote: >> >> Hello, >> >> the use case is as follows : >> >> say I am inserting 200K rows as dataframe.write.formate("parquet") etc >> etc (like a basic write to hdfs command

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
kha...@askme.in>> wrote: >> >> Hello, >> >> the use case is as follows : >> >> say I am inserting 200K rows as dataframe.write.formate("parquet") etc etc >> (like a basic write to hdfs command), but say due to some reason or rhyme

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
or nothing inside > (Spark job was failed). > > > > > > On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote: > > Hello, > > the use case is as follows : > > say I am inserting 200K rows as dataframe.write.formate("parquet") et

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote: > > Hello, > > the use case is as follows : > > say I am inserting 200K rows as dataframe.write.formate("parquet") etc etc > (like a basic write to hdfs command), but say due to some reason or

hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Sumit Khanna
Hello, the use case is as follows : say I am inserting 200K rows as dataframe.write.formate("parquet") etc etc (like a basic write to hdfs command), but say due to some reason or rhyme my job got killed when the run was in the middle of it, meaning let's say I was only able to insert
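Spark's file output normally goes through a _temporary directory and a _SUCCESS marker is written only at successful job commit, so one hedged way to make downstream readers behave "all or nothing" is to gate them on that marker (Scala; the path is invented):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputDir = new Path("hdfs:///warehouse/events_parquet")  // hypothetical
    val fs = FileSystem.get(new Configuration())

    if (fs.exists(new Path(outputDir, "_SUCCESS"))) {
      // job committed cleanly: safe to read / expose the directory
    } else {
      // previous run was killed mid-write: clean up and re-run
      fs.delete(outputDir, true)
    }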

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-31 Thread Andrew Ehrlich
I'm wondering if there is a way to do processing on each of these (the process in this > case is just getting the byte length but will be something else in the real > world) and then write the contents to separate HDFS files. > If this doesn't make sense, would it make more sense to have all contents in

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I'm wondering if there is a way to do processing on each of these (the process in this case is just getting the byte length but will be something else in the real world) and then write the contents to separate HDFS files. If this doesn't make sense, would it make more sense to have all contents in a single HDF

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread ayan guha
This sounds like a bad idea, given HDFS does not work well with small files. On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaara...@gmail.com> wrote: > I am reading bunch of files in PySpark using binaryFiles. Then I want to > get the number of bytes for each file and write this numbe

How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-30 Thread Bhaarat Sharma
I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name. Example: if the directory /myimages has one.jpg, two.jpg, and three.jpg then I want three files: one-success.jpg, two
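The question is PySpark, but a Scala sketch of the same idea (directory names invented), writing all (name, size) pairs into one consolidated output rather than one tiny HDFS file per image, in line with the small-files concern raised elsewhere in the thread:

    // (file path, PortableDataStream) per image; keep just name and byte count.
    val sizes = sc.binaryFiles("hdfs:///myimages")
      .map { case (path, stream) =>
        s"${path.split('/').last}\t${stream.toArray().length}"
      }

    // One output file instead of thousands of tiny ones.
    sizes.coalesce(1).saveAsTextFile("hdfs:///myimages_sizes")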

Re: how to copy local files to hdfs quickly?

2016-07-30 Thread Andy Davidson
<a...@santacruzintegration.com> Date: Wednesday, July 27, 2016 at 4:25 PM To: "user @spark" <user@spark.apache.org> Subject: how to copy local files to hdfs quickly? > I have a spark streaming app that saves JSON files to s3:// . It works fine > > Now I need to calculate

Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
vidson <a...@santacruzintegration.com>, Pedro Rodriguez <ski.rodrig...@gmail.com> Cc: "user @spark" <user@spark.apache.org> Subject: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming. > Hi Pedro > > I did some experi

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
Hi Pedro, I did some experiments using one of our relatively small data sets. The data set is loaded into 3 or 4 data frames. I then call count(). Looks like using bigger files and reading from HDFS is a good solution for reading data. I guess I'll need to do something similar to this to deal

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
>>>> Hey, >>>> >>>> the very first run : >>>> >>>> glossary : >>>> >>>> delta_df := current run / execution changes dataframe. >>>> >>>> def deduplicate : >>>> apply windowin

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
>>> apply windowing function and group by >>> >>> def partitionDataframe(delta_df) : >>> get unique keys of that data frame and then return an array of data >>> frames each containing just that very same key as the column. >>> this will give the

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
te(delta_df : delta_df [ with all unique primary / >> deduplicating key column ] >> 1. partitionDataframe(delta_df) : Array[delta_df(i to # partitons)] >> 2. write the dataframe to corresponding parent hdfs path + partiton dir_ >> >> subsequent runs : >> >> for e

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
a_df) : Array[delta_df(i to # partitons)] > 2. write the dataframe to corresponding parent hdfs path + partiton dir_ > > subsequent runs : > > for each partition : > 0. partitionDataframe(delta_df) : Array[delta_df(i to # partitons)] > 1. load df from previous hdfs location o

correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
. write the dataframe to corresponding parent hdfs path + partition dir_ subsequent runs : for each partition : 0. partitionDataframe(delta_df) : Array[delta_df(i to # partitions)] 1. load df from previous hdfs location of that partition 2. filter the above df(p) where p is the partition
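A rough, hedged sketch of one run of the merge described above (Scala, assuming Spark 2.x; key, schema and paths are placeholders, and the window-based dedup is reduced to dropDuplicates for brevity):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("hdfs-upsert-sketch").getOrCreate()

    val basePath  = "hdfs:///warehouse/events"        // previous run's data
    val stagePath = "hdfs:///warehouse/events_next"   // this run's output

    // delta_df: this run's changes, deduplicated on the primary key;
    // it is assumed to share the same schema/column order as the base data.
    val deltaDf  = spark.read.parquet("hdfs:///incoming/events_delta").dropDuplicates("id")
    val existing = spark.read.parquet(basePath)

    // Upsert realised as: drop existing rows whose key appears in the delta,
    // then append the delta (anti-join + union + full rewrite).
    val merged = existing
      .join(deltaDf.select(col("id")), Seq("id"), "left_anti")
      .union(deltaDf)

    // Write to a new location and swap/repoint afterwards; overwriting the
    // path that is still being read would corrupt the source mid-job.
    merged.write.mode("overwrite").parquet(stagePath)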

how to copy local files to hdfs quickly?

2016-07-27 Thread Andy Davidson
I have a Spark streaming app that saves JSON files to s3://. It works fine. Now I need to calculate some basic summary stats and am running into horrible performance problems. I want to run a test to see if reading from HDFS instead of S3 makes a difference. I am able to quickly copy the data
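For the copy itself, a small Hadoop FileSystem sketch in Scala (paths are placeholders); plain `hdfs dfs -put` or DistCp from the command line does the same job.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Copy the locally staged JSON directory into HDFS before pointing the
    // summary-stats job at hdfs:// instead of s3://.
    val fs = FileSystem.get(new Configuration())      // default FS = HDFS on the cluster
    fs.copyFromLocalFile(
      false,                                          // delSrc: keep the local copy
      true,                                           // overwrite existing files
      new Path("file:///data/local/json"),            // hypothetical local source
      new Path("hdfs:///data/json"))                  // hypothetical HDFS target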

Re: Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Cody Koeninger
0 messages from one > topic of Kafka, did some transformation, and wrote the result back to > another topic. But only found 988 messages in the second topic. I checked > log info and confirmed all messages were received by receivers. But I found an > HDFS writing time out

Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Andy Zhao
found an HDFS writing time out message printed from class BatchedWriteAheadLog. I checked out the source code and found code like this: /** Add received block. This event will get written to the write ahead log (if enabled). */ def addBlock(receivedBlockInfo: ReceivedBlockInfo
