Re: Re: --jars option using HDFS jars has no effect when Spark standalone deploy mode is cluster

2015-11-03 Thread our...@cnsuning.com
Subject: Re: --jars option using HDFS jars has no effect when Spark standalone deploy mode is cluster Can you try putting the jar locally without HDFS? Thanks Best Regards On Wed, Oct 28, 2015 at 8:40 AM, our...@cnsuning.com <our...@cnsuning.com> wrote: hi all, when using c

Re: --jars option using HDFS jars has no effect when Spark standalone deploy mode is cluster

2015-11-02 Thread Akhil Das
Can you try putting the jar locally without HDFS? Thanks Best Regards On Wed, Oct 28, 2015 at 8:40 AM, our...@cnsuning.com <our...@cnsuning.com> wrote: > hi all, > when using command: > spark-submit *--deploy-mode cluster --jars > hdfs:///user/spark/c

Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-11-01 Thread Akhil Das
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since you want to store the Status objects; for every batch it will create a directory under /status (the name will mostly be the timestamp). Since the data is small (hardly a couple of MBs for a 1-sec interval) it will not
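
For illustration, a minimal Scala sketch of that approach, assuming Twitter OAuth credentials are already set as system properties and that sc is an existing SparkContext; the 1-second batch interval and the HDFS prefix are placeholders following the snippet above:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc = new StreamingContext(sc, Seconds(1))
    val tweets = TwitterUtils.createStream(ssc, None)   // DStream of twitter4j.Status

    // One directory per batch is created under the prefix (named by batch time);
    // the Status objects are written out as serialized sequence files.
    tweets.saveAsObjectFiles("hdfs://sigmoid/twitter/status")

    ssc.start()
    ssc.awaitTermination()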

Re: java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-11-01 Thread Akhil Das
How are you submitting your job? You need to make sure HADOOP_CONF_DIR is pointing to your Hadoop configuration directory (with the core-site.xml and hdfs-site.xml files). If you have them set properly, then make sure you are giving the full HDFS URL, like: dStream.saveAsTextFiles("hdfs://sigmoid-cl
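
A short sketch of that advice, assuming HADOOP_CONF_DIR points at the directory containing core-site.xml and hdfs-site.xml, and that dStream is the DStream[String] from the question; the host, port and paths are placeholders:

    // Option 1: rely on fs.defaultFS from core-site.xml, picked up via HADOOP_CONF_DIR
    dStream.saveAsTextFiles("/user/spark/tweets", "txt")

    // Option 2: give the fully qualified HDFS URL (host and port are placeholders)
    // dStream.saveAsTextFiles("hdfs://namenode:8020/user/spark/tweets", "txt")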

--jars option using HDFS jars has no effect when Spark standalone deploy mode is cluster

2015-10-27 Thread our...@cnsuning.com
hi all, when using the command: spark-submit --deploy-mode cluster --jars hdfs:///user/spark/cypher.jar --class com.suning.spark.jdbc.MysqlJdbcTest hdfs:///user/spark/MysqlJdbcTest.jar the program throws an exception that it cannot find the class in cypher.jar, and the driver log shows no --jars
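
For reference, a sketch of the workaround suggested in the replies above: distribute cypher.jar to the same local path on every node and point --jars at that local copy instead of HDFS. The /opt/jars path is a placeholder, not something from the original thread:

    # cypher.jar copied beforehand to /opt/jars/ on every node
    spark-submit --deploy-mode cluster \
      --jars /opt/jars/cypher.jar \
      --class com.suning.spark.jdbc.MysqlJdbcTest \
      hdfs:///user/spark/MysqlJdbcTest.jar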

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
t; wrote: >> >>> >>> > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote: >>> > >>> > Replication factor is 3 and we have 18 data nodes. We check HDFS >>> webUI, data is evenly distributed among 18 machines. >>>

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
> Hi, I find that loading files from HDFS can incur huge amount of network > traffic. Input size is 90G and network traffic is about 80G. By my > understanding, local files should be read and thus no network communication > is needed. > > I use Spark 1.5.1, and the following is

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Hm, how about the opposite question -- do you have just 1 executor? then again everything will be remote except for a small fraction of blocks. On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li <liji...@gmail.com> wrote: > Replication factor is 3 and we have 18 data nodes. We check HDFS webU

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
n everything will be remote except for a small fraction of blocks. > > On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li <liji...@gmail.com> wrote: > >> Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, >> data is evenly distributed among 18 machines. >> >&g

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
wrote: > > > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote: > > > > Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, > data is evenly distributed among 18 machines. > > > > > every block in HDFS (usually 64-128-256

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
> On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote: > > Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, data > is evenly distributed among 18 machines. > every block in HDFS (usually 64-128-256 MB) is distributed across three

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, data is evenly distributed among 18 machines. On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote: > Have a look at your HDFS replication, and where the blocks are for these > files. For example

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
hortonworks.com> > wrote: > >> >> > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote: >> > >> > Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, >> data is evenly distributed among 18 machines. >>

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
i <liji...@gmail.com> wrote: > >> Hi, I find that loading files from HDFS can incur huge amount of network >> traffic. Input size is 90G and network traffic is about 80G. By my >> understanding, local files should be read and thus no network communication >> is needed. >

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Have a look at your HDFS replication, and where the blocks are for these files. For example, if you had only 2 HDFS data nodes, then data would be remote to 16 of 18 workers and always entail a copy. On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote: > I cat /pro

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
> not all executors are local to all data. That can be the situation in many >> cases but not always. >> >> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote: >> >>> Hi, I find that loading files from HDFS can incur huge amount of network &g

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
>>> >>>> >>>> > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote: >>>> > >>>> > Replication factor is 3 and we have 18 data nodes. We check HDFS >>>> webUI, data is evenly distributed among 18 machines.

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
. Or are you sure it's not? HDFS stats are really the general filesystem stats: they measure data through the input and output streams, not whether they were to/from local or remote systems. Fixable, and metrics are always good, though as Hadoop (currently) uses Hadoop metrics 2, not the codahal

java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-10-24 Thread Andy Davidson
not a good cluster solution. Any idea how I can configure spark so that it will write the output to hdfs? JavaDStream tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth); DStream dStream = tweets.dstream(); String prefix = "MyPrefix"; String suffix =

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread morfious902002
I have a spark job that creates 6 million rows in RDDs. I convert the RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. I am using spark 1.5.1 with YARN. Here is the snippet:- RDDList.parallelStream().forEach(mapJavaRDD
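
For comparison, a minimal Scala sketch of writing a DataFrame out as Parquet with the 1.5-era API, driving the write from a single job rather than from many driver-side threads (the parallelStream().forEach pattern quoted above is commonly associated with the "spark.sql.execution.id is already set" error). The schema, row count and paths below are placeholders, not the poster's actual data:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // rowRDD stands in for the 6 million rows built upstream
    val rowRDD = sc.parallelize(1L to 6000000L).map(i => ("key-" + i, i))
    val df = rowRDD.toDF("key", "value")

    // One write call from a single job: Parquet's columnar, compressed layout keeps
    // the output small, and coalesce limits the number of output files.
    df.coalesce(48).write.parquet("hdfs:///user/spark/output/rows")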

streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-10-23 Thread Andy Davidson
I need to save the twitter status I receive so that I can do additional batch based processing on them in the future. Is it safe to assume HDFS is the best way to go? Any idea what is the best way to save twitter status to HDFS? JavaStreamingContext ssc = new JavaStreamingContext(jsc

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread Anubhav Agarwal
I have a spark job that creates 6 million rows in RDDs. I convert the RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. Here is the snippet:- RDDList.parallelStream().forEach(mapJavaRDD -> { if (mapJavaRDD != n

Re: Incremental load of RDD from HDFS?

2015-10-22 Thread Chris Spagnoli
in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incremental-load-of-RDD-from-HDFS-tp25145p25166.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Akhil Das
Convert your data to parquet, it saves space and time. Thanks Best Regards On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 <ahaid...@hawk.iit.edu> wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unroll

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Igor Berman
check spark.rdd.compress On 19 October 2015 at 21:13, ahaider3 <ahaid...@hawk.iit.edu> wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the data unco
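
A small sketch of that setting in the 1.x configuration style; note spark.rdd.compress only affects serialized storage levels, so the persist level below matters. The app name and path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // spark.rdd.compress only applies to serialized storage levels,
    // so pair it with MEMORY_ONLY_SER (or MEMORY_AND_DISK_SER).
    val conf = new SparkConf()
      .setAppName("compressed-cache")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///data/")
    data.persist(StorageLevel.MEMORY_ONLY_SER)
    data.count()  // materialize the cache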

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Adnan Haider
<igor.ber...@gmail.com> wrote: > check spark.rdd.compress > > On 19 October 2015 at 21:13, ahaider3 <ahaid...@hawk.iit.edu> wrote: > >> Hi, >> A lot of the data I have in HDFS is compressed. I noticed when I load this >> data into spark and cache it, Spark

Incremental load of RDD from HDFS?

2015-10-20 Thread Chris Spagnoli
I am new to Spark, and this user community, so my apologies if this was answered elsewhere and I missed it (I did try searching first). We have multiple large RDDs stored across HDFS via Spark (by calling pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a given RDD
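
One illustrative possibility for keeping such re-loads selective, shown as a sketch rather than the thread's answer: the Hadoop-based read methods accept globs and comma-separated path lists, so only the files of interest are touched. The key/value types are assumed to be Text here; the real Writable classes and file format depend on how saveAsNewAPIHadoopFile was invoked, and the paths are placeholders:

    import org.apache.hadoop.io.Text

    // Globs and comma-separated lists restrict the load to chosen directories.
    val subset = sc.sequenceFile(
      "hdfs:///data/rdd-2015-10-1[89]/part-*,hdfs:///data/rdd-2015-10-20/part-*",
      classOf[Text], classOf[Text])

    // Copy out of the reused Writable instances before caching or collecting.
    val pairs = subset.map { case (k, v) => (k.toString, v.toString) }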

Re: Incremental load of RDD from HDFS?

2015-10-20 Thread Ali Tajeldin EDU
ark, and this user community, so my apologies if this was > answered elsewhere and I missed it (I did try search first). > > We have multiple large RDDs stored across a HDFS via Spark (by calling > pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a > given

Storing Compressed data in HDFS into Spark

2015-10-19 Thread ahaider3
Hi, A lot of the data I have in HDFS is compressed. I noticed when I load this data into spark and cache it, Spark unrolls the data like normal but stores the data uncompressed in memory. For example, suppose /data/ is an RDD with compressed partitions on HDFS. I then cache the data. When I call

Accessing HDFS HA from spark job (UnknownHostException error)

2015-10-16 Thread kyarovoy
I have an Apache Mesos 0.22.1 cluster (3 masters & 5 slaves), running Cloudera HDFS (2.5.0-cdh5.3.1) in an HA configuration and the Spark 1.5.1 framework. When I try to spark-submit the compiled HdfsTest.scala example app (from the Spark 1.5.1 sources), it fails with "java.lang.IllegalArgument

Data skipped while writing Spark Streaming output to HDFS

2015-10-12 Thread Sathiskumar
I'm running a Spark Streaming application with a 10-second batch interval; its job is to consume data from Kafka, transform it and store it into HDFS based on the key, i.e., a file per unique key. I'm using Hadoop's saveAsHadoopFile() API to store the output, and I see that a file gets generated for every

Re: Data skipped while writing Spark Streaming output to HDFS

2015-10-12 Thread Shixiong Zhu
nsume data from kafka, transform it and store it into HDFS based on the > key. i.e, a file per unique key. I'm using the Hadoop's saveAsHadoopFile() > API to store the output, I see that a file gets generated for every unique > key, but the issue is that only one row gets stored for each of

How can I read file from HDFS i sparkR from RStudio

2015-10-08 Thread Amit Behera
Hi All, I am very new to SparkR. I am able to run sample code from the example given in the link: http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/ Then I am trying to read a file from HDFS in RStudio, but I am unable to read it. Below is my code

RE: How can I read file from HDFS i sparkR from RStudio

2015-10-08 Thread Sun, Rui
Amit, sqlContext <- sparkRSQL.init(sc) peopleDF <- read.df(sqlContext, "hdfs://master:9000/sears/example.csv") have you restarted the R session in RStudio between the two lines? From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Thursday, October 8, 2015 5:59 PM To: user@

ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread Vinoth Sankar
I'm just reading data from HDFS through Spark. It throws *java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.BytesWritable* at line no 6. I never used LongWritable in my code, no idea how the data was in that format. Note : I'm not using
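
That exception usually means the key/value classes passed to the read call don't match what is actually stored in the file. A hedged sketch of reading a SequenceFile whose keys really are LongWritable and whose values are BytesWritable; the path is a placeholder, and the real types should be confirmed against the file first:

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    // Declare the Writable types that the file actually contains;
    // mismatched classes surface as ClassCastException at runtime.
    val raw = sc.sequenceFile(
      "hdfs:///data/input",
      classOf[LongWritable], classOf[BytesWritable])

    // Copy out of the reused Writable instances before collecting or caching.
    val bytes = raw.map { case (key, value) => (key.get, value.copyBytes) }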

Re: ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread UMESH CHAUDHARY
wrote: > I'm just reading data from HDFS through Spark. It throws > *java.lang.ClassCastException: > org.apache.hadoop.io.LongWritable cannot be cast to > org.apache.hadoop.io.BytesWritable* at line no 6. I never used > LongWritable in my code, no idea how the data was in that format. >

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
One. I read in LZO compressed files from HDFS, perform a map operation, cache the results of this map operation, and call saveAsHadoopFile to write LZO back to HDFS. Without the cache, the job will stall. mn > On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com&

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
reading from HDFS? One. I read in LZO compressed files from HDFS Perform a map operation cache the results of this map operation call saveAsHadoopFile to write LZO back to HDFS. Without the cache, the job will stall. mn > On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com&

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
spark.apache.org > Subject: Re: laziness in textFile reading from HDFS? > > One. > > I read in LZO compressed files from HDFS Perform a map operation cache the > results of this map operation call saveAsHadoopFile to write LZO back to HDFS. > > Without the cache, th

RE: laziness in textFile reading from HDFS?

2015-10-06 Thread Mohammed Guller
-hadoop-throws-exception-for-large-lzo-files Mohammed -Original Message- From: Matt Narrell [mailto:matt.narr...@gmail.com] Sent: Tuesday, October 6, 2015 4:08 PM To: Mohammed Guller Cc: davidkl; user@spark.apache.org Subject: Re: laziness in textFile reading from HDFS? Agreed. This is spark

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Jonathan Coveney
hadoop-throws-exception-for-large-lzo-files > > > Mohammed > > > -Original Message- > From: Matt Narrell [mailto:matt.narr...@gmail.com <javascript:;>] > Sent: Tuesday, October 6, 2015 4:08 PM > To: Mohammed Guller > Cc: davidkl; user@spark.apache.org

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Jonathan Coveney
hadoop-throws-exception-for-large-lzo-files > > > Mohammed > > > -Original Message- > From: Matt Narrell [mailto:matt.narr...@gmail.com <javascript:;>] > Sent: Tuesday, October 6, 2015 4:08 PM > To: Mohammed Guller > Cc: davidkl; user@spark.apache.org

RE: laziness in textFile reading from HDFS?

2015-10-05 Thread Mohammed Guller
: laziness in textFile reading from HDFS? Is there any more information or best practices here? I have the exact same issues when reading large data sets from HDFS (larger than available RAM) and I cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER, and using nearly all

Re: HDFS small file generation problem

2015-10-03 Thread nibiau
ndredi 2 Octobre 2015 18:37:22 Objet: Re: HDFS small file generation problem Ok thanks, but can I also update data instead of insert data ? - Mail original - De: "Brett Antonides" <banto...@gmail.com> À: user@spark.apache.org Envoyé: Vendredi 2 Octobre 2015 18:18:18 Objet

Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Hello, So, is Hive a solution for my need: - I receive small messages (10KB) identified by ID (product ID for example) - Each message I receive is the latest picture of my product ID, so I basically just want to store the latest picture of each product inside HDFS in order to process batches on it later

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ted Yu
bq. val dist = sc.parallelize(l) Following the above, can you call, e.g. count() on dist before saving ? Cheers On Fri, Oct 2, 2015 at 1:21 AM, jarias <ja...@elrocin.es> wrote: > Dear list, > > I'm experimenting a problem when trying to write any RDD to HDFS. I've > tr
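
Expanding that suggestion into a runnable sketch (the output path is a placeholder); if count() returns 5 but the saved folder is still empty, attention shifts from the RDD to the HDFS/datanode side:

    val l = Seq(1, 2, 3, 4, 5)
    val dist = sc.parallelize(l)

    // Force an action first: if count() works, the executors can compute the
    // partitions, and the write path becomes the remaining suspect.
    println(dist.count())

    dist.saveAsTextFile("hdfs:///user/spark/test-output")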

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
: > Hello, > So, does Hive is a solution for my need : > - I receive small messages (10KB) identified by ID (product ID for example) > - Each message I receive is the last picture of my product ID, so I just > want basically to store last picture products inside HDFS > in order to p

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
eceive is the last picture of my product ID, so I just > want basically to store last picture products inside HDFS > in order to process batch on it later. > > If I use Hive I suppose I have to use INSERT and UPDATE records and > periodically CONCATENATE. > After a CONCATENATE I sup

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
fied by ID (product ID for >> example) >> - Each message I receive is the last picture of my product ID, so I just >> want basically to store last picture products inside HDFS >> in order to process batch on it later. >> >> If I use Hive I suppose I have to

Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
; Nicolas > > - Mail original - > De: nib...@free.fr > À: "Brett Antonides" <banto...@gmail.com> > Cc: user@spark.apache.org > Envoyé: Vendredi 2 Octobre 2015 18:37:22 > Objet: Re: HDFS small file generation problem > > Ok thanks, but can I also upda

RE : Re: HDFS small file generation problem

2015-10-03 Thread nibiau
xample) - Each message I receive is the last picture of my product ID, so I just want basically to store last picture products inside HDFS in order to process batch on it later. If I use Hive I suppose I have to use INSERT and UPDATE records and periodically CONCATENATE. After a CONCATENATE

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread nibiau
Thanks a lot, why did you say "the most recent version"? - Mail original - De: "Jörn Franke" <jornfra...@gmail.com> À: "nibiau" <nib...@free.fr> Cc: banto...@gmail.com, user@spark.apache.org Envoyé: Samedi 3 Octobre 2015 13:56:43 Objet: Re: RE : Re:

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Jacinto Arias
2015 at 1:21 AM, jarias <ja...@elrocin.es > <mailto:ja...@elrocin.es>> wrote: > Dear list, > > I'm experimenting a problem when trying to write any RDD to HDFS. I've tried > with minimal examples, scala programs and pyspark programs both in local and > cluster mode

Re: RE : Re: HDFS small file generation problem

2015-10-03 Thread Jörn Franke
@spark.apache.org > Envoyé: Samedi 3 Octobre 2015 13:56:43 > Objet: Re: RE : Re: HDFS small file generation problem > > > > Yes the most recent version yes, or you can use phoenix on top of hbase. I > recommend to try out both and see which one is the most suitable. > > &

Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ajay Chander
Hi Jacin, If I were you, the first thing I would do is write a sample Java application to write data into HDFS and see if it's working fine. Metadata is being created in HDFS; that means communication to the namenode is working fine but not to the datanodes, since you don't see any data inside the file

Re: laziness in textFile reading from HDFS?

2015-10-03 Thread Matt Narrell
Is there any more information or best practices here? I have the exact same issues when reading large data sets from HDFS (larger than available RAM) and I cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER, and using nearly all the cluster resources. Should I

saveAsTextFile creates an empty folder in HDFS

2015-10-02 Thread jarias
Dear list, I'm experiencing a problem when trying to write any RDD to HDFS. I've tried with minimal examples, scala programs and pyspark programs both in local and cluster modes and as standalone applications or shells. My problem is that when invoking the write command, a task is executed

Re: HDFS small file generation problem

2015-10-02 Thread nibiau
Hello, Yes, but: - In the Java API I don't find an API to create an HDFS archive - As soon as I receive a message (with messageID) I need to replace the old existing file with the new one (the file name being the messageID); is that possible with an archive? Tks Nicolas - Mail original - De

Re: HDFS small file generation problem

2015-10-02 Thread Brett Antonides
to merge your many small files into larger files optimized for your HDFS block size * Since the CONCATENATE command operates on files in place it is transparent to any downstream processing Cheers, Brett On Fri, Oct 2, 2015 at 3:48 PM, <nib...@free.fr> wrote: > Hel

Re: SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-02 Thread Michael Armbrust
Once you convert your data to a dataframe (look at spark-csv), try df.write.partitionBy("yyyy", "mm").save("..."). On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram < haridass.saisri...@gmail.com> wrote: > Hi, > > I am trying to find a simple ex
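
A slightly fuller sketch of that suggestion, assuming the spark-csv package is on the classpath and that the columns are named as in the original question; the paths are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Read the CSV with a header row: a,b,c,yyyy,mm
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///path/to/input.csv")

    // Produces one output directory per year/month combination,
    // e.g. .../yyyy=2015/mm=09/
    df.write
      .partitionBy("yyyy", "mm")
      .format("parquet")
      .save("hdfs:///path/to/output")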

Re: HDFS small file generation problem

2015-10-02 Thread nibiau
Ok thanks, but can I also update data instead of insert data ? - Mail original - De: "Brett Antonides" <banto...@gmail.com> À: user@spark.apache.org Envoyé: Vendredi 2 Octobre 2015 18:18:18 Objet: Re: HDFS small file generation problem I had a very similar pr

SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-01 Thread haridass saisriram
Hi, I am trying to find a simple example to read a data file on HDFS. The file has the following format a, b, c, yyyy, mm a1,b1,c1,2015,09 a2,b2,c2,2014,08 I would like to read this file and store it in HDFS partitioned by year and month. Something like this /path/to/hdfs/yyyy/mm I want

Re: Reading kafka stream and writing to hdfs

2015-09-30 Thread Akhil Das
Like: counts.saveAsTextFiles("hdfs://host:port/some/location") Thanks Best Regards On Tue, Sep 29, 2015 at 2:15 AM, Chengi Liu <chengi.liu...@gmail.com> wrote: > Hi, > I am going thru this example here: > > https://github.com/apache/spark/blob/master/exampl

Self Join reading the HDFS blocks TWICE

2015-09-29 Thread Data Science Education
tition_key < '2015-07-01' GROUP BY KEY1 ,KEY2 ) TAB2 ON TAB1.KEY1= TAB2.KEY1AND TAB1.KEY2= TAB2.KEY1 WHERE partition_key >= '2015-01-01' and partition_key < '2015-07-01' GROUP BY TAB1.KEY1, TAB1.KEY2""") I see that ~18,000 HDFS blocks are read TWICE and then the Shuffle

RE: laziness in textFile reading from HDFS?

2015-09-29 Thread Mohammed Guller
1) It is not required to have the same amount of memory as data. 2) By default the number of partitions is equal to the number of HDFS blocks 3) Yes, the read operation is lazy 4) It is okay to have more partitions than cores. Mohammed -Original Message- From: davidkl
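
A small sketch illustrating points 2-4 above; the path, partition count, and storage level are placeholders chosen for the larger-than-RAM case discussed in this thread:

    import org.apache.spark.storage.StorageLevel

    // Nothing is read yet: textFile only records the lineage. By default there is
    // one partition per HDFS block; a higher minimum can be requested.
    val lines = sc.textFile("hdfs:///data/big.log", minPartitions = 400)

    // For data larger than cluster RAM, a serialized memory-and-disk level lets
    // partitions spill to disk instead of being recomputed.
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The read actually happens here, partition by partition, as tasks run.
    println(lines.count())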

Reading kafka stream and writing to hdfs

2015-09-28 Thread Chengi Liu
Hi, I am going thru this example here: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py If I want to write this data to HDFS, what's the right way to do this? Thanks

Re: HDFS small file generation problem

2015-09-28 Thread Jörn Franke
I have to store them inside HDFS in order to treat them by PIG > jobs on-demand. > The problem is the fact that I generate a lot of small files in HDFS > (several millions) and it can be problematic. > I investigated to use Hbase or Archive file but I don't want to do it > fin

laziness in textFile reading from HDFS?

2015-09-28 Thread davidkl
to read an HDFS folder (containing multiple files), I understand that the number of partitions created is equal to the number of HDFS blocks, correct? Are those created in a lazy way? I mean, if the number of blocks/partitions is larger than the number of cores/threads the Spark driver was launched

Re: HDFS is undefined

2015-09-28 Thread Akhil Das
application. > > I have installed the cloudera manager. > it includes the spark version 1.2.0 > > > But now i want to use spark version 1.4.0. > > its also working fine. > > But when i try to access the HDFS in spark 1.4.0 in eclipse i am getting > the fo

Re: HDFS is undefined

2015-09-28 Thread Ted Yu
But now i want to use spark version 1.4.0. > > its also working fine. > > But when i try to access the HDFS in spark 1.4.0 in eclipse i am getting the > following error. > > "Exception in thread "main" java.nio.file.FileSystemNotFoundException: &g

Re: HDFS small file generation problem

2015-09-27 Thread ayan guha
I would suggest not writing small files to HDFS. Rather, you can hold them in memory, maybe off-heap, and then flush them to HDFS using another job, similar to https://github.com/ptgoetz/storm-hdfs (not sure if spark already has something like it) On Sun, Sep 27, 2015 at 11:36 PM, <

HDFS small file generation problem

2015-09-27 Thread nibiau
Hello, I'm still investigating the small file generation problem caused by my Spark Streaming jobs. Indeed, my Spark Streaming jobs are receiving a lot of small events (avg 10kb), and I have to store them inside HDFS in order to process them with PIG jobs on-demand. The problem is the fact that I

Re: HDFS small file generation problem

2015-09-27 Thread Deenar Toraskar
You could try a couple of things: a) use Kafka for stream processing, store current incoming events and spark streaming job output in Kafka rather than on HDFS and dual write to HDFS too (in a micro batched mode), so every x minutes. Kafka is more suited to processing lots of small events/ b

HDFS is undefined

2015-09-25 Thread Angel Angel
hello, I am running a spark application. I have installed the cloudera manager; it includes spark version 1.2.0. But now I want to use spark version 1.4.0, which is also working fine. But when I try to access HDFS in spark 1.4.0 in eclipse I am getting the following error. "Exce

Re: Cache after filter Vs Writing back to HDFS

2015-09-22 Thread Akhil Das
Instead of .map you can try doing a .mapPartitions and see the performance. Thanks Best Regards On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote: > For a large dataset, I want to filter out something and then do the > computing intensive work. > > What I am doing now: >
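
A hedged sketch of that .mapPartitions variant; the input, filter rule, and the expensive per-record function below are stand-ins for the original poster's code, not real API:

    // Placeholder input and filter; the real filter rule and expensive
    // per-record function come from the original job.
    val data = sc.textFile("hdfs:///data/input")
    val kept = data.filter(_.contains("ERROR"))

    // mapPartitions lets per-partition setup (connections, parsers, buffers)
    // happen once per partition instead of once per record.
    val results = kept.mapPartitions { iter =>
      val pattern = """\d+""".r          // stand-in for expensive setup
      iter.map(line => pattern.findAllIn(line).length)
    }
    results.count()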

Cache after filter Vs Writing back to HDFS

2015-09-17 Thread Gavin Yue
For a large dataset, I want to filter out something and then do the computing intensive work. What I am doing now: Data.filter(somerules).cache() Data.count() Data.map(timeintensivecompute) But this sometimes takes an unusually long time due to cache misses and recalculation. So I changed to

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Marcelo Vanzin
n: nameservice1 > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310) This looks like you're trying to connect to an HA HDFS service but you have not provided the proper hdfs-site.xml for your app; t
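
One common remedy, sketched below as an assumption-laden example rather than the thread's confirmed fix: make the client-side HA configuration visible to the driver and executors by pointing HADOOP_CONF_DIR at a directory containing the hdfs-site.xml that defines the nameservice. The paths, master URL and jar name are placeholders:

    # conf/spark-env.sh on the submitting node (and available where executors run)
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # must contain dfs.nameservices=nameservice1
                                              # plus the dfs.ha.namenodes.* / rpc-address entries

    # then address HDFS through the nameservice, not a single namenode host
    spark-submit --master mesos://zk://... --class org.apache.spark.examples.HdfsTest \
      spark-examples.jar hdfs://nameservice1/user/test/input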

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Adrian Bridgett
Hi Sam, in short, no, it's a traditional install as we plan to use spot instances and didn't want price spikes to kill off HDFS. We're actually doing a bit of a hybrid, using spot instances for the mesos slaves, on-demand for the mesos masters. So for the time being, putting hdfs

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Steve Loughran
> On 15 Sep 2015, at 08:55, Adrian Bridgett <adr...@opensignal.com> wrote: > > Hi Sam, in short, no, it's a traditional install as we plan to use spot > instances and didn't want price spikes to kill off HDFS. > > We're actually doing a bit of a hybrid, using spot

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Iulian Dragoș
I've seen similar traces, but couldn't track down the failure completely. You are using Kerberos for your HDFS cluster, right? AFAIK Kerberos isn't supported in Mesos deployments. Can you resolve that host name (nameservice1) from the driver machine (ping nameservice1)? Can it be resolved from

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Adrian Bridgett
can rebuild the data as well). OTOH this would mainly only be beneficial if spark/mesos understood the data locality which is probably some time off (we don't need this ability now). Indeed, the error we are seeing is orthogonal to the setup - however my understanding of ha-hdfs

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread Akhil Das
cluster. Can I connect to that from my > local sparkR in RStudio? if yes , how ? > > Can I read files which I have saved as parquet files on hdfs or s3 in > sparkR ? If yes , How? > > Thanks > -Roni > >

hdfs-ha on mesos - odd bug

2015-09-14 Thread Adrian Bridgett
I'm hitting an odd issue with running spark on mesos together with HA-HDFS, with an even odder workaround. In particular I get an error that it can't find the HDFS nameservice unless I put in a _broken_ url (discovered that workaround by mistake!). core-site.xml, hdfs-site.xml is distributed

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread roni
r). > > Thanks > Best Regards > > On Thu, Sep 10, 2015 at 11:20 PM, roni <roni.epi...@gmail.com> wrote: > >> I have spark installed on a EC2 cluster. Can I connect to that from my >> local sparkR in RStudio? if yes , how ? >> >> Can I read files which I h

Re: hdfs-ha on mesos - odd bug

2015-09-14 Thread Sam Bessalah
I don't know about the broken url. But are you running HDFS as a mesos framework? If so is it using mesos-dns? Then you should resolve the namenode via hdfs:/// On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett <adr...@opensignal.com> wrote: > I'm hitting an odd issue with runn

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread Ewan Leith
The last time I checked, if you launch EMR 4 with only Spark selected as an application, HDFS isn't correctly installed. Did you select another application like Hive at launch time as well as Spark? If not, try that. Thanks, Ewan -- Original message-- From: Dean Wampler Date

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread shahab
r, do you see the file if you type: > > hdfs dfs > -ls > hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar > > (with the correct server address for "ipx-x-x-x"). If not, is the server > address correct

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread Work
Ewan, What issue are you having with HDFS when only Spark is installed? I'm not aware of any issue like this. Thanks, Jonathan On Wed, Sep 9, 2015 at 11:48 PM, Ewan Leith <ewan.le...@realitymine.com> wrote: > The last time I checked, if you lau

connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-10 Thread roni
I have spark installed on an EC2 cluster. Can I connect to that from my local sparkR in RStudio? If yes, how? Can I read files which I have saved as parquet files on hdfs or s3 in sparkR? If yes, how? Thanks -Roni

reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread roni
I am trying this - ddf <- parquetFile(sqlContext, "hdfs:// ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet") and I get path[1]="hdfs:// ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet": No such file or directory when I

RE: reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread Sun, Rui
/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation , https://issues.apache.org/jira/browse/SPARK-7442 From: roni [mailto:roni.epi...@gmail.com] Sent: Friday, September 11, 2015 3:05 AM To: user@spark.apache.org Subject: reading files on HDFS /s3 in sparkR -failing I am trying

[Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread shahab
Hi, I am using Spark on Amazon EMR. So far I have not succeeded in submitting the application, and I am not sure what the problem is. In the log file I see the following. java.io.FileNotFoundException: File does not exist: hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018

Re: Is HDFS required for Spark streaming?

2015-09-09 Thread N B
the StreamingContext) as we don't have a real need for that type >>> of recovery. However, because the application does reduceByKeyAndWindow >>> operations, checkpointing has to be turned on. Do you think this scenario >>> will also only work with HDFS or having local directories suffic

Re: Is HDFS required for Spark streaming?

2015-09-09 Thread Tathagata Das
irectory away first and >>>> re-create the StreamingContext) as we don't have a real need for that type >>>> of recovery. However, because the application does reduceByKeyAndWindow >>>> operations, checkpointing has to be turned on. Do you think this scenario

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread Dean Wampler
If you log into the cluster, do you see the file if you type: hdfs dfs -ls hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar (with the correct server address for "ipx-x-x-x"). If not, is the server address correct an

read compressed hdfs files using SparkContext.textFile?

2015-09-08 Thread shenyan zhen
Hi, For HDFS files written with the code below: rdd.saveAsTextFile(getHdfsPath(...), classOf[org.apache.hadoop.io.compress.GzipCodec]) I can see the hdfs files have been generated: 0 /lz/streaming/am/144173460/_SUCCESS 1.6 M /lz/streaming/am/144173460/part-0.gz 1.6 M /lz
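
For what it's worth, a small sketch of reading those files back: textFile chooses the decompression codec from the file extension, so the .gz part files from the listing above can be read directly. Keep in mind gzip is not splittable, so each .gz file becomes a single partition:

    // Globs work here; each part-*.gz is decompressed on the fly.
    val lines = sc.textFile("hdfs:///lz/streaming/am/144173460/part-*.gz")
    println(lines.count())

    // Repartition after reading if more downstream parallelism is needed.
    val spread = lines.repartition(64)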

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Tathagata Das
e the StreamingContext) as we don't have a real need for that type >> of recovery. However, because the application does reduceByKeyAndWindow >> operations, checkpointing has to be turned on. Do you think this scenario >> will also only work with HDFS or having local directories suffice?

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Cody Koeninger
ay first and > re-create the StreamingContext) as we don't have a real need for that type > of recovery. However, because the application does reduceByKeyAndWindow > operations, checkpointing has to be turned on. Do you think this scenario > will also only work with HDFS or having local dire

Re: Is HDFS required for Spark streaming?

2015-09-05 Thread N B
reduceByKeyAndWindow operations, checkpointing has to be turned on. Do you think this scenario will also only work with HDFS or having local directories suffice? Thanks Nikunj On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com> wrote: > Shuffle spills will use local disk, HDFS n
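
A minimal sketch of the setup being discussed, with a placeholder socket source standing in for the real input stream; the batch, window and slide durations and the checkpoint path are also placeholders. The checkpoint directory needs to be on a fault-tolerant store such as HDFS if recovery across node failures matters; a local directory is only enough for single-machine experiments:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // Window operations keep state across batches, so a checkpoint directory is
    // mandatory; put it on HDFS (or another fault-tolerant store) for recovery.
    ssc.checkpoint("hdfs:///user/spark/checkpoints/myapp")

    val lines = ssc.socketTextStream("localhost", 9999)      // placeholder source
    val events = lines.map(word => (word, 1L))
    val counts = events.reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(30))
    counts.print()

    ssc.start()
    ssc.awaitTermination()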

Re: Small File to HDFS

2015-09-04 Thread Tao Lu
PM, Jörn Franke <jornfra...@gmail.com> wrote: > >> Well it is the same as in normal hdfs, delete file and put a new one with >> the same name works. >> >> Le jeu. 3 sept. 2015 à 21:18, <nib...@free.fr> a écrit : >> >>> HAR archive see
