Subject: Re: --jars option with HDFS jars takes no effect in Spark standalone
cluster deploy mode
Can you try putting the jar locally, without HDFS?
Thanks
Best Regards
On Wed, Oct 28, 2015 at 8:40 AM, our...@cnsuning.com <our...@cnsuning.com>
wrote:
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since
you want to store the Status objects; for every batch it will create a
directory under /status (the name will mostly be the timestamp). Since the data
is small (hardly a couple of MBs for a 1 sec interval) it will not
How are you submitting your job? You need to make sure HADOOP_CONF_DIR is
pointing to your Hadoop configuration directory (with the core-site.xml and
hdfs-site.xml files). If you have them set properly, then make sure you are
giving the full HDFS URL, like:
dStream.saveAsTextFiles("hdfs://sigmoid-cl
hi all,
when using the command:
spark-submit --deploy-mode cluster --jars hdfs:///user/spark/cypher.jar
--class com.suning.spark.jdbc.MysqlJdbcTest hdfs:///user/spark/MysqlJdbcTest.jar
the program throws an exception that it cannot find a class in cypher.jar; the driver
log shows no --jars
> Hi, I find that loading files from HDFS can incur a huge amount of network
> traffic. The input size is 90G and the network traffic is about 80G. By my
> understanding, local files should be read, and thus no network communication
> should be needed.
>
> I use Spark 1.5.1, and the following is
Hm, how about the opposite question -- do you have just 1 executor? Then
again everything will be remote except for a small fraction of blocks.
On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li <liji...@gmail.com> wrote:
> Replication factor is 3 and we have 18 data nodes. We check HDFS webUI,
> data is evenly distributed among 18 machines.
every block in HDFS (usually 64-128-256 MB) is distributed across three
Replication factor is 3 and we have 18 data nodes. We check HDFS webUI,
data is evenly distributed among 18 machines.
On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:
> Have a look at your HDFS replication, and where the blocks are for these
> files. For example
Have a look at your HDFS replication, and where the blocks are for these
files. For example, if you had only 2 HDFS data nodes, then data would be
remote to 16 of 18 workers and always entail a copy.
On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote:
> I cat /pro
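Sean's replication/locality point can be sanity-checked with some back-of-the-envelope arithmetic. A plain-Python sketch (no Spark; the node count, replication factor, and input size come from this thread, but the no-locality scheduling assumption is mine):

```python
# If tasks were scheduled with no data-locality preference at all, what
# fraction of HDFS reads would go over the network?
nodes = 18         # data nodes (from the thread)
replication = 3    # HDFS replication factor (from the thread)
input_gb = 90      # total input size (from the thread)

# A randomly chosen executor holds a local replica of only
# replication/nodes of the blocks.
p_local = replication / nodes
expected_remote_gb = input_gb * (1 - p_local)

print(f"expected remote reads: {expected_remote_gb:.0f} GB of {input_gb} GB")
```

That works out to 75 GB, the same order as the ~80 GB of observed traffic, which is consistent with tasks not being scheduled data-locally.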
> not all executors are local to all data. That can be the situation in many
>> cases but not always.
>>
Or are you sure it's not?
HDFS stats are really the general filesystem stats: they measure data through
the input and output streams, not whether they went to/from local or remote
systems. Fixable, and metrics are always good, though Hadoop (currently)
uses Hadoop metrics2, not the codahal
not a good cluster solution. Any idea how I can configure spark so
that it will write the output to hdfs?
JavaDStream tweets =
TwitterFilterQueryUtils.createStream(ssc, twitterAuth);
DStream dStream = tweets.dstream();
String prefix = "MyPrefix";
String suffix =
I have a Spark job that creates 6 million rows in RDDs. I convert the RDD
into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write
it to HDFS.
I am using spark 1.5.1 with YARN.
Here is the snippet:
RDDList.parallelStream().forEach(mapJavaRDD -> {
    if (mapJavaRDD != n
I need to save the Twitter statuses I receive so that I can do additional
batch-based processing on them in the future. Is it safe to assume HDFS is
the best way to go?
Any idea what is the best way to save Twitter statuses to HDFS?
JavaStreamingContext ssc = new JavaStreamingContext(jsc
Convert your data to parquet, it saves space and time.
Thanks
Best Regards
On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 <ahaid...@hawk.iit.edu> wrote:
> Hi,
> A lot of the data I have in HDFS is compressed. I noticed when I load this
> data into spark and cache it, Spark unroll
check spark.rdd.compress
On 19 October 2015 at 21:13, ahaider3 <ahaid...@hawk.iit.edu> wrote:
> Hi,
> A lot of the data I have in HDFS is compressed. I noticed when I load this
> data into spark and cache it, Spark unrolls the data like normal but stores
> the data unco
I am new to Spark, and this user community, so my apologies if this was
answered elsewhere and I missed it (I did try search first).
We have multiple large RDDs stored across a HDFS via Spark (by calling
pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a
given RDD
Hi,
A lot of the data I have in HDFS is compressed. I noticed when I load this
data into spark and cache it, Spark unrolls the data like normal but stores
the data uncompressed in memory. For example, suppose /data/ is an RDD with
compressed partitions on HDFS. I then cache the data. When I call
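The blow-up described here (compressed partitions on disk, uncompressed blocks in the cache) is easy to see outside Spark. A plain-Python illustration (the sample record and sizes are made up):

```python
import zlib

# Repetitive records, as log/CSV data often is.
row = "2015-10-19,sensor-42,temperature,21.5\n"
uncompressed = (row * 100_000).encode()      # what sits in memory when cached deserialized
compressed = zlib.compress(uncompressed)     # roughly what sits in HDFS

print(f"on disk (compressed):     {len(compressed):,} bytes")
print(f"in memory (uncompressed): {len(uncompressed):,} bytes")
print(f"blow-up factor:           {len(uncompressed) / len(compressed):.0f}x")
```

With data like this the in-memory copy can be one or two orders of magnitude larger, which is why serialized storage levels plus spark.rdd.compress come up in threads like this one.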
I have Apache Mesos 0.22.1 cluster (3 masters & 5 slaves), running Cloudera
HDFS (2.5.0-cdh5.3.1) in HA configuration and Spark 1.5.1 framework.
When I try to spark-submit compiled HdfsTest.scala example app (from Spark
1.5.1 sources) - it fails with "java.lang.IllegalArgument
I'm running a Spark Streaming application every 10 seconds; its job is to
consume data from Kafka, transform it, and store it into HDFS based on the
key, i.e., a file per unique key. I'm using Hadoop's saveAsHadoopFile()
API to store the output. I see that a file gets generated for every unique
key, but the issue is that only one row gets stored for each of
Hi All,
I am very new to SparkR.
I am able to run a sample code from the example given in the link:
http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/
Then I am trying to read a file from HDFS in RStudio, but I am unable to read it.
Below is my code
Amit,
sqlContext <- sparkRSQL.init(sc)
peopleDF <- read.df(sqlContext, "hdfs://master:9000/sears/example.csv")
have you restarted the R session in RStudio between the two lines?
From: Amit Behera [mailto:amit.bd...@gmail.com]
Sent: Thursday, October 8, 2015 5:59 PM
To: user@
I'm just reading data from HDFS through Spark. It throws
*java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.BytesWritable* at line no 6. I never used LongWritable
in my code; I have no idea how the data came to be in that format.
Note: I'm not using
One.
I read in LZO compressed files from HDFS
Perform a map operation
cache the results of this map operation
call saveAsHadoopFile to write LZO back to HDFS.
Without the cache, the job will stall.
mn
> On Oct 5, 2015, at 7:25 PM, Mohammed Guller <moham...@glassbeam.com&
-hadoop-throws-exception-for-large-lzo-files
Mohammed
-Original Message-
From: Matt Narrell [mailto:matt.narr...@gmail.com]
Sent: Tuesday, October 6, 2015 4:08 PM
To: Mohammed Guller
Cc: davidkl; user@spark.apache.org
Subject: Re: laziness in textFile reading from HDFS?
Agreed. This is spark
Hello,
So, is Hive a solution for my need:
- I receive small messages (10KB) identified by an ID (product ID, for example)
- Each message I receive is the latest picture of my product ID, so I basically
just want to store the latest product pictures inside HDFS
in order to process batches on them later
bq. val dist = sc.parallelize(l)
Following the above, can you call, e.g. count() on dist before saving ?
Cheers
On Fri, Oct 2, 2015 at 1:21 AM, jarias <ja...@elrocin.es> wrote:
> Dear list,
>
> I'm experimenting a problem when trying to write any RDD to HDFS. I've
> tr
; Nicolas
>
> - Mail original -
> De: nib...@free.fr
> À: "Brett Antonides" <banto...@gmail.com>
> Cc: user@spark.apache.org
> Envoyé: Vendredi 2 Octobre 2015 18:37:22
> Objet: Re: HDFS small file generation problem
>
> Ok thanks, but can I also upda
If I use Hive I suppose I have to use INSERT and UPDATE records and
periodically CONCATENATE.
After a CONCATENATE
Thanks a lot; why did you say "the most recent version"?
- Original Mail -
From: "Jörn Franke" <jornfra...@gmail.com>
To: "nibiau" <nib...@free.fr>
Cc: banto...@gmail.com, user@spark.apache.org
Sent: Saturday 3 October 2015 13:56:43
Subject: Re: RE : Re:
@spark.apache.org
> Sent: Saturday 3 October 2015 13:56:43
> Subject: Re: RE : Re: HDFS small file generation problem
>
> Yes, the most recent version, or you can use Phoenix on top of HBase. I
> recommend trying out both and seeing which one is the most suitable.
Hi Jacin,
If I were you, the first thing I would do is write a sample Java
application that writes data into HDFS and see if it works fine. Metadata
is being created in HDFS; that means communication with the namenode is working
fine, but not with the datanodes, since you don't see any data inside the file
Is there any more information or best practices here? I have the exact same
issues when reading large data sets from HDFS (larger than available RAM) and I
cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER,
and using nearly all the cluster resources.
Should I
Dear list,
I'm experiencing a problem when trying to write any RDD to HDFS. I've tried
with minimal examples, scala programs and pyspark programs both in local and
cluster modes and as standalone applications or shells.
My problem is that when invoking the write command, a task is executed
Hello,
Yes, but:
- In the Java API I don't find an API to create an HDFS archive
- As soon as I receive a message (with a messageID) I need to replace the old
existing file with the new one (the file name being the messageID); is that
possible with an archive?
Tks
Nicolas
- Original Mail -
From
to merge
your many small files into larger files optimized for your HDFS block size
* Since the CONCATENATE command operates on files in place it is
transparent to any downstream processing
Cheers,
Brett
On Fri, Oct 2, 2015 at 3:48 PM, <nib...@free.fr> wrote:
> Hel
Once you convert your data to a dataframe (look at spark-csv), try
df.write.partitionBy("yyyy", "mm").save("...").
On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram <
haridass.saisri...@gmail.com> wrote:
> Hi,
>
> I am trying to find a simple ex
Ok thanks, but can I also update data instead of insert data ?
- Mail original -
De: "Brett Antonides" <banto...@gmail.com>
À: user@spark.apache.org
Envoyé: Vendredi 2 Octobre 2015 18:18:18
Objet: Re: HDFS small file generation problem
I had a very similar pr
Hi,
I am trying to find a simple example to read a data file on HDFS. The
file has the following format:
a, b, c, yyyy, mm
a1,b1,c1,2015,09
a2,b2,c2,2014,08
I would like to read this file and store it in HDFS partitioned by year and
month, something like this:
/path/to/hdfs/yyyy/mm
I want
Like:
counts.saveAsTextFiles("hdfs://host:port/some/location")
Thanks
Best Regards
On Tue, Sep 29, 2015 at 2:15 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
> I am going thru this example here:
>
> https://github.com/apache/spark/blob/master/exampl
tition_key < '2015-07-01'
GROUP BY KEY1 ,KEY2 ) TAB2
ON TAB1.KEY1 = TAB2.KEY1 AND TAB1.KEY2 = TAB2.KEY1
WHERE partition_key >= '2015-01-01' and partition_key < '2015-07-01'
GROUP BY TAB1.KEY1, TAB1.KEY2""")
I see that ~18,000 HDFS blocks are read TWICE and then the Shuffle
1) It is not required to have the same amount of memory as data.
2) By default the number of partitions is equal to the number of HDFS blocks.
3) Yes, the read operation is lazy.
4) It is okay to have more partitions than cores.
Mohammed
-Original Message-
From: davidkl
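Point 2 above (default partition count follows the number of HDFS blocks) can be sketched numerically; the 90 GB file and 128 MB block size below are illustrative assumptions, not figures from this thread:

```python
import math

file_size_mb = 90 * 1024   # a hypothetical 90 GB input file
block_size_mb = 128        # a common HDFS default block size

# One input split (and hence one partition) per HDFS block.
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(f"{num_blocks} HDFS blocks -> {num_blocks} default partitions")
```

So on a cluster with, say, 72 cores, those 720 partitions simply queue up as roughly ten waves of tasks (point 4: more partitions than cores is fine).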
Hi,
I am going thru this example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py
If I want to write this data to HDFS, what's the right way to do it?
Thanks
I have to store them inside HDFS in order to treat them by PIG
> jobs on-demand.
> The problem is the fact that I generate a lot of small files in HDFS
> (several millions) and it can be problematic.
> I investigated to use Hbase or Archive file but I don't want to do it
> fin
to read an HDFS folder (containing multiple files), I
understand that the number of partitions created is equal to the number of
HDFS blocks, correct? Are those created in a lazy way? I mean, if the number
of blocks/partitions is larger than the number of cores/threads the Spark
driver was launched
But now i want to use spark version 1.4.0.
>
> its also working fine.
>
> But when i try to access the HDFS in spark 1.4.0 in eclipse i am getting the
> following error.
>
> "Exception in thread "main" java.nio.file.FileSystemNotFoundException:
&g
I would suggest not writing small files to HDFS; rather, you can hold them
in memory, maybe off-heap, and then flush them to HDFS using another
job, similar to https://github.com/ptgoetz/storm-hdfs (not sure if Spark
already has something like it).
On Sun, Sep 27, 2015 at 11:36 PM, <
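The hold-in-memory-then-flush idea can be sketched in plain Python; local files stand in for HDFS, and the 64 KB flush threshold, class name, and file-naming scheme are all made-up assumptions:

```python
import os
import tempfile

FLUSH_THRESHOLD = 64 * 1024  # flush once ~64 KB is buffered (illustrative)

class SmallEventBatcher:
    """Buffer many small events and write them out as fewer, larger files."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.buffer = []
        self.buffered_bytes = 0
        self.files_written = 0

    def add(self, event: bytes):
        self.buffer.append(event)
        self.buffered_bytes += len(event)
        if self.buffered_bytes >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, f"batch-{self.files_written:05d}.dat")
        with open(path, "wb") as f:
            f.write(b"".join(self.buffer))
        self.buffer, self.buffered_bytes = [], 0
        self.files_written += 1

with tempfile.TemporaryDirectory() as d:
    batcher = SmallEventBatcher(d)
    for _ in range(1000):          # a thousand ~10 KB events
        batcher.add(b"x" * 10_240)
    batcher.flush()                # flush the final partial batch
    print(f"wrote {batcher.files_written} files instead of 1000")
```

In a real deployment the flush target would be an HDFS path, and you would also flush on a time interval so a slow stream doesn't sit in memory indefinitely.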
Hello,
I'm still investigating the small-file generation problem caused by my Spark
Streaming jobs.
Indeed, my Spark Streaming jobs receive a lot of small events (avg 10KB),
and I have to store them inside HDFS in order to process them with Pig jobs
on demand.
The problem is the fact that I
You could try a couple of things:
a) use Kafka for stream processing; store current incoming events and Spark
streaming job output in Kafka rather than on HDFS, and dual-write to HDFS too
(in a micro-batched mode), so every x minutes. Kafka is more suited to
processing lots of small events/
b
hello,
I am running a Spark application.
I have installed Cloudera Manager; it includes Spark version 1.2.0.
But now I want to use Spark version 1.4.0.
It is also working fine.
But when I try to access HDFS in Spark 1.4.0 in Eclipse, I am getting
the following error:
"Exce
Instead of .map you can try doing a .mapPartitions and see the performance.
Thanks
Best Regards
On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote:
> For a large dataset, I want to filter out something and then do the
> computing intensive work.
>
> What I am doing now:
>
For a large dataset, I want to filter out something and then do the
computing intensive work.
What I am doing now:
Data.filter(somerules).cache()
Data.count()
Data.map(timeintensivecompute)
But this sometimes takes an unusually long time due to cache misses and
recalculation.
So I changed to
n: nameservice1
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
This looks like you're trying to connect to an HA HDFS service but you
have not provided the proper hdfs-site.xml for your app; t
Hi Sam, in short, no, it's a traditional install as we plan to use spot
instances and didn't want price spikes to kill off HDFS.
We're actually doing a bit of a hybrid, using spot instances for the
mesos slaves, ondemand for the mesos masters. So for the time being,
putting hdfs
I've seen similar traces, but couldn't track down the failure completely.
You are using Kerberos for your HDFS cluster, right? AFAIK Kerberos isn't
supported in Mesos deployments.
Can you resolve that host name (nameservice1) from the driver machine (ping
nameservice1)? Can it be resolved from
can rebuild the data as well). OTOH this would mainly only be
beneficial if spark/mesos understood the data locality which is probably
some time off (we don't need this ability now).
Indeed, the error we are seeing is orthogonal to the setup - however my
understanding of ha-hdfs
I'm hitting an odd issue with running spark on mesos together with
HA-HDFS, with an even odder workaround.
In particular I get an error that it can't find the HDFS nameservice
unless I put in a _broken_ url (discovered that workaround by
mistake!). core-site.xml, hdfs-site.xml is distributed
I don't know about the broken url. But are you running HDFS as a mesos
framework? If so is it using mesos-dns?
Then you should resolve the namenode via hdfs:///
On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett <adr...@opensignal.com>
wrote:
> I'm hitting an odd issue with runn
The last time I checked, if you launch EMR 4 with only Spark selected as an
application, HDFS isn't correctly installed.
Did you select another application like Hive at launch time as well as Spark?
If not, try that.
Thanks,
Ewan
-- Original message--
From: Dean Wampler
Date
Ewan,
What issue are you having with HDFS when only Spark is installed? I'm not aware
of any issue like this.
Thanks,
Jonathan
—
Sent from Mailbox
On Wed, Sep 9, 2015 at 11:48 PM, Ewan Leith <ewan.le...@realitymine.com>
wrote:
> The last time I checked, if you lau
I have spark installed on a EC2 cluster. Can I connect to that from my
local sparkR in RStudio? if yes , how ?
Can I read files which I have saved as parquet files on hdfs or s3 in
sparkR ? If yes , How?
Thanks
-Roni
I am trying this -
ddf <- parquetFile(sqlContext, "hdfs://
ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet")
and I get path[1]="hdfs://
ec2-52-26-180-130.us-west-2.compute.amazonaws.com:9000/IPF_14_1.parquet":
No such file or directory
when I
/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation
, https://issues.apache.org/jira/browse/SPARK-7442
From: roni [mailto:roni.epi...@gmail.com]
Sent: Friday, September 11, 2015 3:05 AM
To: user@spark.apache.org
Subject: reading files on HDFS /s3 in sparkR -failing
I am trying
Hi,
I am using Spark on Amazon EMR. So far I have not succeeded in submitting the
application; I'm not sure what the problem is. In the log file I see
the following:
java.io.FileNotFoundException: File does not exist:
hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018
If you log into the cluster, do you see the file if you type:
hdfs dfs
-ls
hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar
(with the correct server address for "ipx-x-x-x"). If not, is the server
address correct an
Hi,
For HDFS files written with the code below:
rdd.saveAsTextFile(getHdfsPath(...), classOf[org.apache.hadoop.io.compress.GzipCodec])
I can see the HDFS files being generated:
0     /lz/streaming/am/144173460/_SUCCESS
1.6 M /lz/streaming/am/144173460/part-0.gz
1.6 M /lz
> directory away first and
> re-create the StreamingContext) as we don't have a real need for that type
> of recovery. However, because the application does reduceByKeyAndWindow
> operations, checkpointing has to be turned on. Do you think this scenario
> will also only work with HDFS, or would having local directories suffice?
Thanks
Nikunj
On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com> wrote:
> Shuffle spills will use local disk, HDFS n
PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Well it is the same as in normal hdfs, delete file and put a new one with
>> the same name works.
>>
>> Le jeu. 3 sept. 2015 à 21:18, <nib...@free.fr> a écrit :
>>
>>> HAR archive see