To: Dave Ariens
Cc: Tim Chen; Olivier Girardot; user@spark.apache.org
Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens
dari...@blackberry.com wrote:
Would there be any way to have the task
. You can check the Hadoop
sources for details. Not sure if there's another way.
*From: *Marcelo Vanzin
*Sent: *Friday, June 26, 2015 6:20 PM
*To: *Dave Ariens
*Cc: *Tim Chen; Olivier Girardot; user@spark.apache.org
*Subject: *Re: Accessing Kerberos Secured HDFS Resources from Spark
in the slaves call
the UGI login with a principal/keytab provided to the driver?
From: Marcelo Vanzin
Sent: Friday, June 26, 2015 5:28 PM
To: Tim Chen
Cc: Olivier Girardot; Dave Ariens; user@spark.apache.org
Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos
On Fri, Jun 26
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Should-I-keep-memory-dedicated-for-HDFS-and-Spark-on-cluster-nodes-tp23451.html
Hello All,
I am new to Spark. I have a very basic question. How do I write the output
of an action on an RDD to HDFS?
Thanks in advance for the help.
Cheers,
Ravi
Hi Chris,
Thanks for the quick reply and the welcome. I am trying to read a file from
HDFS and then write just the first line back to HDFS.
I am calling first() on the RDD to get the first line.
Sent from my iPhone
On Jun 22, 2015, at 7:42 PM, Chris Gore cdg...@cdgore.com wrote:
Hi Ravi,
Welcome, you probably want RDD.saveAsTextFile("hdfs:///my_file")
Chris
On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote:
Hello All,
I am new to Spark. I have a very basic question. How do I write the output of
an action on an RDD to HDFS?
Thanks in advance
You can use fileStream for that; look at the XmlInputFormat
https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
of Mahout. It should give you the full XML object as one record (as opposed to
an XML
Hi Ravi,
For this case, you could simply do
sc.parallelize([rdd.first()]).saveAsTextFile("hdfs:///my_file") using pyspark
or sc.parallelize(Array(rdd.first())).saveAsTextFile("hdfs:///my_file") using
Scala
Chris
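For intuition, the non-distributed analogue of the one-liners above — take the first line of an input and write it out — can be sketched in plain Python. This is only an illustration on local files; the paths and helper name are hypothetical stand-ins for the hdfs:/// URIs in the thread, not Spark API:

```python
import os
import tempfile

def copy_first_line(src_path, dst_path):
    """Read only the first line of src_path and write it to dst_path."""
    with open(src_path) as src:
        first = next(src)          # pulls a single line; the rest is never read
    with open(dst_path, "w") as dst:
        dst.write(first)
    return first

# Tiny local demo (stand-ins for the HDFS paths discussed in the thread).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dst = os.path.join(workdir, "out.txt")
with open(src, "w") as f:
    f.write("line1\nline2\n")
result = copy_first_line(src, dst)
```

In Spark the same shape is rdd.first() on the read side and saveAsTextFile on the write side, with the cluster handling distribution.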
On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote:
Hi Chris,
Thanks
Like this?
val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat],
classOf[LongWritable],
classOf[Text])
Thanks
Best Regards
On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng fengyong...@gmail.com wrote:
Thanks a lot, Akhil
I saw this mail thread before, but still do not understand how
Hi All ,
What is the best way to install a Spark cluster alongside a Hadoop
cluster? Any recommendation for the deployment topology below would be a great
help.
*Also, is it necessary to put the Spark Worker on the DataNodes, since when it
reads a block from HDFS it will be local to the Server / Worker
Thanks a lot, Akhil
I saw this mail thread before, but still do not understand how to use the
XmlInputFormat of Mahout in Spark Streaming (I am not a Spark Streaming expert
yet ;-)). Can you show me some sample code as an explanation?
Thanks in advance,
Yong
On Mon, Jun 22, 2015 at 6:44 AM, Akhil Das
recommendation for the deployment topology below would be a great
help.
*Also, is it necessary to put the Spark Worker on the DataNodes, since when it
reads a block from HDFS it will be local to the Server / Worker, or can I put the
Worker on other nodes, and if I do that will it affect the performance
of Spark
Thanks Akhil
I will have a try and then get back to you.
Yong
On Mon, Jun 22, 2015 at 8:25 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Like this?
val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat],
classOf[LongWritable],
classOf[Text])
Thanks
Best Regards
On Mon, Jun
*Also, is it necessary to put the Spark Worker on the DataNodes, since when it
reads a block from HDFS it will be local to the Server / Worker, or can I put
the Worker on other nodes, and if I do that will it affect the
performance of the Spark data processing?*
Hadoop Option 1
Server 1 - NameNode
Hi Spark Experts
I have a customer who wants to monitor incoming data files (in XML format),
analyze them, and then put the analyzed data into a DB. The size of
each file is about 30MB (or even less in the future). Spark Streaming seems
promising.
After learning Spark Streaming and also
Hi All,
In my use case an HDFS file is the source for a Spark stream.
The job will process the data line by line, but how do I make sure to
maintain the offset (line number of data already processed) across a restart
or new code push?
Team, can you please reply on this: is there any configuration in Spark
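One simple pattern for this restart question is to persist the last processed line number yourself and skip past it after a restart. Spark has its own checkpointing machinery; the sketch below only shows the bare idea, and the helper names and checkpoint file are hypothetical:

```python
import os
import tempfile

def load_offset(path):
    """Return the last processed line number, or 0 if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read().strip())
    return 0

def save_offset(path, line_no):
    """Persist the last processed line number to a small checkpoint file."""
    with open(path, "w") as f:
        f.write(str(line_no))

ckpt = os.path.join(tempfile.mkdtemp(), "offset.ckpt")
assert load_offset(ckpt) == 0   # fresh start: nothing processed yet
save_offset(ckpt, 42)           # after processing 42 lines
resumed = load_offset(ckpt)     # on restart, skip the first `resumed` lines
```

On restart the job would read the file again but drop the first `resumed` lines before processing.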
hey guys
After day one at the Spark Summit SFO, I realized sadly that (indeed) HDFS is
not supported by Databricks Cloud. My speed bottleneck is to transfer ~1TB of
snapshot HDFS data (250+ external Hive tables) to S3 :-(
I want to use Databricks Cloud but this to me is a starting disabler. The
You could consider using Zeppelin and spark on yarn as an alternative.
http://zeppelin.incubator.apache.org/
Simon
On 16 Jun 2015, at 17:58, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey guys
After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS
Hi,
Spark on YARN should help in the memory management for Spark jobs.
Here is a good starting point:
https://spark.apache.org/docs/latest/running-on-yarn.html
YARN integrates well with HDFS and should be a good solution for a large
cluster.
What specific features are you looking for that HDFS
Using sc.textFile will also read the file from HDFS line by line through an
iterator; it doesn't need to fit everything into memory, so it can work even
with a small amount of memory.
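The same lazy, line-at-a-time idea in plain Python, for intuition — iterate over the file instead of reading it whole. The file contents and helper name are hypothetical:

```python
import os
import tempfile

# sc.textFile hands each partition an iterator over lines; the analogous
# plain-Python pattern streams the file without loading it into memory.
def count_long_lines(path, min_len):
    count = 0
    with open(path) as f:
        for line in f:             # one line in memory at a time
            if len(line.rstrip("\n")) >= min_len:
                count += 1
    return count

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "data.txt")
with open(path, "w") as f:
    f.write("short\na much longer line\nmid\n")
n = count_long_lines(path, 10)
```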
2015-06-12 13:19 GMT+08:00 SLiZn Liu sliznmail...@gmail.com:
Hmm, you have a good point. So should I load the file
of strings which
might be
the 1 mil file names. But it also has 2.34 GB of long[]! So far, it
is still running. What are those long[] used for?
When Spark lists files it also needs all the extra metadata about
where the files are in the HDFS cluster. That is a lot more than just
the file's name - see the LocatedFileStatus class in the Hadoop docs
for an idea.
What you could try is to somehow break that input down into smaller
batches, if that's feasible for your app
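The "break that input down into smaller batches" suggestion can be sketched as a plain batching helper over the list of input paths; the names below are hypothetical, and in Spark each batch would become its own sc.binaryFiles call:

```python
def batches(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# e.g. a million HDFS file names processed a few thousand at a time;
# here just ten names in batches of four for illustration.
file_names = ["part-%05d" % i for i in range(10)]
grouped = list(batches(file_names, 4))
```

Submitting one job per batch keeps the per-job file-listing metadata (the long[] arrays discussed above) bounded.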
Hi Spark Users,
I'm trying to load a really big file (50GB when compressed as a gzip file,
stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this
file cannot fit in my memory. However, it looks like no RDD will be
received until I copy this big file to a prior-specified
Spark lists files it also needs all the extra metadata about
where the files are in the HDFS cluster. That is a lot more than just
the file's name - see the LocatedFileStatus class in the Hadoop
docs for an idea.
What you could try is to somehow break that input down into smaller
batches
After 2h of running, I now have a 10GB long[]: 1.3 mil instances of long[].
So probably information about the files again.
I've seen plenty of examples for creating HDFS files from pyspark but
haven't been able to figure out how to delete files from pyspark. Is there
an API I am missing for filesystem management? Or should I be including the
HDFS python modules?
Thanks,
Siegfried
in this use case? 50 GB need not be in
memory. Give it a try with a high number of partitions.
On 11 Jun 2015 23:09, SLiZn Liu sliznmail...@gmail.com wrote:
Hi Spark Users,
I'm trying to load a literally big file (50GB when compressed as gzip
file, stored in HDFS) by receiving a DStream using
Simplest way would be issuing an os.system call with the HDFS rm command from
the driver, assuming it has HDFS connectivity, like a gateway node. Executors
will have nothing to do with it.
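A sketch of that approach: build the HDFS CLI command on the driver and shell out to it. The helper name and path are hypothetical, and the example only constructs the argument list — actually running it requires an `hdfs` client installed on the node:

```python
# Build the argv for `hdfs dfs -rm`, the shell-out approach from the reply.
# -r and -skipTrash are standard flags of the HDFS filesystem shell.
def hdfs_rm_command(path, recursive=False, skip_trash=False):
    cmd = ["hdfs", "dfs", "-rm"]
    if recursive:
        cmd.append("-r")
    if skip_trash:
        cmd.append("-skipTrash")
    cmd.append(path)
    return cmd

cmd = hdfs_rm_command("hdfs:///tmp/old_output", recursive=True)
# To execute on a node with HDFS connectivity:
#   subprocess.check_call(cmd)
```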
On 12 Jun 2015 08:57, Siegfried Bilstein sbilst...@gmail.com wrote:
I've seen plenty of examples for creating HDFS files
Exception {
FSDataOutputStream out = fs.create(pt_temp, true);
IOUtils.copyBytes(sourceContent, out, 4096, false);
out.close();
}
Where is my mistake? Or is there a function to write (append) to Hadoop
HDFS?
best
Hi,
if you now want to write 1 file per partition, that's actually built into
Spark as
*saveAsTextFile*(*path*): Write the elements of the dataset as a text file
(or set of text files) in a given directory in the local filesystem, HDFS
or any other Hadoop-supported file system. Spark will call
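The "1 file per partition" behavior of saveAsTextFile can be mimicked locally for intuition: one part-NNNNN file per partition inside the target directory. Directory name and helper are hypothetical; this is not Spark's actual writer, just the output layout it produces:

```python
import os
import tempfile

def save_partitions_as_text(partitions, out_dir):
    """Write one part-NNNNN file per partition, like saveAsTextFile's layout."""
    os.makedirs(out_dir, exist_ok=True)
    for i, part in enumerate(partitions):
        with open(os.path.join(out_dir, "part-%05d" % i), "w") as f:
            for record in part:
                f.write(str(record) + "\n")

# Two "partitions" of records, written under a directory named like the target.
out_dir = os.path.join(tempfile.mkdtemp(), "my_file")
save_partitions_as_text([["a", "b"], ["c"]], out_dir)
written = sorted(os.listdir(out_dir))
```

So to control how many output files you get from Spark, control the number of partitions (e.g. with repartition or coalesce) before calling saveAsTextFile.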
Or you can do sc.addJar("/path/to/the/jar"); I haven't tested with an HDFS
path, though it works fine with a local path.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 10:17 AM, Jörn Franke jornfra...@gmail.com wrote:
I am not sure they work with HDFS paths. You may want to look at the
source code
, it
is still running. What are those long[] used for?
When Spark lists files it also needs all the extra metadata about where the
files are in the HDFS cluster. That is a lot more than just the file's name
- see the LocatedFileStatus class in the Hadoop docs for an idea.
What you could try
After some time the driver accumulated 6.67GB of long[]. The executor mem
usage so far is low.
wrote:
Thanks Akhil:
The driver fails too fast to get a look at 4040. Is there any other way to
see the download and ship process of the files?
Is the driver supposed to download these jars from HDFS to some location, then
ship them to executors?
I can see from the log that the driver
Thanks Akhil:
The driver fails too fast to get a look at 4040. Is there any other way to see
the download and ship process of the files?
Is the driver supposed to download these jars from HDFS to some location, then ship
them to executors?
I can see from the log that the driver downloaded
by putting the jar with the class in it on the top of your
classpath.
Thanks
Best Regards
On Tue, Jun 9, 2015 at 9:05 AM, Dong Lei dong...@microsoft.com wrote:
Hi, spark-users:
I’m using spark-submit to submit multiple jars and files (all in HDFS) to
run a job, with the following command
By writing PDF files, do you mean something equivalent to a hadoop fs -put
/path?
I'm not sure how PDFBox works though; have you tried writing individually
without Spark?
We can potentially look, once you have established that as a starting point, at
how Spark can be interfaced to write to HDFS
I would like to write PDF files to HDFS from my Spark application using
PDFBox. Can this be done?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Can-a-Spark-App-run-with-spark-submit-write-pdf-files-to-HDFS-tp23233.html
I don't know anything about your use case, so take this with a grain of
salt, but typically if you are operating at a scale that benefits from
Spark, then you likely will not want to write your output records as
individual files into HDFS. Spark has built-in support for the Hadoop
SequenceFile
{
FSDataOutputStream out = fs.create(pt_temp, true);
IOUtils.copyBytes(sourceContent, out, 4096, false);
out.close();
}
Where is my mistake? Or is there a function to write (append) to Hadoop
HDFS?
best regards,
paul
Thanks So much!
I did put sleep on my code to have the UI available.
Now from the UI, I can see:
- In the “SparkProperty” section, the spark.jars and spark.files are
set as what I want.
- In the “Classpath Entries” section, my jar and file paths are
there (with a HDFS path
I am not sure they work with HDFS paths. You may want to look at the
source code. Alternatively you can create a fat jar containing all jars
(let your build tool set correctly METAINF). This always works.
Le mer. 10 juin 2015 à 6:22, Dong Lei dong...@microsoft.com a écrit :
Thanks So much
Hi Jörn:
I started to check the code, and sadly it seems it does not work with an HDFS path:
In HTTPFileServer.scala:
def addFileToDir:
….
Files.copy
….
It looks like it only copies files from local
Hi, spark-users:
I'm using spark-submit to submit multiple jars and files (all in HDFS) to run a
job, with the following command:
spark-submit
--class myClass
--master spark://localhost:7077/
--deploy-mode cluster
--jars hdfs://localhost/1.jar, hdfs://localhost/2.jar
--files hdfs
URI:
127.0.0.1:8020
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.init(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 45 more
I set my path like:
file:///127.0.0.1:8020/user/cloudera/inputs/
(namenode of hadoop)
How must I set the path to HDFS?
HDFS path should be something like: hdfs://127.0.0.1:8020/user/cloudera/inputs/
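The structure of that URI can be checked with a stdlib parse — scheme, namenode host:port, then the path. A small sketch (the address is the one from the thread; nothing Spark-specific here):

```python
from urllib.parse import urlparse

# An HDFS URI has the shape hdfs://<namenode-host>:<port>/<path>.
# The file:/// attempt above fails because the scheme does not match
# and the host:port ends up inside the path component.
uri = urlparse("hdfs://127.0.0.1:8020/user/cloudera/inputs/")
scheme, netloc, path = uri.scheme, uri.netloc, uri.path
```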
On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:
Hello,
I submit my spark job with the following parameters:
./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
--class
Your HDFS path to the Spark job is incorrect.
On 8 June 2015 at 16:24, Nirmal Fernando nir...@wso2.com wrote:
HDFS path should be something like: hdfs://127.0.0.1:8020/user/cloudera/inputs/
On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:
Hello,
I submit my spark
)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
On 08/06/15 15:12, Ewan Leith wrote:
Try putting a * on the end of xmlDir, i.e.
xmlDir = hdfs:///abc/def/*
Rather than
xmlDir = hdfs://abc/def
$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
I run my spark job via spark-submit and it works for another HDFS directory
Try putting a * on the end of xmlDir, i.e.
xmlDir = hdfs:///abc/def/*
Rather than
xmlDir = hdfs://abc/def
and see what happens. I don't know why, but that appears to be more reliable
for me with S3 as the filesystem.
I'm also using binaryFiles, but I've tried running the same command while
No luck, I am afraid. After giving the namenode 16GB of RAM, I am still
getting an out-of-memory exception, kind of a different one:
15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw
exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
Can you do a simple
sc.binaryFiles("hdfs:///path/to/files/*").count()
in the spark-shell and verify that part works?
Ewan
-Original Message-
From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com]
Sent: 08 June 2015 15:40
To: Ewan Leith; user@spark.apache.org
Subject: Re
It was giving the same error, which made me figure out it is the driver -
but the driver running on Hadoop, not the local one. So I did
--conf spark.driver.memory=8g
and now it is processing the files!
Cheers
On 08/06/15 15:52, Ewan Leith wrote:
Can you do a simple
sc.binaryFiles(hdfs
Hi - I'm having a similar problem with switching from ephemeral to persistent
HDFS - it always looks for 9000 port regardless of options I set for 9010
persistent HDFS. Have you figured out a solution? Thanks
It says your namenode is down (connection refused on 8020); you can restart
your HDFS by going into the hadoop directory and typing sbin/stop-dfs.sh and
then sbin/start-dfs.sh
Thanks
Best Regards
On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote:
Hello All,
A bit scared I did
Ahh, this did the trick. I had to get the namenode out of safe mode,
however, before it fully worked.
Thanks!
On Tue, Jun 2, 2015 at 12:09 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
It says your namenode is down (connection refused on 8020), you can restart
your HDFS by going into hadoop
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from the Master Web UI, the vast majority of
completed applications are labeled as not having a history:
http
, 2015 at 12:23 PM, Richard Marscher
rmarsc...@localytics.com wrote:
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from the Master Web UI, the vast majority
the dagScheduler?
Thanks,
Richard
On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com
wrote:
Hi,
In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from
these kill
commands, but I now can't connect to HDFS or start spark. I can't seem
to access Hue. I am afraid I accidentally killed an important process
related to HDFS. But, I am not sure what it would be as I couldn't
even kill the PIDs.
Is it a coincidence that HDFS failed randomly? Likely that I
, 2015, at 20:52, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey guys
On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x;
there are about 300+ Hive tables.
The data is stored as text (moving slowly to Parquet) on HDFS.
I want to use SparkSQL
distribution CDH 5.2.x,
there are about 300+ Hive tables.
The data is stored as text (moving slowly to Parquet) on HDFS.
I want to use SparkSQL and point to the Hive metadata and be able to
define JOINS etc using a programming structure like this
import org.apache.spark.sql.hive.HiveContext
val
in HDFS
hey guys
On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x;
there are about 300+ Hive tables.
The data is stored as text (moving slowly to Parquet) on HDFS.
I want to use SparkSQL and point to the Hive metadata and be able to define
JOINS etc using a programming
hey guys
On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x;
there are about 300+ Hive tables. The data is stored as text (moving slowly to
Parquet) on HDFS. I want to use SparkSQL and point to the Hive metadata and be
able to define JOINS etc using a programming
Any resolution to this? I'm having the same problem.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-YARN-on-AWS-EMR-Issues-finding-file-on-hdfs-tp10214p22918.html
Any resolution to this? I am having the same problem.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/zip-files-submitted-with-py-files-disappear-from-hdfs-after-a-while-on-EMR-tp22342p22919.html
Hi Ayan and Helena,
I've considered using Cassandra/HBase but ended up opting to save to worker
hdfs because I want to take advantage of the data locality since the data
will often be loaded to Spark for further processing. I was also under the
impression that saving to filesystem (instead of db
On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati
nisrina.luthfiy...@gmail.com mailto:nisrina.luthfiy...@gmail.com wrote:
Hi all,
I have a stream of data from Kafka that I want to process and store in hdfs
using Spark Streaming.
Each data has a date/time dimension and I want to write data
Hi all,
I have a stream of data from Kafka that I want to process and store in hdfs
using Spark Streaming.
Each record has a date/time dimension, and I want to write data within the
same time dimension to the same HDFS directory. The data stream might be
unordered (by time dimension).
I'm wondering
from Kafka that I want to process and store in
hdfs using Spark Streaming.
Each record has a date/time dimension, and I want to write data within the
same time dimension to the same hdfs directory. The data stream might be
unordered (by time dimension).
I'm wondering what are the best practices
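One common layout for this kind of time-dimension question is to derive the target directory from each record's timestamp, so records sharing a day land in the same directory regardless of arrival order. A minimal sketch — the base path, partition scheme, and helper name are all hypothetical:

```python
from datetime import datetime

def hdfs_dir_for(ts, base="hdfs:///data/events"):
    """Map a record timestamp to a date-partitioned HDFS directory."""
    return "%s/year=%04d/month=%02d/day=%02d" % (base, ts.year, ts.month, ts.day)

# An unordered stream still routes each record to its own day's directory.
target = hdfs_dir_for(datetime(2015, 6, 22, 13, 45))
```

Writers (whether Spark Streaming batches or anything else) then append under the derived directory, which also keeps later per-day reads cheap.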
with hiveContext as given below -
scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS
'com.abc.api.udf.MyUpper' USING JAR
'hdfs:///users/ravindra/customUDF2.jar'")
I have put the UDF jar in HDFS at the path given above. The same
command works well in the Hive shell
' USING JAR
'hdfs:///users/ravindra/customUDF2.jar')
I have put the UDF jar in HDFS at the path given above. The same
command works well in the Hive shell but fails here in the Spark shell.
And it fails as given below. -
15/05/10 00:41:51 ERROR Task: FAILED
hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS
'com.abc.api.udf.MyUpper' USING JAR
'hdfs:///users/ravindra/customUDF2.jar'")
I have put the UDF jar in HDFS at the path given above. The same
command works well in the Hive shell but fails here in the Spark shell.
And it fails as given
Hi All,
I am trying to create custom UDFs with hiveContext as given below -
scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS
'com.abc.api.udf.MyUpper' USING JAR
'hdfs:///users/ravindra/customUDF2.jar'")
I have put the UDF jar in HDFS at the path given above. The same
Also Kafka has a Hadoop consumer API for doing such things, please refer to
http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi
2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com:
why not try https://github.com/linkedin/camus - camus is kafka to HDFS
pipeline
On Tue
refer
to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi
2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com:
why not try https://github.com/linkedin/camus - camus is kafka to HDFS
pipeline
On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior
rendy.b.jun
Hi
We are using pyspark 1.3 and input is text files located on hdfs.
file structure
day1
file1.txt
file2.txt
day2
file1.txt
file2.txt
...
Question:
1) What is the way to provide this as input for a PySpark job
to indicate that the system is aware that a datanode exists
but is excluded from the operation. So, it looks like it is not partitioned
and Ambari indicates that HDFS is in good health with one NN, one SN, one
DN.
I am unable to figure out what the issue is.
thanks for your help.
On Tue, May 5
Hi all,
I am planning to load data from Kafka to HDFS. Is it normal to use Spark
Streaming to load data from Kafka to HDFS? What are the concerns in doing this?
There is no processing to be done by Spark, only storing data to HDFS
from Kafka for storage and for further Spark processing
Rendy
- which seem to indicate that the system is aware that a datanode exists
but is excluded from the operation. So, it looks like it is not partitioned
and Ambari indicates that HDFS is in good health with one NN, one SN, one
DN.
I am unable to figure out what the issue is.
thanks for your help.
On Tue, May
why not try https://github.com/linkedin/camus - camus is kafka to HDFS
pipeline
On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior
rendy.b.jun...@gmail.com wrote:
Hi all,
I am planning to load data from Kafka to HDFS. Is it normal to use spark
streaming to load data from Kafka to HDFS
What happens when you try to put files into your HDFS from the local filesystem?
Looks like it's an HDFS issue rather than a Spark thing.
On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:
I have searched all replies to this question and not found an answer.
I am running standalone Spark 1.3.1
I have searched all replies to this question and not found an answer. I am
running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by side, on
the same machine, and am trying to write the output of a wordcount program into
HDFS (it works fine writing to a local file, /tmp/wordcount). The only line I added
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Hi,
I am building a mesos cluster for the purposes of using it to run
spark workloads (in addition to other frameworks). I am under the
impression that it is preferable/recommended to run the hdfs datanode
process and the spark slave on the same physical node
and have installed spark cluster over
my system, which has a hadoop cluster. I want to process data
stored in HDFS through spark.
When I am running code in eclipse it is giving the
following warning repeatedly:
scheduler.TaskSchedulerImpl: Initial
installed a spark cluster over my system, which has a
hadoop cluster. I want to process data stored in HDFS through spark.
When I am running code in eclipse it is giving the following warning
repeatedly:
scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure
What machines are HDFS data nodes -- just your master? That would
explain it. Otherwise, is it actually the write that's slow, or is
something else you're doing much faster on the master for other
reasons maybe? like you're actually shipping data via the master first
in some local computation? so
are HDFS data nodes -- just your master? that would
explain it. Otherwise, is it actually the write that's slow or is
something else you're doing much faster on the master for other
reasons maybe? like you're actually shipping data via the master first
in some local computation? so the master's
on the other 2
nodes
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, April 20, 2015 12:57 PM
To: jamborta
Cc: user@spark.apache.org
Subject: Re: writing to hdfs on master node much faster
What machines are HDFS data nodes -- just your master? that would explain
Regards
On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com wrote:
Hi All,
I am new to spark and have installed a spark cluster over my system, which has
a hadoop cluster. I want to process data stored in HDFS through spark.
When I am running code in eclipse it is giving the following
There are a lot of similar problems shared and resolved by users on this same
portal. I have been part of those discussions before. Search for those, please
try them, and let us know if you still face problems.
Thanks and Regards,
Archit Thakur.
On Mon, Apr 20, 2015 at 3:05 PM, madhvi
On Monday 20 April 2015 03:18 PM, Archit Thakur wrote:
There are a lot of similar problems shared and resolved by users on this
same portal. I have been part of those discussions before. Search for
those, please try them, and let us know if you still face problems.
Thanks and Regards,
Archit
wrote:
Hi All,
I am new to spark and have installed a spark cluster over my system, which has
a hadoop cluster. I want to process data stored in HDFS through spark.
When I am running code in eclipse it is giving the following warning
repeatedly:
scheduler.TaskSchedulerImpl: Initial job has not accepted
Hi all,
I have a three-node cluster with identical hardware. I am trying a workflow
where it reads data from hdfs, repartitions it and runs a few map operations,
then writes the results back to hdfs.
It looks like all the computation, including the repartitioning and the
maps, completes within