Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
PM To: Dave Ariens Cc: Tim Chen; Olivier Girardot; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote: Would there be any way to have the task

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
. You can check the Hadoop sources for details. Not sure if there's another way. *From: *Marcelo Vanzin *Sent: *Friday, June 26, 2015 6:20 PM *To: *Dave Ariens *Cc: *Tim Chen; Olivier Girardot; user@spark.apache.org *Subject: *Re: Accessing Kerberos Secured HDFS Resources from Spark

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
in the slaves call the UGI login with a principal/keytab provided to the driver? From: Marcelo Vanzin Sent: Friday, June 26, 2015 5:28 PM To: Tim Chen Cc: Olivier Girardot; Dave Ariens; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26

Re: Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-24 Thread Akhil Das
-keep-memory-dedicated-for-HDFS-and-Spark-on-cluster-nodes-tp23451.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-23 Thread maxdml
in context: http://apache-spark-user-list.1001560.n3.nabble.com/Should-I-keep-memory-dedicated-for-HDFS-and-Spark-on-cluster-nodes-tp23451.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Fwd: Storing an action result in HDFS

2015-06-22 Thread ravi tella
Hello All, I am new to Spark. I have a very basic question. How do I write the output of an action on an RDD to HDFS? Thanks in advance for the help. Cheers, Ravi

Re: Storing an action result in HDFS

2015-06-22 Thread ddpisfun
Hi Chris, Thanks for the quick reply and the welcome. I am trying to read a file from hdfs and then write back just the first line to hdfs. I am calling first() on the RDD to get the first line. Sent from my iPhone On Jun 22, 2015, at 7:42 PM, Chris Gore cdg...@cdgore.com wrote: Hi Ravi

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, Welcome, you probably want RDD.saveAsTextFile(“hdfs:///my_file”) Chris On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote: Hello All, I am new to Spark. I have a very basic question. How do I write the output of an action on an RDD to HDFS? Thanks in advance

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
You can use fileStream for that, look at the XMLInputFormat https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java of Mahout. It should give you the full XML object as one record, (as opposed to an XML

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, For this case, you could simply do sc.parallelize([rdd.first()]).saveAsTextFile(“hdfs:///my_file”) using pyspark or sc.parallelize(Array(rdd.first())).saveAsTextFile(“hdfs:///my_file”) using Scala Chris On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote: Hi Chris, Thanks
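
A minimal end-to-end sketch of the suggestion above in Scala, assuming a hypothetical input path and app name; first() is an action that returns a plain value to the driver, so it is wrapped back into a one-element RDD before saving to HDFS:

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstLineToHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("first-line-to-hdfs"))
        val rdd = sc.textFile("hdfs:///user/ravi/input.txt")   // hypothetical input path
        // first() returns a plain String to the driver, so wrap it in an RDD again
        sc.parallelize(Seq(rdd.first())).saveAsTextFile("hdfs:///user/ravi/first_line")
        sc.stop()
      }
    }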

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng fengyong...@gmail.com wrote: Thanks a lot, Akhil I saw this mail thread before, but still do not understand how
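
A fuller sketch of the fileStream call above, assuming Mahout's XmlInputFormat (new mapreduce API) is on the classpath and that records are delimited by a hypothetical <page>...</page> tag pair; the start and end tags must be set in the Hadoop configuration before the stream is created:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.mahout.classifier.bayes.XmlInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamWholeXmlFiles {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("xml-stream"), Seconds(30))
        // tell XmlInputFormat which tags open and close one record
        ssc.sparkContext.hadoopConfiguration.set("xmlinput.start", "<page>")
        ssc.sparkContext.hadoopConfiguration.set("xmlinput.end", "</page>")
        val rawXmls = ssc.fileStream[LongWritable, Text, XmlInputFormat]("hdfs:///abc/def")
        rawXmls.map(_._2.toString).print()   // each record is one whole XML element
        ssc.start()
        ssc.awaitTermination()
      }
    }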

Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Ashish Soni
Hi All, What is the best way to install a Spark cluster alongside a Hadoop cluster? Any recommendation for the deployment topology below will be a great help. *Also, is it necessary to put the Spark Worker on DataNodes so that when it reads a block from HDFS it will be local to the Server / Worker

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Yong Feng
Thanks a lot, Akhil. I saw this mail thread before, but still do not understand how to use the XmlInputFormat of Mahout in Spark Streaming (I am not a Spark Streaming expert yet ;-)). Can you show me some sample code for explanation? Thanks in advance, Yong On Mon, Jun 22, 2015 at 6:44 AM, Akhil Das

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
recommendation for below deployment topology will be a great help *Also Is it necessary to put the Spark Worker on DataNodes as when it read block from HDFS it will be local to the Server / Worker or I can put the Worker on any other nodes and if i do that will it affect the performance of the Spark

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Yong Feng
Thanks Akhil, I will give it a try and then get back to you. Yong On Mon, Jun 22, 2015 at 8:25 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread ayan guha
*Also Is it necessary to put the Spark Worker on DataNodes as when it read block from HDFS it will be local to the Server / Worker or I can put the Worker on any other nodes and if i do that will it affect the performance of the Spark Data Processing ..* Hadoop Option 1 Server 1 - NameNode

Fwd: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-21 Thread Yong Feng
Hi Spark Experts, I have a customer who wants to monitor incoming data files (in XML format), analyze them, and then put the analyzed data into a DB. The size of each file is about 30MB (or even less in future). Spark Streaming seems promising. After learning Spark Streaming and also

how to maintain the offset for spark streaming if HDFS is the source

2015-06-16 Thread Manohar753
Hi All, In my use case an HDFS file is the source for a Spark stream; the job will process the data line by line, but how do I make sure to maintain the offset line number (data already processed) when restarting or pushing new code? Team, can you please reply on this: is there any configuration in Spark

HDFS not supported by databricks cloud :-(

2015-06-16 Thread Sanjay Subramanian
hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is not supported by Databricks cloud. My speed bottleneck is to transfer ~1TB of snapshot HDFS data (250+ external hive tables) to S3 :-(  I want to use databricks cloud but this to me is a starting disabler. The

Re: HDFS not supported by databricks cloud :-(

2015-06-16 Thread Simon Elliston Ball
You could consider using Zeppelin and spark on yarn as an alternative. http://zeppelin.incubator.apache.org/ Simon On 16 Jun 2015, at 17:58, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS

Re: Spark application in production without HDFS

2015-06-15 Thread nsalian
Hi, Spark on YARN should help in the memory management for Spark jobs. Here is a good starting point: https://spark.apache.org/docs/latest/running-on-yarn.html YARN integrates well with HDFS and should be a good solution for a large cluster. What specific features are you looking for that HDFS

Re: Spark application in production without HDFS

2015-06-15 Thread rahulkumar-aws
-without-HDFS-tp23260p23322.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Reading Really Big File Stream from HDFS

2015-06-12 Thread Saisai Shao
Using sc.textFile will also read the file from HDFS line by line through an iterator; it doesn't need to fit it all into memory, so even with a small amount of memory it can still work. 2015-06-12 13:19 GMT+08:00 SLiZn Liu sliznmail...@gmail.com: Hmm, you have a good point. So should I load the file
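
A minimal sketch of the point above, with hypothetical paths; textFile streams each partition through an iterator, so the file never has to fit in memory (note that a single gzip file is not splittable, so it arrives as one partition unless repartitioned afterwards):

    val lines = sc.textFile("hdfs:///data/big.txt.gz")          // hypothetical 50GB gzip file
    // repartition after decompression, then process partition by partition
    val errors = lines.repartition(200).filter(_.contains("ERROR")).count()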

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
of strings which might be the 1mil file names. But also it has 2.34 GB of long[] ! That's so far, it is still running. What are those long[] used for? When Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the LocatedFileStatus class in the Hadoop docs for an idea. What you could try is to somehow break that input down into smaller batches, if that's feasible for your app

Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hi Spark Users, I'm trying to load a literally big file (50GB when compressed as a gzip file, stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this file cannot fit in my memory. However, it looks like no RDD will be received until I copy this big file to a prior-specified

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the LocatedFileStatus class in the Hadoop docs for an idea. What you could try is to somehow break that input down into smaller batches

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
after 2h of running, now I got a 10GB long[], 1.3mil instances of long[] So probably information about the files again.

Deleting HDFS files from Pyspark

2015-06-11 Thread Siegfried Bilstein
I've seen plenty of examples for creating HDFS files from pyspark but haven't been able to figure out how to delete files from pyspark. Is there an API I am missing for filesystem management? Or should I be including the HDFS python modules? Thanks, Siegfried

Re: Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
in this use case? 50g need not to be in memory. Give it a try with high number of partitions. On 11 Jun 2015 23:09, SLiZn Liu sliznmail...@gmail.com wrote: Hi Spark Users, I'm trying to load a literally big file (50GB when compressed as gzip file, stored in HDFS) by receiving a DStream using

Re: Deleting HDFS files from Pyspark

2015-06-11 Thread ayan guha
Simplest way would be issuing an os.system call with an HDFS rm command from the driver, assuming it has HDFS connectivity, like a gateway node. Executors will have nothing to do with it. On 12 Jun 2015 08:57, Siegfried Bilstein sbilst...@gmail.com wrote: I've seen plenty of examples for creating HDFS files
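
Besides shelling out to hdfs rm, the same thing can be done through the Hadoop FileSystem API on the driver; a minimal Scala sketch with a hypothetical path (the PySpark equivalent reaches the same Java API through the py4j gateway):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val target = new Path("hdfs:///user/siegfried/old_output")   // hypothetical path
    val fs: FileSystem = target.getFileSystem(new Configuration())
    fs.delete(target, true)   // second argument = delete recursively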

Re: append file on hdfs

2015-06-10 Thread Pa Rö
Exception { FSDataOutputStream out = fs.create(pt_temp, true); IOUtils.copyBytes(sourceContent, out, 4096, false); out.close(); } where is my fault?? or give it a function to write(append) to the hadoop hdfs? best

spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
.nabble.com/spark-uses-too-much-memory-maybe-binaryFiles-with-more-than-1-million-files-in-HDFS-groupBy-or-reduc-tp23253.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: append file on hdfs

2015-06-10 Thread Richard Marscher
Hi, if you now want to write 1 file per partition, that's actually built into Spark as saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
-uses-too-much-memory-maybe-binaryFiles-with-more-than-1-million-files-in-HDFS-groupBy-or-reduc-tp23253p23257.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-10 Thread Akhil Das
Or you can do sc.addJar(/path/to/the/jar); I haven't tested with an HDFS path, though it works fine with a local path. Thanks Best Regards On Wed, Jun 10, 2015 at 10:17 AM, Jörn Franke jornfra...@gmail.com wrote: I am not sure they work with HDFS paths. You may want to look at the source code

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Marcelo Vanzin
, it is still running. What are those long[] used for? When Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the LocatedFileStatus class in the Hadoop docs for an idea. What you could try

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
After some time the driver accumulated 6.67GB of long[] . The executor mem usage so far is low. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-uses-too-much-memory-maybe-binaryFiles-with-more-than-1-million-files-in-HDFS-groupBy-or-reduc

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
wrote: Thanks Akhil: The driver fails so fast to get a look at 4040. Is there any other way to see the download and ship process of the files? Is the driver supposed to download these jars from HDFS to some location, then ship them to executors? I can see from log that the driver

RE: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Dong Lei
Thanks Akhil: The driver fails so fast to get a look at 4040. Is there any other way to see the download and ship process of the files? Is the driver supposed to download these jars from HDFS to some location, then ship them to executors? I can see from log that the driver downloaded

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
by putting the jar with the class in it on the top of your classpath. Thanks Best Regards On Tue, Jun 9, 2015 at 9:05 AM, Dong Lei dong...@microsoft.com wrote: Hi, spark-users: I'm using spark-submit to submit multiple jars and files (all in HDFS) to run a job, with the following command

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread nsalian
By writing PDF files, do you mean something equivalent to a hadoop fs -put /path? I'm not sure how Pdfbox works though, have you tried writing individually without spark? We can potentially look if you have established that as a starting point to see how Spark can be interfaced to write to HDFS

Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread Richard Catlin
I would like to write pdf files using pdfbox to HDFS from my Spark application. Can this be done? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-a-Spark-App-run-with-spark-submit-write-pdf-files-to-HDFS-tp23233.html Sent from the Apache Spark User

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread William Briggs
I don't know anything about your use case, so take this with a grain of salt, but typically if you are operating at a scale that benefits from Spark, then you likely will not want to write your output records as individual files into HDFS. Spark has built-in support for the Hadoop SequenceFile
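
A minimal sketch of the SequenceFile suggestion, assuming a hypothetical renderPdf helper that stands in for the real PDFBox code (PDDocument.save into a ByteArrayOutputStream) and hypothetical paths; many small PDFs end up packed into a few large HDFS files:

    import org.apache.spark.SparkContext._   // Writable conversions (needed on older Spark versions)

    // stand-in for real PDFBox rendering
    def renderPdf(record: String): Array[Byte] = record.getBytes("UTF-8")

    val records = sc.textFile("hdfs:///input/records")            // hypothetical input
    records
      .map(r => (r.hashCode.toString, renderPdf(r)))               // (documentId, pdfBytes)
      .saveAsSequenceFile("hdfs:///output/pdf-sequencefile")       // hypothetical output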

append file on hdfs

2015-06-09 Thread Pa Rö
{ FSDataOutputStream out = fs.create(pt_temp, true); IOUtils.copyBytes(sourceContent, out, 4096, false); out.close(); } where is my fault?? or give it a function to write(append) to the hadoop hdfs? best regards, paul
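
For reference, a minimal sketch of appending through the HDFS client API, assuming append is enabled on the cluster and a hypothetical path; an existing file is opened with fs.append instead of fs.create, which truncates:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = new Path("hdfs:///user/paul/log.txt")             // hypothetical path
    val fs: FileSystem = path.getFileSystem(new Configuration())
    // create() truncates; append() keeps the existing contents
    val out = if (fs.exists(path)) fs.append(path) else fs.create(path)
    out.write("one more line\n".getBytes("UTF-8"))
    out.close()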

RE: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Dong Lei
Thanks So much! I did put sleep on my code to have the UI available. Now from the UI, I can see: · In the “SparkProperty” Section, the spark.jars and spark.files are set as what I want. · In the “Classpath Entries” Section, my jar and file paths are there (with an HDFS path

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Jörn Franke
I am not sure they work with HDFS paths. You may want to look at the source code. Alternatively you can create a fat jar containing all jars (let your build tool set up META-INF correctly). This always works. On Wed, Jun 10, 2015 at 6:22 AM, Dong Lei dong...@microsoft.com wrote: Thanks So much

RE: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Dong Lei
Hi Jörn: I started to check the code and sadly it seems it does not work with HDFS paths: In HTTPFileServer.scala: def addFileToDir: …. Files.copy …. It looks like it only copies files from local

ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-08 Thread Dong Lei
Hi, spark-users: I'm using spark-submit to submit multiple jars and files (all in HDFS) to run a job, with the following command: Spark-submit --class myClass --master spark://localhost:7077/ --deploy-mode cluster --jars hdfs://localhost/1.jar, hdfs://localhost/2.jar --files hdfs

path to hdfs

2015-06-08 Thread Pa Rö
URI: 127.0.0.1:8020 at java.net.URI.checkPath(URI.java:1804) at java.net.URI.init(URI.java:752) at org.apache.hadoop.fs.Path.initialize(Path.java:203) ... 45 more i set my path like: file:///127.0.0.1:8020/user/cloudera/inputs/ (namenode of hadoop) how must I set the path to HDFS

Re: path to hdfs

2015-06-08 Thread Nirmal Fernando
HDFS path should be something like: hdfs://127.0.0.1:8020/user/cloudera/inputs/ On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com wrote: hello, i submit my spark job with the following parameters: ./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \ --class
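
A minimal usage sketch of that path form, assuming the NameNode really listens on 127.0.0.1:8020; note there is no file:// prefix and no space inside the URI:

    val input = sc.textFile("hdfs://127.0.0.1:8020/user/cloudera/inputs/")
    println(input.count())   // fails fast if the NameNode URI is wrong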

Re: path to hdfs

2015-06-08 Thread Jeetendra Gangele
Your HDFS path to the Spark job is incorrect. On 8 June 2015 at 16:24, Nirmal Fernando nir...@wso2.com wrote: HDFS path should be something like: hdfs://127.0.0.1:8020/user/cloudera/inputs/ On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com wrote: hello, i submit my spark

Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) On 08/06/15 15:12, Ewan Leith wrote: Try putting a * on the end of xmlDir, i.e. xmlDir = hdfs:///abc/def/* Rather than xmlDir = hdfs://abc/def

spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Kostas Kougios
$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427) I run my spark job via spark-submit and it works for another HDFS directory

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Try putting a * on the end of xmlDir, i.e. xmlDir = hdfs:///abc/def/* rather than xmlDir = hdfs://abc/def and see what happens. I don't know why, but that appears to be more reliable for me with S3 as the filesystem. I'm also using binaryFiles, but I've tried running the same command while

Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
No luck I am afraid. After giving the namenode 16GB of RAM, I am still getting an out-of-memory exception, a somewhat different one: 15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw exception: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Can you do a simple sc.binaryFiles(hdfs:///path/to/files/*).count() in the spark-shell and verify that part works? Ewan -Original Message- From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com] Sent: 08 June 2015 15:40 To: Ewan Leith; user@spark.apache.org Subject: Re
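
For reference, the sanity check suggested above as it would be typed into the spark-shell (the glob is hypothetical); count() forces the driver to list and split every matching file, which is exactly the step that struggles with a million files:

    val files = sc.binaryFiles("hdfs:///abc/def/*")
    println(files.count())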

Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
It was giving the same error, which made me figure out it is the driver but the driver running on hadoop - not the local one. So I did --conf spark.driver.memory=8g and now it is processing the files! Cheers On 08/06/15 15:52, Ewan Leith wrote: Can you do a simple sc.binaryFiles(hdfs

Re: Required settings for permanent HDFS Spark on EC2

2015-06-05 Thread Nicholas Chammas
to persistent HDFS - it always looks for 9000 port regardless of options I set for 9010 persistent HDFS. Have you figured out a solution? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Required-settings-for-permanent-HDFS-Spark-on-EC2-tp22860p23157

Re: Required settings for permanent HDFS Spark on EC2

2015-06-04 Thread barmaley
Hi - I'm having similar problem with switching from ephemeral to persistent HDFS - it always looks for 9000 port regardless of options I set for 9010 persistent HDFS. Have you figured out a solution? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Re: HDFS Rest Service not available

2015-06-02 Thread Akhil Das
It says your namenode is down (connection refused on 8020), you can restart your HDFS by going into hadoop directory and typing sbin/stop-dfs.sh and then sbin/start-dfs.sh Thanks Best Regards On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote: Hello All, A bit scared I did

Re: HDFS Rest Service not available

2015-06-02 Thread Su She
Ahh, this did the trick; I had to get the name node out of safe mode, however, before it fully worked. Thanks! On Tue, Jun 2, 2015 at 12:09 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It says your namenode is down (connection refused on 8020), you can restart your HDFS by going into hadoop

Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history: http

Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com wrote: Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority

Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
the dagScheduler? Thanks, Richard On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com wrote: Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from

HDFS Rest Service not available

2015-06-01 Thread Su She
these kill commands, but I now can't connect to HDFS or start spark. I can't seem to access Hue. I am afraid I accidentally killed an important process related to HDFS. But, I am not sure what it would be as I couldn't even kill the PIDs. Is it a coincidence that HDFS failed randomly? Likely that I

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-28 Thread Andrew Otto
, 2015, at 20:52, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x; there are about 300+ hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread ayan guha
distribution CDH 5.2.x; there are about 300+ hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL and point to the Hive metadata and be able to define JOINS etc using a programming structure like this import org.apache.spark.sql.hive.HiveContext val

RE: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Cheng, Hao
in HDFS hey guys On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x; there are about 300+ hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL and point to the Hive metadata and be able to define JOINS etc using a programming

Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Sanjay Subramanian
hey guys On the Hive/Hadoop ecosystem we are using Cloudera distribution CDH 5.2.x; there are about 300+ hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL and point to the Hive metadata and be able to define JOINS etc using a programming
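
A minimal sketch of what this looks like, assuming hive-site.xml (pointing at the existing metastore) is on the driver classpath and Spark 1.3+; database, table and column names below are hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val joined = hiveContext.sql(
      """SELECT a.id, b.value
        |FROM mydb.table_a a
        |JOIN mydb.table_b b ON a.id = b.id""".stripMargin)
    joined.show()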

RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2015-05-16 Thread jaredtims
Any resolution to this? I'm having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-YARN-on-AWS-EMR-Issues-finding-file-on-hdfs-tp10214p22918.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: zip files submitted with --py-files disappear from hdfs after a while on EMR

2015-05-16 Thread jaredtims
Any resolution to this? I am having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-files-submitted-with-py-files-disappear-from-hdfs-after-a-while-on-EMR-tp22342p22919.html Sent from the Apache Spark User List mailing list archive

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Nisrina Luthfiyati
Hi Ayan and Helena, I've considered using Cassandra/HBase but ended up opting to save to worker hdfs because I want to take advantage of the data locality since the data will often be loaded to Spark for further processing. I was also under the impression that saving to filesystem (instead of db

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Helena Edelson
. On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati nisrina.luthfiy...@gmail.com wrote: Hi all, I have a stream of data from Kafka that I want to process and store in hdfs using Spark Streaming. Each data has a date/time dimension and I want to write data

Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread Nisrina Luthfiyati
Hi all, I have a stream of data from Kafka that I want to process and store in hdfs using Spark Streaming. Each data has a date/time dimension and I want to write data within the same time dimension to the same hdfs directory. The data stream might be unordered (by time dimension). I'm wondering

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
from Kafka that I want to process and store in hdfs using Spark Streaming. Each data has a date/time dimension and I want to write data within the same time dimension to the same hdfs directory. The data stream might be unordered (by time dimension). I'm wondering what are the best practices
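
One common way to get per-date directories is to key each record by its date and use Hadoop's MultipleTextOutputFormat; a minimal batch sketch with a hypothetical extractDate helper and hypothetical paths (in a streaming job the same save would sit inside foreachRDD with a per-batch output path):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class DateKeyedOutput extends MultipleTextOutputFormat[Any, Any] {
      // route each record into <outputDir>/<date>/part-xxxxx and drop the key from the data
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        s"$key/$name"
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    def extractDate(line: String): String = line.take(10)   // stand-in for real date parsing

    sc.textFile("hdfs:///incoming/events")
      .map(line => (extractDate(line), line))
      .saveAsHadoopFile("hdfs:///warehouse/events",
        classOf[String], classOf[String], classOf[DateKeyedOutput])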

Re: Spark can not access jar from HDFS !!

2015-05-11 Thread Ravindra
with hiveContext as given below - scala hiveContext.sql (CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar') I have put the udf jar in the hdfs at the path given above. The same command works well in the hive shell

Re: Spark can not access jar from HDFS !!

2015-05-11 Thread Ravindra
' USING JAR 'hdfs:///users/ravindra/customUDF2.jar') I have put the udf jar in the hdfs at the path given above. The same command works well in the hive shell but failing here in the spark shell. And it fails as given below. - 15/05/10 00:41:51 ERROR Task: FAILED

Re: Spark can not access jar from HDFS !!

2015-05-09 Thread Michael Armbrust
hiveContext.sql (CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar') I have put the udf jar in the hdfs at the path given above. The same command works well in the hive shell but failing here in the spark shell. And it fails as given

Spark can not access jar from HDFS !!

2015-05-09 Thread Ravindra
Hi All, I am trying to create custom udfs with hiveContext as given below - scala hiveContext.sql (CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' USING JAR 'hdfs:///users/ravindra/customUDF2.jar') I have put the udf jar in the hdfs at the path given above. The same

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Saisai Shao
Also Kafka has a Hadoop consumer API for doing such things, please refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com: why not try https://github.com/linkedin/camus - camus is kafka to HDFS pipeline On Tue

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Rendy Bambang Junior
refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com: why not try https://github.com/linkedin/camus - camus is kafka to HDFS pipeline On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior rendy.b.jun

multiple hdfs folder files input to PySpark

2015-05-05 Thread Oleg Ruchovets
Hi, We are using pyspark 1.3 and the input is text files located on hdfs. File structure: day1/ file1.txt file2.txt; day2/ file1.txt file2.txt ... Question: 1) What is the way to provide these as input for a PySpark job

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
to indicate that the system is aware that a datanode exists but is excluded from the operation. So, it looks like it is not partitioned and Ambari indicates that HDFS is in good health with one NN, one SN, one DN. I am unable to figure out what the issue is. thanks for your help. On Tue, May 5

Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread Rendy Bambang Junior
Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use spark streaming to load data from Kafka to HDFS? What are concerns on doing this? There are no processing to be done by Spark, only to store data to HDFS from Kafka for storage and for further Spark processing Rendy
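
A minimal sketch of the Spark Streaming route, assuming the spark-streaming-kafka artifact is on the classpath and hypothetical ZooKeeper, group and topic names; each batch interval is written to its own timestamped HDFS directory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaToHdfs {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hdfs"), Seconds(60))
        val stream = KafkaUtils.createStream(ssc, "zk1:2181", "hdfs-loader", Map("events" -> 2))
        stream.map(_._2)                                   // keep only the message payload
          .saveAsTextFiles("hdfs:///raw/events/batch")     // one directory per batch interval
        ssc.start()
        ssc.awaitTermination()
      }
    }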

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
- which seem to indicate that the system is aware that a datanode exists but is excluded from the operation. So, it looks like it is not partitioned and Ambari indicates that HDFS is in good health with one NN, one SN, one DN. I am unable to figure out what the issue is. thanks for your help. On Tue, May

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread MrAsanjar .
why not try https://github.com/linkedin/camus - camus is kafka to HDFS pipeline On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use spark streaming to load data from Kafka to HDFS

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files into your HDFS from the local filesystem? Looks like it's an HDFS issue rather than a Spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question not found an answer. I am running standalone Spark 1.3.1

saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan
I have searched all replies to this question and not found an answer. I am running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by side, on the same machine, and trying to write the output of a wordcount program into HDFS (it works fine writing to a local file, /tmp/wordcount). The only line I added

Spark + Mesos + HDFS resource split

2015-04-27 Thread Ankur Chauhan
Hi, I am building a mesos cluster for the purposes of using it to run spark workloads (in addition to other frameworks). I am under the impression that it is preferable/recommended to run the hdfs datanode process and the spark slave on the same physical node

Re: Running spark over HDFS

2015-04-21 Thread madhvi
and have installed a spark cluster over my system having a hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
installed a spark cluster over my system having a hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure

Re: writing to hdfs on master node much faster

2015-04-20 Thread Sean Owen
What machines are HDFS data nodes -- just your master? That would explain it. Otherwise, is it actually the write that's slow, or is something else you're doing much faster on the master for other reasons maybe? Like you're actually shipping data via the master first in some local computation? so

Re: writing to hdfs on master node much faster

2015-04-20 Thread Tamas Jambor
are HDFS data nodes -- just your master? that would explain it. Otherwise, is it actually the write that's slow or is something else you're doing much faster on the master for other reasons maybe? like you're actually shipping data via the master first in some local computation? so the master's

RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
on the other 2 nodes -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, April 20, 2015 12:57 PM To: jamborta Cc: user@spark.apache.org Subject: Re: writing to hdfs on master node much faster What machines are HDFS data nodes -- just your master? that would explain

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
Regards On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com wrote: Hi All, I am new to spark and have installed a spark cluster over my system having a hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following

Re: Running spark over HDFS

2015-04-20 Thread Archit Thakur
There are a lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before. Search for those, please try them, and let us know if you still face problems. Thanks and Regards, Archit Thakur. On Mon, Apr 20, 2015 at 3:05 PM, madhvi

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 03:18 PM, Archit Thakur wrote: There are a lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before. Search for those, please try them, and let us know if you still face problems. Thanks and Regards, Archit

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
wrote: Hi All, I am new to spark and have installed a spark cluster over my system having a hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted

writing to hdfs on master node much faster

2015-04-20 Thread jamborta
Hi all, I have a three node cluster with identical hardware. I am trying a workflow where it reads data from hdfs, repartitions it, and runs a few map operations, then writes the results back to hdfs. It looks like all the computation, including the repartitioning and the maps, completes within
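
A minimal sketch of the workflow described above, with hypothetical paths and a stand-in map operation; if the DataNodes are spread across all three nodes, the write speed should not depend on which node hosts the driver:

    val data = sc.textFile("hdfs:///input/data")
    val result = data.repartition(48)                // spread the work across the three nodes
      .map(_.toUpperCase)                            // stand-in for the real map operations
    result.saveAsTextFile("hdfs:///output/data")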
