Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-05 Thread Aaron Davidson
ConnectionManager has been deprecated and is no longer used by default (NettyBlockTransferService is the replacement). Hopefully you would no longer see these messages unless you have explicitly flipped it back on. On Tue, Aug 4, 2015 at 6:14 PM, Jim Green openkbi...@gmail.com wrote: And also

Re: Total delay per batch in a CSV file

2015-08-05 Thread Saisai Shao
Hi, Lots of Spark Streaming's internal status is exposed through StreamingListener (the same information you see in the web UI), so you could write your own StreamingListener, register it with the StreamingContext to get the internal information of Spark Streaming, and write it to a CSV file. You could check the source code
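A minimal sketch of that suggestion (Scala, assuming Spark 1.4-era listener APIs and a hypothetical output path), appending one CSV row per completed batch:

    import java.io.{FileWriter, PrintWriter}
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Appends batch time, record count and delays (ms) as one CSV row per batch.
    class DelayToCsvListener(path: String) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val out = new PrintWriter(new FileWriter(path, true)) // append mode
        try {
          out.println(Seq(
            info.batchTime.milliseconds,
            info.numRecords,
            info.schedulingDelay.getOrElse(-1L),
            info.processingDelay.getOrElse(-1L),
            info.totalDelay.getOrElse(-1L)).mkString(","))
        } finally out.close()
      }
    }

    // ssc.addStreamingListener(new DelayToCsvListener("/tmp/batch-delays.csv"))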

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
using coalesce might be dangerous, since 1 worker process will need to handle the whole file, and if the file is huge you'll get an OOM. However, it depends on the implementation; I'm not sure how it will be done. Nevertheless, it's worth trying the coalesce method (please post your results). Another option would be
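A minimal sketch of the coalesce approach (RDD name and output path are hypothetical); note that coalesce(1) funnels everything through a single task, which is exactly the OOM risk mentioned above:

    // Collapse the RDD into a single partition so saveAsTextFile writes one part file.
    val merged = results.coalesce(1)
    merged.saveAsTextFile("hdfs:///user/example/merged-output")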

Debugging Spark job in Eclipse

2015-08-05 Thread Deepesh Maheshwari
Hi, The Spark job is executed when you run the start() method of JavaStreamingContext. All the operations like map and flatMap are defined earlier, but even though I put breakpoints in those functions, the breakpoints don't stop there, so how can I debug the Spark jobs? JavaDStream<String>

Implementing algorithms in GraphX pregel

2015-08-05 Thread Krish
Hi, I was recently looking into Spark GraphX as one of the frameworks that can help me solve some graph-related problems. The 'think-like-a-vertex' paradigm is something new to me and I cannot wrap my head around how to implement simple algorithms like Depth First or Breadth First or even getting

RE: Spark SQL support for Hive 0.14

2015-08-05 Thread Ishwardeep Singh
Thanks Steve and Michael for your response. Is there a tentative release date for Spark 1.5? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, August 4, 2015 11:53 PM To: Steve Loughran ste...@hortonworks.com Cc: Ishwardeep Singh ishwardeep.si...@impetus.co.in;

Fwd: MLLIB MulticlassMetrics Unable to find class key

2015-08-05 Thread Hayri Volkan Agun
Hi everyone, After a successful prediction, MulticlassMetrics returns an error about being unable to find the key for the class when computing precision. Is there a way to check whether the metrics contain the class key? -- Hayri Volkan Agun PhD. Student - Anadolu University

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
It seems that coalesce does work; see the following thread: https://www.mail-archive.com/user%40spark.apache.org/msg00928.html On 5 August 2015 at 09:47, Igor Berman igor.ber...@gmail.com wrote: using coalesce might be dangerous, since 1 worker process will need to handle whole file and if the file is

Spark Streaming - CheckPointing issue

2015-08-05 Thread Sadaf
Hi, I've done Twitter streaming using Twitter's streaming user API and Spark Streaming. This runs successfully on my local machine, but when I run this program on the cluster in local mode, it only runs successfully the very first time; later on it gives the following exception. Exception in

Re: About memory leak in spark 1.4.1

2015-08-05 Thread Sea
No one helped me, so I helped myself: I split the cluster into two clusters, 1.4.1 and 1.3.0. -- Original message -- From: Ted Yu yuzhih...@gmail.com; Sent: 2015-08-04 (Tue) 10:28; To: Igor Berman igor.ber...@gmail.com; Cc: Sea 261810...@qq.com; Barak

Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Hayri Volkan Agun
The results in MulticlassMetrics are totally wrong; they are improperly calculated. The confusion matrix may be correct, I don't know, but the per-label scores are wrong. -- Hayri Volkan Agun PhD. Student - Anadolu University

Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
Hi, Thanks a lot for your reply. It seems that it is because of the slowness of the second piece of code. I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])) and the program returns to normal. By the way, I find that when the computation is running, the UI will show scheduler delay. However,

Re: Spark Streaming - CheckPointing issue

2015-08-05 Thread Saisai Shao
Hi, What Spark version do you use? It looks like a problem of configuration recovery; I'm not sure whether it is a Twitter-streaming-specific problem. I tried Kafka streaming with checkpointing enabled on my local machine and saw no such issue. Did you try to set these configurations somewhere? Thanks Saisai
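For reference, a minimal sketch of the standard checkpoint-recovery pattern (Scala; the checkpoint directory and batch interval are hypothetical), where the context and its configuration are rebuilt from the checkpoint on restart:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/example/checkpoints/twitter-app"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("twitter-streaming")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define the input stream and transformations here
      ssc
    }

    // On a restart this restores the context (and its configuration) from the
    // checkpoint instead of calling createContext() again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()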

Re: Debugging Spark job in Eclipse

2015-08-05 Thread Eugene Morozov
Deepesh, you have to call an action to start actual processing. words.count() would do the trick. On 05 Aug 2015, at 11:42, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, As spark job is executed when you run start() method of JavaStreamingContext. All the job like map,

Best practices to call hiveContext in DataFrame.foreach in executor program or how to have a for loop in driver program

2015-08-05 Thread unk1102
Hi, I have the following code, which fires hiveContext.sql() most of the time. My task is to create a few tables and insert values into them after processing all Hive table partitions. So I first fire 'show partitions' and, using its output in a for loop, I call a few methods which create the table if not

Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
Hello, I would love to have Hive merge the small files in my managed Hive context after every query. Right now, I am setting the Hive configuration in my Spark job configuration but Hive is not managing the files. Do I need to set the Hive fields in another place? How do you set Hive

Memory allocation error with Spark 1.5

2015-08-05 Thread Alexis Seigneurin
Hi, I'm receiving a memory allocation error with a recent build of Spark 1.5: java.io.IOException: Unable to acquire 67108864 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:348) at

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-05 Thread narendra
Thanks Akash for the answer. I added the endpoint to the listener and now it is working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-Twitter-Input-from-Kafka-to-Spark-Streaming-tp24131p24142.html Sent from the Apache Spark User List mailing list archive

Re: Spark SQL support for Hive 0.14

2015-08-05 Thread Steve Loughran
On 5 Aug 2015, at 02:08, Ishwardeep Singh ishwardeep.si...@impetus.co.inmailto:ishwardeep.si...@impetus.co.in wrote: Thanks Steve and Michael for your response. Is there a tentative release date for Spark 1.5? The branch is off and going through its bug stabilisation/test phase. This is where

Re: Unable to load native-hadoop library for your platform

2015-08-05 Thread Steve Loughran
On 4 Aug 2015, at 12:26, Sean Owen so...@cloudera.com wrote: Oh good point, does the Windows integration need native libs for POSIX-y file system access? I know there are some binaries shipped for this purpose but wasn't sure if that's part of what's covered in the native libs message. On

Re: Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
Hi Hayri, Can you provide a sample of the expected and actual results? Feynman On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun volkana...@gmail.com wrote: The results in MulticlassMetrics is totally wrong. They are improperly calculated. Confusion matrix may be true I don't know but for

Re: Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
Also, what version of Spark are you using? On Wed, Aug 5, 2015 at 9:57 AM, Feynman Liang fli...@databricks.com wrote: Hi Hayri, Can you provide a sample of the expected and actual results? Feynman On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun volkana...@gmail.com wrote: The results

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Michael Armbrust
This feature isn't currently supported. On Wed, Aug 5, 2015 at 8:43 AM, Brandon White bwwintheho...@gmail.com wrote: Hello, I would love to have hive merge the small files in my managed hive context after every query. Right now, I am setting the hive configuration in my Spark Job

Re: Memory allocation error with Spark 1.5

2015-08-05 Thread Reynold Xin
In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when you have 1G of memory allocated in total and have more than 4 threads. We will reduce the default page size before releasing 1.5. For now, you

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
So there is no good way to merge spark files in a managed hive table right now? On Wed, Aug 5, 2015 at 10:02 AM, Michael Armbrust mich...@databricks.com wrote: This feature isn't currently supported. On Wed, Aug 5, 2015 at 8:43 AM, Brandon White bwwintheho...@gmail.com wrote: Hello, I

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-05 Thread Jim Green
We tested Spark 1.2 and 1.3, and this issue is gone. I know that starting from 1.2, Spark uses Netty instead of NIO. Do you mean that this bypasses the issue? Another question is: why did this error message not show up in Spark 0.9 or older versions? On Tue, Aug 4, 2015 at 11:01 PM, Aaron Davidson

Streaming and calculated-once semantics

2015-08-05 Thread Dimitris Kouzis - Loukas
Hello, here's a simple program that demonstrates my problem: ssc = StreamingContext(sc, 1) input = [ [("k1", 12), ("k2", 14)], [("k1", 22)] ] rawData = ssc.queueStream([sc.parallelize(d, 1) for d in input]) runningRawData = rawData.updateStateByKey(lambda nv, prev: reduce(sum, nv, prev or 0)) def

Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Daniel Haviv
Hi, Is it possible to start the Spark SQL thrift server from within a streaming app so the streamed data could be queried as it comes in? Thank you. Daniel

Spark MLib v/s SparkR

2015-08-05 Thread praveen S
I was wondering when one should go for MLlib or SparkR. What criteria should be considered before choosing either of the solutions for data analysis? Or, what are the advantages of Spark MLlib over SparkR, and of SparkR over MLlib?

Re: Spark MLib v/s SparkR

2015-08-05 Thread Benamara, Ali
SparkR doesn't support all the ML algorithms yet; the upcoming 1.5 release will have more support, but still not all the algorithms that are currently supported in MLlib. SparkR is more of a convenience for R users to get acquainted with Spark at this point. -Ali From: praveen S

Re: Spark MLib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark; b) It also gives the ability to wrangle data in R that is in the Spark ecosystem; c) Coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale,

spark hangs at broadcasting during a filter

2015-08-05 Thread AlexG
I'm trying to load a 1 TB file whose lines i,j,v represent the values of a matrix given as A_{ij} = v, so I can convert it to a Parquet file. Only some of the rows of A are relevant, so the following code first loads the triplets as text, splits them into Tuple3[Int, Int, Double], drops triplets

Re: Spark MLib v/s SparkR

2015-08-05 Thread Charles Earl
What machine learning algorithms are you interested in exploring or using? Start from there or better yet the problem you are trying to solve, and then the selection may be evident. On Wednesday, August 5, 2015, praveen S mylogi...@gmail.com wrote: I was wondering when one should go for MLib

Set Job Descriptions for Scala application

2015-08-05 Thread Rares Vernica
Hello, My Spark application is written in Scala and submitted to a Spark cluster in standalone mode. The Spark Jobs for my application are listed in the Spark UI like this: Job Id Description ... 6 saveAsTextFile at Foo.scala:202 5 saveAsTextFile at Foo.scala:201 4

Re: Set Job Descriptions for Scala application

2015-08-05 Thread Mark Hamstra
SparkContext#setJobDescription or SparkContext#setJobGroup On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica rvern...@gmail.com wrote: Hello, My Spark application is written in Scala and submitted to a Spark cluster in standalone mode. The Spark Jobs for my application are listed in the Spark
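A minimal sketch of what that looks like in practice (RDD names and output paths are hypothetical); the description set before each action shows up in the Description column of the Jobs page:

    // Optional: group related jobs and give the group a readable description.
    sc.setJobGroup("nightly-etl", "Nightly aggregation pipeline")

    sc.setJobDescription("Write daily summary")
    dailySummary.saveAsTextFile("hdfs:///out/daily-summary")

    sc.setJobDescription("Write per-user totals")
    userTotals.saveAsTextFile("hdfs:///out/user-totals")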

Re: spark hangs at broadcasting during a filter

2015-08-05 Thread Philip Weaver
How big is droprows? Try explicitly broadcasting it like this: val broadcastDropRows = sc.broadcast(dropRows) val valsrows = ... .filter(x => !broadcastDropRows.value.contains(x._1)) - Philip On Wed, Aug 5, 2015 at 11:54 AM, AlexG swift...@gmail.com wrote: I'm trying to load a 1 Tb file
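Spelled out a little more (a sketch only: it assumes dropRows is a Set[Int] of row indices and triplets is the RDD[(Int, Int, Double)] parsed from the text file):

    // Broadcast the lookup set once per executor instead of shipping it with every task.
    val broadcastDropRows = sc.broadcast(dropRows)

    val valsrows = triplets.filter(x => !broadcastDropRows.value.contains(x._1))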

HiveContext error

2015-08-05 Thread Stefan Panayotov
Hello, I am trying to define an external Hive table from Spark HiveContext like the following: import org.apache.spark.sql.hive.HiveContext val hiveCtx = new HiveContext(sc) hiveCtx.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS Rentrak_Ratings (Version string, Gen_Date string, Market_Number
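For comparison, a minimal sketch of defining an external table through HiveContext (the column list and LOCATION here are hypothetical, since the original DDL is truncated):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS Rentrak_Ratings (
        |  Version string,
        |  Gen_Date string,
        |  Market_Number int)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |LOCATION '/data/rentrak/ratings'""".stripMargin)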

Re: Label based MLLib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
1.5 has not yet been released; what is the commit hash that you are building? On Wed, Aug 5, 2015 at 10:29 AM, Hayri Volkan Agun volkana...@gmail.com wrote: Hi, In Spark 1.5 I saw a result for precision 1.0 and recall 0.01 for decision tree classification. While precision a hundred the

SparkConf ignoring keys

2015-08-05 Thread Corey Nolet
I've been using SparkConf on my project for quite some time now to store configuration information for its various components. This has worked very well thus far in situations where I have control over the creation of the SparkContext and the SparkConf. I have run into a bit of a problem trying to

Re: large scheduler delay in pyspark

2015-08-05 Thread ayan guha
It seems you want to dedupe your data after the merge, so set(a+b) should also work; you may ditch the list comprehension operation. On 5 Aug 2015 23:55, gen tang gen.tan...@gmail.com wrote: Hi, Thanks a lot for your reply. It seems that it is because of the slowness of the second code. I

How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
I have csv data that is embedded in gzip format on HDFS. *With Pig* a = load '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-3.gz' using PigStorage(); b = limit a 10

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread Philip Weaver
This message means that java.util.Date is not supported by Spark DataFrame. You'll need to use java.sql.Date, I believe. On Wed, Aug 5, 2015 at 8:29 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: That seem to be working. however i see a new exception Code: def formatStringAsDate(dateStr:
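A small sketch of that fix (hypothetical helper; java.sql.Date is one of the date types Spark SQL's DataFrames accept):

    import java.sql.Date
    import java.text.SimpleDateFormat

    // Parse the string into java.sql.Date instead of java.util.Date.
    def formatStringAsDate(dateStr: String): Date =
      new Date(new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime)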

spark on mesos with docker from private repository

2015-08-05 Thread Eyal Fink
Hi, My Spark set up is a cluster on top of mesos using docker containers. I want to pull the docker images from a private repository (currently gcr.io ), and I can't get the authentication to work. I know how to generate a .dockercfg file (running on GCE, using gcloud docker -a). My problem is

Re: Reliable Streaming Receiver

2015-08-05 Thread Sourabh Chandak
Thanks Tathagata. I tried that but BlockGenerator internally uses SystemClock, which is again private. We are using DSE, so we're stuck with Spark 1.2 and hence can't use the receiver-less version. Is it possible to use the same code as a separate API with 1.2? Thanks, Sourabh On Wed, Aug 5, 2015 at 6:13

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Thanks! On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Yes, finally shuffle data will be written to disk for reduce stage to pull, no matter how large you set to shuffle memory fraction. Thanks Saisai On Thu, Aug 6, 2015 at 7:50 AM, Muler

Re: How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ted Yu
Please see the comments at the tail of SPARK-2356 Cheers On Wed, Aug 5, 2015 at 6:04 PM, Ashish Dutt ashish.du...@gmail.com wrote: *Use Case:* To automate the process of data extraction (HDFS), data analysis (pySpark/sparkR) and saving the data back to HDFS programmatically. *Prospective

RE: How to read gzip data in Spark - Simple question

2015-08-05 Thread Ganelin, Ilya
Have you tried reading the spark documentation? http://spark.apache.org/docs/latest/programming-guide.html Thank you, Ilya Ganelin -Original Message- From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.commailto:deepuj...@gmail.com] Sent: Thursday, August 06, 2015 12:41 AM Eastern Standard Time

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Philip Weaver
Absolutely, thanks! On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian lian.cs@gmail.com wrote: We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396 Could you give it a shot to see whether it helps in your case? We've observed ~50x performance boost with schema merging turned

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Cheng Lian
We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396 Could you give it a shot to see whether it helps in your case? We've observed ~50x performance boost with schema merging turned on. Cheng On 8/6/15 8:26 AM, Philip Weaver wrote: I have a parquet directory that was
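Independent of that fix, if every part file in the directory shares the same schema, a sketch like this (assuming the 1.5 Parquet data source's mergeSchema option) avoids reading the footer of every file just to merge schemas:

    // Skip schema merging when all partitions are known to share one schema.
    val df = sqlContext.read
      .option("mergeSchema", "false")
      .parquet("/path/to/partitioned/table")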

Re: Reliable Streaming Receiver

2015-08-05 Thread Dibyendu Bhattacharya
Hi, You can try This Kafka Consumer for Spark which is also part of Spark Packages . https://github.com/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Thu, Aug 6, 2015 at 6:48 AM, Sourabh Chandak sourabh3...@gmail.com wrote: Thanks Tathagata. I tried that but BlockGenerator internally

Re: Upgrade of Spark-Streaming application

2015-08-05 Thread Shushant Arora
Hi, For checkpointing and using the fromOffsets argument: say for the first time when my app starts I don't have any previous state stored and I want to start consuming from the largest offset. 1. Is it possible to specify that in the fromOffsets api - I don't want to use another api which returns

Re: Reliable Streaming Receiver

2015-08-05 Thread Tathagata Das
You could very easily strip out the BlockGenerator code from the Spark source code and use it directly in the same way the Reliable Kafka Receiver uses it. BTW, you should know that we will be deprecating the receiver based approach for the Direct Kafka approach. That is quite flexible, can give
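For reference, a minimal sketch of the direct approach on Spark 1.3+ (broker list and topic name are hypothetical; it won't help on 1.2, where only the receiver-based API exists):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")

    // Receiver-less stream: offsets are tracked by Spark Streaming itself.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)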

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread Philip Weaver
The parallelize method does not read the contents of a file. It simply takes a collection and distributes it to the cluster. In this case, the String is a collection of 67 characters. Use sc.textFile instead of sc.parallelize, and it should work as you want. On Wed, Aug 5, 2015 at 8:12 PM, ÐΞ€ρ@Ҝ
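A one-line sketch of the fix, using the path from the Pig example (textFile decompresses .gz files transparently and returns an RDD[String] of lines):

    val lines = sc.textFile("/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-3.gz")
    lines.take(10).foreach(println)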

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
That seems to be working; however, I see a new exception. Code: def formatStringAsDate(dateStr: String) = new SimpleDateFormat("yyyy-MM-dd").parse(dateStr) //(2015-07-27,12459,,31242,6,Daily,-999,2099-01-01,2099-01-02,1,0,0.1,0,1,-1,isGeo,,,204,694.0,1.9236856708701322E-4,0.0,-4.48,0.0,0.0,0.0,) val

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
How do I persist the RDD to HDFS? On Wed, Aug 5, 2015 at 8:32 PM, Philip Weaver philip.wea...@gmail.com wrote: This message means that java.util.Date is not supported by Spark DataFrame. You'll need to use java.sql.Date, I believe. On Wed, Aug 5, 2015 at 8:29 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
Code: val summary = rowStructText.map(s => s.split(",")).map( { s => Summary(formatStringAsDate(s(0)), s(1).replaceAll("\"", "").toLong, s(3).replaceAll("\"", "").toLong, s(4).replaceAll("\"", "").toInt, s(5).replaceAll("\"", ""),

Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Hi, Consider I'm running WordCount with 100m of data on a 4-node cluster. Assuming my RAM size on each node is 200g and I'm giving my executors 100g (just enough memory for 100m of data): 1. If I have enough memory, can Spark 100% avoid writing to disk? 2. During shuffle, where results have to

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Saisai Shao
Hi Muler, Shuffle data will be written to disk no matter how much memory you have; large memory can alleviate shuffle spill, where temporary files are generated when memory is not enough. Yes, each node writes its shuffle data to file, and it is pulled from disk in the reduce stage through the network framework

Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread Heath Guo
Hi, I have a question about sampling Spark Streaming data, or getting part of the data. For every minute, I only want the data read in during the first 10 seconds, and discard all data in the next 50 seconds. Is there any way to pause reading and discard data in that period? I'm doing this to

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Thanks. So if I have a large enough amount of memory (with enough spark.shuffle.memory), then shuffle spill (in-memory shuffle) doesn't happen per node, but shuffle data still has to ultimately be written to disk so that the reduce stage pulls it across the network? On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao

Windows function examples in pyspark

2015-08-05 Thread jegordon
Hi to all, I'm trying to use some window functions (ntile and percentRank) with a DataFrame but I don't know how to use them. Can anyone help me with this, please? In the Python API documentation there are no examples for them. Specifically, I'm trying to get quantiles of a numeric field in my

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Todd Nist
Hi Daniel, It is possible to create an instance of the Spark SQL Thrift server; however, it seems this project may be what you are looking for: https://github.com/Intel-bigdata/spark-streamingsql Not 100% sure what your use case is, but you can always convert the data into a DF and then issue a query
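A rough sketch of the first option (Scala; the HiveContext, stream, and table name are hypothetical), starting the Thrift server inside the streaming app and re-registering each micro-batch as a temp table:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(ssc.sparkContext)
    HiveThriftServer2.startWithContext(hiveContext)   // JDBC clients can now connect

    // Assume `stream` is a DStream of some case class, e.g. Event(id: Long, msg: String).
    stream.foreachRDD { rdd =>
      hiveContext.createDataFrame(rdd).registerTempTable("streamed_events")
    }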

Re: trying to understand yarn-client mode

2015-08-05 Thread nir
Hi DB Tsai-2, I am trying to run a singleton SparkContext in my container (a spring-boot Tomcat container). When my application bootstraps, I create the SparkContext and keep the reference for future job submissions. I got it working with standalone Spark perfectly, but I am having trouble with yarn

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Saisai Shao
Yes, finally shuffle data will be written to disk for reduce stage to pull, no matter how large you set to shuffle memory fraction. Thanks Saisai On Thu, Aug 6, 2015 at 7:50 AM, Muler mulugeta.abe...@gmail.com wrote: thanks, so if I have enough large memory (with enough spark.shuffle.memory)

Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Philip Weaver
I have a parquet directory that was produced by partitioning by two keys, e.g. like this: df.write.partitionBy("a", "b").parquet("asdf") There are 35 values of a, and about 1100-1200 values of b for each value of a, for a total of over 40,000 partitions. Before running any transformations or actions

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread Heath Guo
Hi Dimitris, Thanks for your reply. Just wondering – are you asking about my streaming input source? I implemented a custom receiver and have been using that. Thanks. From: Dimitris Kouzis - Loukas look...@gmail.commailto:look...@gmail.com Date: Wednesday, August 5, 2015 at 5:27 PM To: Heath

PySpark in Pycharm- unable to connect to remote server

2015-08-05 Thread Ashish Dutt
Use Case: I want to use my laptop (using Win 7 Professional) to connect to the CentOS 6.4 master server using PyCharm. Objective: To write the code in Pycharm on the laptop and then send the job to the server which will do the processing and should then return the result back to the laptop or to

Reliable Streaming Receiver

2015-08-05 Thread Sourabh Chandak
Hi, I am trying to replicate the Kafka Streaming Receiver for a custom version of Kafka and want to create a Reliable receiver. The current implementation uses BlockGenerator which is a private class inside Spark streaming hence I can't use that in my code. Can someone help me with some resources

How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ashish Dutt
*Use Case:* To automate the process of data extraction (HDFS), data analysis (pySpark/sparkR) and saving the data back to HDFS programmatically. *Prospective solutions:* 1. Create a remote server connectivity program in an IDE like pyCharm or RStudio and use it to retrieve the data from HDFS or