Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
Thanks for the response. I am using Google Cloud. I have a couple of options: 1. I can go for Spark and run SQL queries using sqlContext. 2. Use Hive; as I understand it, Hive will have Spark as its underlying engine. Is that correct? Also, my data is JSON and highly nested. What do you suggest?

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
ML has a DataFrame-based API, while MLlib is RDD-based and will be deprecated as of Spark 2.0. On Thu, Jul 21, 2016 at 10:41 PM, VG wrote: > Why do we have these 2 packages ... ml and mlib? > What is the difference in these > > > > On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler

Re: SparkWebUI and Master URL on EC2

2016-07-21 Thread Ismaël Mejía
Hello, If you are using EMR you probably need to create an SSH tunnel so you can access the web ports of the master instance. https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html Also verify that your EMR cluster is not behind a private VPC, in this case you

Re: MLlib, Java, and DataFrame

2016-07-21 Thread VG
Why do we have these 2 packages ... ml and mllib? What is the difference between them? On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler wrote: > Hi JG, > > If you didn't know this, Spark MLlib has 2 APIs, one of which uses > DataFrames. Take a look at this example >

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
Hi JG, If you didn't know this, Spark MLlib has 2 APIs, one of which uses DataFrames. Take a look at this example https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java This example uses a Dataset, which is
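
For readers looking for a concrete starting point, here is a minimal sketch of the DataFrame-based API (spark.ml) in Scala; the linked Java example follows the same structure. The data and parameter values are purely illustrative, and an existing SparkContext sc is assumed.

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext sc
    import sqlContext.implicits._

    // spark.ml expects a DataFrame with "label" and "features" columns
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (1.0, Vectors.dense(2.0, 1.3, 1.0))
    ).toDF("label", "features")

    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)

    val model = lr.fit(training)        // fits directly on the DataFrame
    model.transform(training).show()    // adds a "prediction" column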

Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ted Yu
You can use this command (assuming log aggregation is turned on): yarn logs --applicationId XX In the log, you should see a snippet such as the following: java.class.path=... FYI On Thu, Jul 21, 2016 at 9:38 PM, Ilya Ganelin wrote: > what's the easiest way to get the

Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
What's the easiest way to get the classpath for the Spark application itself? On Thu, Jul 21, 2016 at 9:37 PM Ted Yu wrote: > Might be classpath issue. > > Mind pastebin'ning the effective class path ? > > Stack trace of NoClassDefFoundError may also help provide some clue.

Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ted Yu
Might be classpath issue. Mind pastebin'ning the effective class path ? Stack trace of NoClassDefFoundError may also help provide some clue. On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin wrote: > Hello - I'm trying to deploy the Spark TimeSeries library in a new >

Applying schema on single column dataframe in java

2016-07-21 Thread raheel-akl
Hi folks, I am reading lines from an Apache webserver log file into a Spark data frame. A sample line from the log file is below: piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853 I have split the values into host, timestamp, path, status
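
A minimal sketch in Scala of one way to do this (the Java API is analogous): parse each line with a regular expression and apply an explicit schema via createDataFrame. The regex, field names and file path are assumptions for illustration, and existing sc / sqlContext are assumed.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Illustrative pattern for the common log format: host, timestamp, request, status, bytes
    val logPattern = """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)""".r

    val schema = StructType(Seq(
      StructField("host", StringType, nullable = true),
      StructField("timestamp", StringType, nullable = true),
      StructField("path", StringType, nullable = true),
      StructField("status", IntegerType, nullable = true)
    ))

    val rows = sc.textFile("access.log").flatMap { line =>
      logPattern.findFirstMatchIn(line).map { m =>
        Row(m.group(1), m.group(2), m.group(4), m.group(5).toInt)   // lines that don't match are skipped
      }
    }

    val logsDF = sqlContext.createDataFrame(rows, schema)
    logsDF.printSchema()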

MLlib, Java, and DataFrame

2016-07-21 Thread Jean Georges Perrin
Hi, I am looking for some really super basic examples of MLlib (like a linear regression over a list of values) in Java. I have found a few, but I only saw them using JavaRDD... and not DataFrame. I was kind of hoping to take my current DataFrame and send it to MLlib. Am I too optimistic?

NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
Hello - I'm trying to deploy the Spark TimeSeries library in a new environment. I'm running Spark 1.6.1 submitted through YARN in a cluster with Java 8 installed on all nodes but I'm getting the NoClassDef at runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is part of Java 8

Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
Thanks Andy From: Saisai Shao Date: Thursday, July 21, 2016 at 6:11 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: How to submit app in cluster mode? port 7077 or 6066 > I think both 6066 and 7077 can

Re: the spark job is so slow - almost frozen

2016-07-21 Thread Gourav Sengupta
Andrew, you have pretty much consolidated my entire experience, please give a presentation in a meetup on this, and send across the links :) Regards, Gourav On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich wrote: > Try: > > - filtering down the data as soon as possible in

Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Gourav Sengupta
If you are using EMR, please try their latest release; if you are using SQL there will be very few reasons left for using Spark at all (particularly given that hiveContext rides a lot on Hive). Just over regular CSV data I have seen Hive on TEZ performance gains of 100x (query 64 million

Re: Programmatic use of UDFs from Java

2016-07-21 Thread Gourav Sengupta
JAVA seriously? On Thu, Jul 21, 2016 at 6:10 PM, Everett Anderson wrote: > Hi, > > In the Java Spark DataFrames API, you can create a UDF, register it, and > then access it by string name by using the convenience UDF classes in > org.apache.spark.sql.api.java >

Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Saisai Shao
I think both 6066 and 7077 will work. 6066 uses the REST way to submit an application, while 7077 is the legacy way. From the user's perspective it should be transparent, with no need to worry about the difference. - URL: spark://hw12100.local:7077 - REST URL: spark://hw12100.local:6066

SVD output within Spark

2016-07-21 Thread Martin Somers
just looking at a comparison between Matlab and Spark for SVD with an input matrix N. This is the Matlab output - yes, a very small matrix: N = [2.5903 -0.0416 0.6023; -0.1236 2.5596 0.7629; 0.0148 -0.0693 0.2490] U = [-0.3706 -0.9284 0.0273; -0.9287 0.3708

How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
I have some very long lived streaming apps. They have been running for several months. I wonder if something has changed recently? I first started working with spark-1.3. I am using the standalone cluster manager. The way I would submit my app to run in cluster mode was port 6066. Looking at

Upgrade from 1.2 to 1.6 - parsing flat files in working directory

2016-07-21 Thread Sumona Routh
Hi all, We are running into a classpath issue when we upgrade our application from 1.2 to 1.6. In 1.2, we load properties from a flat file (from working directory of the spark-submit script) using classloader resource approach. This was executed up front (by the driver) before any processing

Number of sortBy output partitions

2016-07-21 Thread Simone Franzini
Hi all, I am really struggling with the behavior of sortBy. I am running sortBy on a fairly large dataset (~20GB), that I partitioned in 1200 tasks. The output of the sortBy stage in the Spark UI shows that it ran with 1200 tasks. However, when I run the next operation (e.g. filter or
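
For what it's worth, RDD.sortBy takes an explicit numPartitions argument, so one thing worth checking is what partitions.length reports right after the sort and again after the next transformation. A small self-contained sketch (the data and the partition count of 1200 are illustrative):

    val records = sc.parallelize(1 to 1000000).map(i => (i % 997, i))

    // sortBy lets you pin the number of output partitions explicitly
    val sorted = records.sortBy(_._1, ascending = true, numPartitions = 1200)

    println(sorted.partitions.length)                       // expected: 1200
    println(sorted.filter(_._2 % 2 == 0).partitions.length) // filter is narrow, should stay 1200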

Re: SparkWebUI and Master URL on EC2

2016-07-21 Thread Jacek Laskowski
Hi, What's in the logs of spark-shell? There should be the host and port of the web UI. What's the public IP of the host where you execute spark-shell? Use it with port 4040. I don't think you are using a Spark Standalone cluster (the other address with 8080) if you simply run spark-shell (unless you've got spark.master

Re: spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Thanks for the reply RK. Using the first option, my application doesn't recognize spark.driver.extraJavaOptions. With the second option, the issue remains the same: 2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext. org.apache.spark.SparkException: Found both

Re: MultiThreading in Spark 1.6.0

2016-07-21 Thread RK Aduri
Thanks for the idea Maciej. The data is roughly 10 gigs. I’m wondering if there is any way to avoid the collect for each unit operation and somehow capture all such resultant arrays and collect them at once. > On Jul 20, 2016, at 2:52 PM, Maciej Bryński wrote: > > RK Aduri, >

Re: Understanding Spark UI DAGs

2016-07-21 Thread C. Josephson
Ok, so those line numbers in our DAG don't refer to our code. Is there any way to display (or calculate) line numbers that refer to code we actually wrote, or is that only possible in Scala Spark? On Thu, Jul 21, 2016 at 12:24 PM, Jacek Laskowski wrote: > Hi, > > My little

Re: Understanding Spark UI DAGs

2016-07-21 Thread RK Aduri
That -1 is coming from here: PythonRDD.writeIteratorToStream(inputIterator, dataOut) dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION) —> val END_OF_DATA_SECTION = -1 dataOut.writeInt(SpecialLengths.END_OF_STREAM) dataOut.flush() > On Jul 21, 2016, at 12:24 PM, Jacek Laskowski

Re: spark.driver.extraJavaOptions

2016-07-21 Thread RK Aduri
This has worked for me: --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \ You may want to try it. If that doesn't work, then you may use --properties-file.

Re: Understanding Spark UI DAGs

2016-07-21 Thread Jacek Laskowski
Hi, My little understanding of Python-Spark bridge is that at some point the python code communicates over the wire with Spark's backbone that includes PythonRDD [1]. When the CallSite can't be computed, it's null:-1 to denote "nothing could be referred to". [1]

Re: spark.driver.extraJavaOptions

2016-07-21 Thread dhruve ashar
I am not familiar with the CDH distributions. However from the exception, you are setting both SPARK_JAVA_OPTS and specifying individually for driver and executor. Check for the spark-env.sh file in your spark config directory and you could comment/remove the SPARK_JAVA_OPTS entry and add the

Re: Understanding Spark UI DAGs

2016-07-21 Thread C. Josephson
> > It's called a CallSite that shows where the line comes from. You can see > the code yourself given the python file and the line number. > But that's what I don't understand. Which python file? We spark-submit one file called ctr_parsing.py, but it only has 150 lines. So what is MapPartitions

how to resolve you must build spark with hive exception?

2016-07-21 Thread Nomii5007
Hello, I know this question has already been asked, but no one answered it; that is why I am asking again. I am using the Anaconda 3.5 distribution and Spark 1.6.2. I have been following this blog

spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Hi Team, I am using CDH 5.7.1 with Spark 1.6.0. I have a Spark Streaming application that reads from Kafka & does some processing. The issue is that while starting the application in CLUSTER mode, I want to pass a custom log4j.properties file to both driver & executor. I have the below command:

Programmatic use of UDFs from Java

2016-07-21 Thread Everett Anderson
Hi, In the Java Spark DataFrames API, you can create a UDF, register it, and then access it by string name by using the convenience UDF classes in org.apache.spark.sql.api.java . Example

Re: spark and plot data

2016-07-21 Thread Andy Davidson
Hi Pseudo, Plotting, graphing, data visualization, and report generation are common needs in scientific and enterprise computing. Can you tell me more about your use case? What is it about the current process / workflow that you think could be improved by pushing plotting (I assume you mean plotting

add hours to from_unixtimestamp

2016-07-21 Thread Divya Gehlot
Hi, I need to add 8 hours to from_unixtimestamp df.withColumn(from_unixtime(col("unix_timestamp"),fmt)) as "date_time" I am trying the Joda time function def unixToDateTime (unix_timestamp : String) : DateTime = { val utcTS = new DateTime(unix_timestamp.toLong * 1000L) + 8.hours return utcTS }
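
One way that should work with the built-in functions, without Joda: shift the epoch seconds by 8 * 3600 before formatting. A minimal sketch, assuming unix_timestamp holds epoch seconds and fmt is the desired output pattern (both names taken from the question). Note that withColumn takes the new column name as its first argument, which seems to be missing in the snippet above.

    import org.apache.spark.sql.functions.{col, from_unixtime}

    val fmt = "yyyy-MM-dd HH:mm:ss"   // illustrative output pattern
    val result = df.withColumn(
      "date_time",
      from_unixtime(col("unix_timestamp") + 8 * 3600, fmt)   // add 8 hours before formatting
    )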

RE: Role-based S3 access outside of EMR

2016-07-21 Thread Ewan Leith
If you use S3A rather than S3N, it supports IAM roles. I think you can make S3A handle s3:// style URLs so it’s consistent with your EMR paths by adding this to your Hadoop config, probably in core-site.xml: fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
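
The same setting can also be applied programmatically on the Hadoop configuration instead of core-site.xml; a hedged sketch, where the bucket path is illustrative and an IAM instance role means no access/secret keys need to be set:

    // Apply the S3A setting on the Hadoop configuration used by Spark
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    // With an IAM role attached to the instance, no keys are required here
    val df = sqlContext.read.json("s3a://my-bucket/some/prefix/")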

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Todd Nist
This is due to a change in 1.6: by default the Thrift server runs in multi-session mode. You would want to set the following to true in your Spark config, spark-defaults.conf: set spark.sql.hive.thriftServer.singleSession Good write-up here:
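
A minimal sketch of what this looks like when starting the Thrift server from a Spark application; setting the flag on the SparkConf before the contexts are created is the safest place (the app name and table registration are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val conf = new SparkConf()
      .setAppName("thrift-with-context")
      .set("spark.sql.hive.thriftServer.singleSession", "true")   // as described above

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // ... register temp tables on hiveContext here ...
    HiveThriftServer2.startWithContext(hiveContext)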

Re: Role-based S3 access outside of EMR

2016-07-21 Thread Everett Anderson
Hey, FWIW, we are using EMR, actually, in production. The main case I have for wanting to access S3 with Spark outside of EMR is that during development, our developers tend to run EC2 sandbox instances that have all the rest of our code and access to some of the input data on S3. It'd be nice

Re: Load selected rows with sqlContext in the dataframe

2016-07-21 Thread Todd Nist
You can set the dbtable to this: .option("dbtable", "(select * from master_schema where 'TID' = '100_0')") HTH, Todd On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog wrote: > I have a table of size 5GB, and want to load selective rows into dataframe > instead of loading
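
Spelled out as a full read, a hedged sketch: the JDBC URL, driver and credentials are illustrative, and some databases require an alias on the subquery.

    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")          // illustrative
      .option("driver", "org.postgresql.Driver")                    // illustrative
      .option("dbtable", "(select * from master_schema where TID = '100_0') as subset")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()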

spark and plot data

2016-07-21 Thread pseudo oduesp
Hi, I know Spark is an engine for computing over large data sets, and for me working with pyspark it is a wonderful machine. My question: we don't have tools for plotting data, so each time we have to switch and go back to Python to use plot. But when you have a large result, a scatter plot or ROC curve

Upgrading a Hive External Storage Handler...

2016-07-21 Thread Lavelle, Shawn
Hello, I am looking to upgrade a Hive 0.11 external storage handler that was run on Shark 0.9.2 to work on spark-sql 1.6.1. I’ve run into a snag in that it seems that the storage handler is not receiving predicate pushdown information. Being fairly new to Spark’s development, would

Load selected rows with sqlContext in the dataframe

2016-07-21 Thread sujeet jog
I have a table of size 5GB, and want to load selected rows into a dataframe instead of loading the entire table in memory. For me memory is a constraint, hence I would like to periodically load a few sets of rows and perform dataframe operations on them. For the "dbtable" option, is there a way to

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Marco Colombo
Thanks. That is just a typo. I'm using 'spark://10.0.2.15:7077' (standalone), the same URL used for --master in spark-submit. 2016-07-21 16:08 GMT+02:00 Mich Talebzadeh : > Hi Marco > > In your code > > val conf = new SparkConf() >

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Mich Talebzadeh
Hi Marco In your code val conf = new SparkConf() .setMaster("spark://10.0.2.15:7077") .setMaster("local") .set("spark.cassandra.connection.host", "10.0.2.15") .setAppName("spark-sql-dataexample"); As I understand the first .setMaster("spark://:7077 indicates that you are
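
For what it's worth, SparkConf keeps only the last value set for a key, so the second .setMaster("local") silently overrides the standalone URL and the app ends up running locally. A sketch of the conf with a single master setting, using the values from the original code:

    val conf = new SparkConf()
      .setMaster("spark://10.0.2.15:7077")                    // keep exactly one master
      .set("spark.cassandra.connection.host", "10.0.2.15")
      .setAppName("spark-sql-dataexample")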

init() and cleanup() for Spark map functions

2016-07-21 Thread Amit Sela
I have a use case where I use Spark (streaming) as a way to distribute a set of computations, which requires (some) of the computations to call an external service. Naturally, I'd like to manage my connections (per executor/worker). I know this pattern for DStream:
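
The usual pattern for this (described in the streaming programming guide under the foreachRDD design patterns) is to open the connection once per partition. A sketch, where ConnectionPool is a hypothetical helper standing in for whatever client you manage per executor:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val connection = ConnectionPool.getConnection()   // "init": once per partition
        try {
          records.foreach(record => connection.send(record))
        } finally {
          ConnectionPool.returnConnection(connection)     // "cleanup": once per partition
        }
      }
    }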

HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Marco Colombo
Hi all, I have a spark application that was working in 1.5.2, but now has a problem in 1.6.2. Here is an example: val conf = new SparkConf() .setMaster("spark://10.0.2.15:7077") .setMaster("local") .set("spark.cassandra.connection.host", "10.0.2.15")

Re: ML PipelineModel to be scored locally

2016-07-21 Thread Robin East
MLeap is another option (Apache licensed) https://github.com/TrueCar/mleap --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

Using RDD.checkpoint to recover app failure

2016-07-21 Thread harelglik
I am writing a Spark application that has many iterations. I am planning to checkpoint on every Nth iteration to cut the graph of my rdd and clear previous shuffle files. I would also like to be able to restart my application completely using the last checkpoint. I understand that regular
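
A minimal sketch of the periodic-checkpoint part (it does not cover restarting a new application from the checkpoint files); the RDD, the transformation and N are illustrative:

    sc.setCheckpointDir("hdfs:///tmp/app-checkpoints")   // illustrative path

    var current = sc.parallelize(1 to 1000).map(_.toDouble)
    val N = 10
    for (i <- 1 to 100) {
      current = current.map(_ * 1.01).cache()
      if (i % N == 0) {
        current.checkpoint()   // marked for checkpointing, which truncates the lineage
        current.count()        // an action is needed to actually materialize the checkpoint
      }
    }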

Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
It works. Is it better to have hive in this case for better performance ? On Thu, Jul 21, 2016 at 12:30 PM, Simone wrote: > If you have a folder, and a bunch of json inside that folder- yes it > should work. Just set as path something like "path/to/your/folder/*.json"
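
For the record, a minimal sketch of the pure-Spark version Simone describes; the bucket/path layout is illustrative (the same works for gs://, s3:// or hdfs:// paths):

    // A glob can cover the date-wise sub-folders in one go
    val events = sqlContext.read.json("gs://my-bucket/logs/*/*.json")

    events.registerTempTable("events")
    sqlContext.sql("select count(*) from events").show()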

RE: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-21 Thread Joaquin Alzola
Do you have the same as link 1 but in English? * spark-questions-concepts * deep-into-spark-exection-model It seems a really interesting post, but it is in Chinese. I suppose

writing Kafka dstream to local flat file

2016-07-21 Thread Puneet Tripathi
Hi, I am trying to consume from Kafka topics following http://spark.apache.org/docs/latest/streaming-kafka-integration.html Approach one(createStream). I am not able to write it to local text file using saveAsTextFiles() function. Below is the code import pyspark from pyspark import

Re: Spark Job trigger in production

2016-07-21 Thread Lars Albertsson
I assume that you would like to trigger Spark batch jobs, and not streaming jobs. For production jobs, I recommend avoiding scheduling batch jobs directly with cron or cron services like Chronos. Sometimes, jobs will fail, either due to missing input data, or due to execution problems. When it

what contribute to Task Deserialization Time

2016-07-21 Thread patcharee
Hi, I'm running a simple job (reading sequential file and collect data at the driver) with yarn-client mode. When looking at the history server UI, Task Deserialization Time of tasks are quite different (5 ms to 5 s). What contribute to this Task Deserialization Time? Thank you in advance!

Re: Role-based S3 access outside of EMR

2016-07-21 Thread Gourav Sengupta
Hi Teng, This is complete news to me, that people cannot use EMR in production because it's not open sourced; I think that even Werner is not aware of such a problem. Is EMRFS open sourced? I am curious to know what HA stands for. Regards, Gourav On Thu, Jul 21, 2016 at 8:37 AM,

Re: write and call UDF in spark dataframe

2016-07-21 Thread Kabeer Ahmed
Divya: https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html The link gives a complete example of registering a udAf - user defined aggregate function. This is a complete example and this example should give you a

Re: calculate time difference between consecutive rows

2016-07-21 Thread ayan guha
Please post your code and results. Lag will be null for the first record. Also, what data type are you using? Are you using cast? On 21 Jul 2016 14:28, "Divya Gehlot" wrote: > I have a dataset of time as shown below : > Time1 > 07:30:23 > 07:34:34 > 07:38:23 > 07:39:12 >
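
A sketch of the lag approach for reference (in Spark 1.x window functions need a HiveContext); the column name and time format come from the question, everything else is illustrative:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

    val w = Window.orderBy("Time1")
    val withDiff = df
      .withColumn("ts", unix_timestamp(col("Time1"), "HH:mm:ss"))
      .withColumn("prev_ts", lag(col("ts"), 1).over(w))
      .withColumn("diff_seconds", col("ts") - col("prev_ts"))   // null for the first row, as noted above

    withDiff.show()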

Re: Role-based S3 access outside of EMR

2016-07-21 Thread Teng Qiu
There are several reasons why AWS users do not (or cannot) use EMR. One point for us is the security compliance problem: EMR is not open sourced, so we cannot use it in our production system. Second is that EMR does not support HA yet. But to the original question from @Everett : -> Credentials and

Re: XLConnect in SparkR

2016-07-21 Thread Marco Mistroni
Hi, have you tried to use spark-csv (https://github.com/databricks/spark-csv)? After all you can convert an XL file to CSV. HTH. On Thu, Jul 21, 2016 at 4:25 AM, Felix Cheung wrote: > From looking at the XLConnect package, its loadWorkbook() function only >

Re: Best practices to restart Spark jobs programatically from driver itself

2016-07-21 Thread Lars Albertsson
You can use a workflow manager, which gives you tools to handle transient failures in data pipelines. I suggest either Luigi or Airflow. They provide DSLs embedded in Python, so if the primitives provided are insufficient, it is easy to customise Spark tasks with restart logic. Regards, Lars

RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-21 Thread Ravi Aggarwal
Hi Ian, Thanks for the information. I think you are referring to post http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-decides-whether-to-do-BroadcastHashJoin-or-SortMergeJoin-td27369.html. Yeah I could solve above issue of mine using spark.sql.autoBroadcastJoinThreshold=-1, so

Re: Where is the SparkSQL Specification?

2016-07-21 Thread Mich Talebzadeh
Spark SQL is a subset of Hive QL, which by and large supports ANSI-92 SQL, including search predicates like the above: scala> sqlContext.sql("select count(1) from oraclehadoop.channels where channel_desc like ' %b_xx%'").show +---+ |_c0| +---+ | 0| +---+ So check Hive QL language support. HTH Dr

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-21 Thread Taotao.Li
Hi, Sachin, there are no plans to translate these into English currently, sorry for that, but you can check Databricks' blog; there are lots of high-quality and easy-to-understand posts. Or you can check the list in this post of mine and choose the English version: -

Re: Optimize filter operations with sorted data

2016-07-21 Thread Chanh Le
You can check in the Spark UI or in the output of the Spark application how many stages and tasks there are before you partition and after. Also compare the run times. Regards, Chanh On Thu, Jul 7, 2016 at 6:40 PM, tan shai wrote: > How can you verify that it is loading only the part of time

Where is the SparkSQL Specification?

2016-07-21 Thread Linyuxin
Hi All, newbie here. My Spark version is 1.5.1, and I want to know how I can find the specification of Spark SQL, to find out whether ‘a like %b_xx’ or other SQL syntax is supported

Re: calculate time difference between consecutive rows

2016-07-21 Thread Jacek Laskowski
Hi, What was the code you tried? You should use the built-in window aggregates (windows) functions or create one yourself. I haven't tried lag before (and don't think it's what you need really). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark

Re: write and call UDF in spark dataframe

2016-07-21 Thread Jacek Laskowski
On Thu, Jul 21, 2016 at 5:53 AM, Mich Talebzadeh wrote: > something similar Is this going to be in Scala? > def ChangeToDate (word : String) : Date = { > //return > TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word,"dd/MM/"),"-MM-dd")) > val d1 =

Re: write and call UDF in spark dataframe

2016-07-21 Thread Jacek Laskowski
On Thu, Jul 21, 2016 at 4:53 AM, Divya Gehlot wrote: > To be very specific I am looking for UDFs syntax for example which takes > String as parameter and returns integer .. how do we define the return type val f: String => Int = ??? val myUDF = udf(f) or val myUDF =
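
Expanding that into a small end-to-end example (names are illustrative, and a val sqlContext is assumed); the same function can also be registered for SQL with sqlContext.udf.register:

    import org.apache.spark.sql.functions.{col, udf}
    import sqlContext.implicits._

    val wordLength: String => Int = s => if (s == null) 0 else s.length
    val wordLengthUDF = udf(wordLength)

    val words = Seq("spark", "dataframe", "udf").toDF("word")
    words.select(col("word"), wordLengthUDF(col("word")).as("len")).show()

    // Or make it callable from SQL
    sqlContext.udf.register("wordLength", wordLength)
    words.registerTempTable("words")
    sqlContext.sql("select word, wordLength(word) as len from words").show()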

Re: write and call UDF in spark dataframe

2016-07-21 Thread Jacek Laskowski
On Wed, Jul 20, 2016 at 1:22 PM, Rishabh Bhardwaj wrote: > val new_df = df.select(from_unixtime($"time").as("newtime")) or better yet using tick (less typing and more prose than code :)) df.select(from_unixtime('time) as "newtime") Jacek

Unsubscribe

2016-07-21 Thread Kath Gallagher

Re: Re: run spark apps in linux crontab

2016-07-21 Thread Mich Talebzadeh
One more thing. If you run a file interactively and you are interested in capturing the output in a file plus seeing the output on the screen, you can use tee -a ENVFILE=$HOME/dba/bin/environment.ksh if [[ -f $ENVFILE ]] then . $ENVFILE else echo "Abort: $0 failed. No

Re: Understanding Spark UI DAGs

2016-07-21 Thread Jacek Laskowski
On Thu, Jul 21, 2016 at 2:56 AM, C. Josephson wrote: > I just started looking at the DAG for a Spark Streaming job, and had a > couple of questions about it (image inline). > > 1.) What do the numbers in brackets mean, e.g. PythonRDD[805]? > Every RDD has its identifier (as id

Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
That example points to a particular json file. Will it work same way if I point to top level folder containing all json files ? On Thu, Jul 21, 2016 at 12:04 PM, Simone wrote: > Yes you can - have a look here >

Re: run spark apps in linux crontab

2016-07-21 Thread Chanh Le
If you use >, it only writes the print (or println) output to the log file; the other log messages (INFO, WARN, ERROR), which go to stdout, I believe are not written to the log file. But tee can do that. The following command (with the help of the tee command) writes the output both to the screen (stdout) and to the file.

Re: Re: run spark apps in linux crontab

2016-07-21 Thread luohui20001
Got it. The difference: > : all messages go to the log file, leaving no messages in STDOUT; tee: all messages go to the log file and STDOUT at the same time. Thanks & best regards! San.Luo - Original Message - From: Chanh Le

Re: Role-based S3 access outside of EMR

2016-07-21 Thread Gourav Sengupta
But that would mean you would be accessing data over the internet, increasing data read latency and data transmission failures. Why are you not using EMR? Regards, Gourav On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson wrote: > Thanks, Andy. > > I am indeed often doing

Re: Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
There is no database. I read files from Google Cloud Storage / S3 / HDFS. Thanks Ashutosh On Thu, Jul 21, 2016 at 11:50 AM, Sree Eedupuganti wrote: > Database you are using ? >

Reading multiple json files from nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
I need to read a bunch of JSON files kept in date-wise folders and perform SQL queries on them using a data frame. Is it possible to do so? Please provide some pointers. Thanks Ashutosh