Re: Process large JSON file without causing OOM

2017-11-13 Thread Sonal Goyal
If you are running Spark with local[*] as master, there will be a single process whose memory will be controlled by the --driver-memory command-line option to spark-submit. Check http://spark.apache.org/docs/latest/configuration.html: spark.driver.memory (default: 1g), the amount of memory to use for the driver.
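For example, a minimal spark-submit invocation along these lines (the class and jar names below are placeholders, not from the thread):

    # In local[*] mode everything runs inside the one driver JVM,
    # so --driver-memory sets the heap that actually matters.
    spark-submit --master "local[*]" --driver-memory 4g \
      --class com.example.JsonToOrc my-service.jar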

Re: Use of Accumulators

2017-11-13 Thread Holden Karau
So you want to set an accumulator to 1 after a transformation has fully completed? Or what exactly do you want to do? On Mon, Nov 13, 2017 at 9:47 PM vaquar khan wrote: > Confirmed, you can use Accumulators :) > > Regards, > Vaquar khan > > On Mon, Nov 13, 2017 at 10:58

Re: Process large JSON file without causing OOM

2017-11-13 Thread vaquar khan
https://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory Regards, Vaquar khan On Mon, Nov 13, 2017 at 6:22 PM, Alec Swan wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format. Effectively, my Java
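For reference, a minimal PySpark sketch of setting executor memory before the context starts (app name and value are made up; in local[*] mode this setting is moot, since everything runs in the driver JVM):

    from pyspark.sql import SparkSession

    # spark.executor.memory must be set before executors launch;
    # it has no effect on an already-running context.
    spark = (SparkSession.builder
             .appName("memory-demo")
             .config("spark.executor.memory", "4g")
             .getOrCreate())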

Re: Use of Accumulators

2017-11-13 Thread vaquar khan
Confirmed, you can use Accumulators :) Regards, Vaquar khan On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit < kedarnath_di...@persistent.com> wrote: > Hi, > > > We need some way to toggle the flag of a variable in a transformation. > > > We are thinking of making use of Spark Accumulators for

Re: Spark based Data Warehouse

2017-11-13 Thread lucas.g...@gmail.com
Hi Ashish, bear in mind that EMR has some additional tooling available that smooths out some S3 problems that you may / almost certainly will encounter. We are using Spark / S3 not on EMR and have encountered issues with file consistency; you can deal with it, but be aware it's additional

Re: Process large JSON file without causing OOM

2017-11-13 Thread Alec Swan
Hi Joel, Here are the relevant snippets of my code and an OOM error thrown in frameWriter.save(..). Surprisingly, the heap dump is pretty small (~60MB) even though I am running with -Xmx10G and 4G executor and driver memory, as shown below. SparkConf sparkConf = new SparkConf()

Re: Spark based Data Warehouse

2017-11-13 Thread Affan Syed
Another option that we are trying internally is to use Mesos for isolating different jobs or groups. Within a single group, using Livy to create different Spark contexts also works. - Affan On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat wrote: > Thanks Sky Yin. This really

Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks Sky Yin. This really helps. On Nov 14, 2017 12:11 AM, "Sky Yin" wrote: We are running Spark in AWS EMR as a data warehouse. All data are in S3 and metadata in the Hive metastore. We have internal tools to create Jupyter notebooks on the dev cluster. I guess you can use

Re: Process large JSON file without causing OOM

2017-11-13 Thread Joel D
Have you tried increasing driver and executor memory (GC overhead too, if required)? Your code snippet and stack trace would be helpful. On Mon, Nov 13, 2017 at 7:23 PM Alec Swan wrote: > Hello, > > I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB > format.

Process large JSON file without causing OOM

2017-11-13 Thread Alec Swan
Hello, I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB format. Effectively, my Java service starts up an embedded Spark cluster (master=local[*]) and uses Spark SQL to convert JSON to ORC. However, I keep getting OOM errors with large (~1GB) files. I've tried different ways
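A minimal PySpark sketch of the conversion being described (paths are placeholders, and the original service does this through the Java API, so this is illustrative only):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")          # embedded cluster, as in the post
             .appName("json-to-orc")
             .getOrCreate())

    # Read the JSON input; path is made up for the sketch.
    df = spark.read.json("/data/input/")

    # Write ORC with ZLIB compression.
    df.write.option("compression", "zlib").orc("/data/output/")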

Re: Databricks Serverless

2017-11-13 Thread Mark Hamstra
This is not a Databricks forum. On Mon, Nov 13, 2017 at 3:18 PM, Benjamin Kim wrote: > I have a question about this. The documentation compares the concept > similar to BigQuery. Does this mean that we will no longer need to deal > with instances and just pay for execution

Re: Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
To add, we have a CDH 5.12 cluster with Spark 2.2 in our data center. On Mon, Nov 13, 2017 at 3:15 PM Benjamin Kim wrote: > Does anyone know if there is a connector for AWS Kinesis that can be used > as a source for Structured Streaming? > > Thanks. > >

Re: Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Jules Damji
You can use Databricks to connect to Kinesis: https://databricks.com/blog/2017/08/09/apache-sparks-structured-streaming-with-amazon-kinesis-on-databricks.html Cheers, Jules Sent from my iPhone. Pardon the dumb thumb typos :) > On Nov 13, 2017, at 3:15 PM, Benjamin Kim
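For flavor, the connector in the linked post is used as a Structured Streaming source roughly like this (PySpark; the option names are my recollection of the Databricks API and may not be exact, and the connector is Databricks-only, not part of open-source Spark 2.2):

    stream = (spark.readStream
              .format("kinesis")
              .option("streamName", "my-stream")     # placeholder stream name
              .option("region", "us-east-1")
              .option("initialPosition", "latest")
              .load())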

Databricks Serverless

2017-11-13 Thread Benjamin Kim
I have a question about this. The documentation compares the concept similar to BigQuery. Does this mean that we will no longer need to deal with instances and just pay for execution duration and amount of data processed? I’m just curious about how this will be priced. Also, when will it be ready

Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
Does anyone know if there is a connector for AWS Kinesis that can be used as a source for Structured Streaming? Thanks.

Re: Reload some static data during struct streaming

2017-11-13 Thread spark receiver
I need it cached to improve throughput; I only hope it can be refreshed once a day, not every batch. > On Nov 13, 2017, at 4:49 PM, Burak Yavuz wrote: > > I think if you don't cache the JDBC table, then it should auto-refresh. > > On Mon, Nov 13, 2017 at 1:21 PM, spark

Re: Reload some static data during struct streaming

2017-11-13 Thread Burak Yavuz
I think if you don't cache the JDBC table, then it should auto-refresh. On Mon, Nov 13, 2017 at 1:21 PM, spark receiver wrote: > Hi > > I'm using Structured Streaming (Spark 2.2) to receive Kafka messages, and it works > great. The thing is I need to join the Kafka messages with a
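A sketch of the stream-static join under discussion (PySpark; hosts, table, and column names are made up; the Kafka source needs the spark-sql-kafka package). Left uncached, the JDBC side is re-read as micro-batches execute; caching it would freeze a snapshot:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS key",
                          "CAST(value AS STRING) AS value"))

    # Static side: do NOT cache it if it should be re-read over time.
    metadata = (spark.read
                .format("jdbc")
                .option("url", "jdbc:mysql://dbhost/mydb")
                .option("dbtable", "metadata")
                .option("user", "user").option("password", "pass")
                .load())

    joined = stream.join(metadata, "key")   # stream-static join
    query = joined.writeStream.format("console").start()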

Reload some static data during struct streaming

2017-11-13 Thread spark receiver
Hi I'm using Structured Streaming (Spark 2.2) to receive Kafka messages, and it works great. The thing is I need to join the Kafka messages with a relatively static table stored in a MySQL database (let's call it metadata here). So is it possible to reload the metadata table after some time interval (like

Re: Spark based Data Warehouse

2017-11-13 Thread Sky Yin
We are running Spark in AWS EMR as a data warehouse. All data are in S3 and metadata in the Hive metastore. We have internal tools to create Jupyter notebooks on the dev cluster. I guess you can use Zeppelin instead, or Livy? We run Genie as a job server for the prod cluster, so users have to submit

Re: Spark based Data Warehouse

2017-11-13 Thread Deepak Sharma
If you have only 1 user, it's still possible to execute non-blocking long-running queries. The best way is to have different users, with pre-assigned resources, run their queries. HTH Thanks Deepak On Nov 13, 2017 23:56, "ashish rawat" wrote: > Thanks Everyone. I am still

Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks Everyone. I am still not clear on the right way to support multiple users running concurrent queries with Spark. Is it through multiple Spark contexts or through Livy (which creates a single Spark context only)? Also, what kind of isolation is possible with Spark SQL? If

Use of Accumulators

2017-11-13 Thread Kedarnath Dixit
Hi, We need some way to toggle the flag of a variable in a transformation. We are thinking of making use of Spark Accumulators for this purpose. Can we use these as below?

Variables -> Initial Value
Variable1 -> 0
Variable2 -> 0

In one of the transformations, if we need to make
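A sketch of the accumulator pattern being asked about (PySpark; the condition and names are made up). Note that accumulators are effectively write-only inside transformations, are only reliably updated once an action runs, and can be over-counted if a task is retried:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
    sc = spark.sparkContext

    flag = sc.accumulator(0)   # Variable1 -> 0, as in the question

    def check(x):
        if x < 0:              # hypothetical condition that trips the flag
            flag.add(1)
        return x

    rdd = sc.parallelize([1, -2, 3])
    rdd.map(check).count()     # run an action before reading the value

    print(flag.value > 0)      # True if any element tripped the condition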

Re: Spark SQL - Truncate Day / Hour

2017-11-13 Thread Eike von Seggern
Hi, you can truncate datetimes like this (in PySpark), e.g. to 5 minutes:

    import pyspark.sql.functions as F
    df.select((F.floor(F.col('myDateColumn').cast('long') / 300) * 300).cast('timestamp'))

Best, Eike David Hodefi wrote on Mon., 13 Nov 2017 at 12:27:

Re: Spark SQL - Truncate Day / Hour

2017-11-13 Thread David Hodefi
I am familiar with those functions; none of them actually truncates a date. We can use those methods to help implement a truncate method. I think truncating to a day/hour should be as simple as truncate(..., "DD") or truncate(..., "HH"). On Thu, Nov 9, 2017 at 8:23 PM, Gaspar Muñoz
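Until something like that lands (later Spark versions do add a date_trunc function), Eike's epoch-arithmetic trick extends directly to day/hour granularity (PySpark sketch; df and the column name are placeholders, and the day variant truncates in UTC):

    import pyspark.sql.functions as F

    def truncate_to(df, col, seconds):
        # Floor the epoch seconds to a multiple of the window, cast back to timestamp.
        return df.select((F.floor(F.col(col).cast('long') / seconds) * seconds)
                         .cast('timestamp').alias(col))

    hourly = truncate_to(df, 'myDateColumn', 3600)    # "HH"
    daily  = truncate_to(df, 'myDateColumn', 86400)   # "DD" (UTC days)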