Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been interested in finding out why I am getting strange behavior when running a certain Spark job. The job errors out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the DataFrame back to HDFS. There is a very similar post to
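A minimal sketch of the pattern being described, assuming Spark 2.x; the paths and SparkSession setup are placeholders, not details from the original report:

    import org.apache.spark.sql.SparkSession

    object CacheThenWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cache-then-write").getOrCreate()

        val df = spark.read.parquet("hdfs:///data/input") // hypothetical input

        // cache() only marks the DataFrame; nothing is materialized yet.
        val cached = df.cache()

        // An action such as show(1) forces evaluation and populates the cache.
        // This is the point at which the reported job errors out.
        cached.show(1)

        cached.write.mode("overwrite").parquet("hdfs:///data/output")
        spark.stop()
      }
    }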

Re: What do you pay attention to when validating Spark jobs?

2017-11-21 Thread lucas.g...@gmail.com
I don't think these will blow anyone's minds, but: 1) Row counts. Most of our jobs 'recompute the world' nightly, so we can expect fairly predictable row variances. 2) Rolling snapshots. We can also expect that for some critical datasets we can compute a rolling average for important
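A minimal sketch of the row-count check, assuming daily snapshot paths and a 10% tolerance, both of which are made-up details:

    import org.apache.spark.sql.SparkSession

    object RowCountCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("row-count-check").getOrCreate()

        val today = spark.read.parquet("hdfs:///warehouse/events/dt=today").count()
        val prior = spark.read.parquet("hdfs:///warehouse/events/dt=yesterday").count()

        // Fail the job if the day-over-day variance is outside the expected band.
        val variance = math.abs(today - prior).toDouble / math.max(prior, 1L)
        require(variance < 0.10, s"Row count moved ${variance * 100}% day over day; failing job")

        spark.stop()
      }
    }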

What do you pay attention to when validating Spark jobs?

2017-11-21 Thread Holden Karau
Hi Folks, I'm working on updating a talk and I was wondering if any folks in the community wanted to share their best practices for validating their Spark jobs? Are there any counters folks have found useful for monitoring/validating your Spark jobs? Cheers, Holden :) -- Twitter:
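On the counters question, Spark's built-in accumulators are the usual starting point; a minimal sketch, where the input path, parsing rule, and failure threshold are all assumptions:

    import org.apache.spark.sql.SparkSession

    object AccumulatorValidation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("validate").getOrCreate()
        val sc = spark.sparkContext

        val seen = sc.longAccumulator("records.seen")
        val bad  = sc.longAccumulator("records.bad")

        val parsed = sc.textFile("hdfs:///data/raw").flatMap { line =>
          seen.add(1)
          val cols = line.split(",")
          if (cols.length == 3) Some(cols.mkString("|")) else { bad.add(1); None }
        }
        parsed.saveAsTextFile("hdfs:///data/parsed")

        // Accumulator values are only reliable after an action has run.
        val total = math.max(seen.value.toLong, 1L)
        require(bad.value.toLong.toDouble / total < 0.01,
          s"Too many bad records: ${bad.value} of ${seen.value}")
        spark.stop()
      }
    }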

Re: Writing files to s3 with out temporary directory

2017-11-21 Thread Jim Carroll
I got it working. It's much faster. If someone else wants to try it: 1) I was already using the code from the Presto S3 Hadoop FileSystem implementation, modified to sever it from the rest of the Presto codebase. 2) I extended it and overrode the method "keyFromPath" so that anytime the Path
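A hypothetical sketch of that override; the base class below stands in for the severed Presto fork (in upstream Presto, keyFromPath is private, so a fork is needed to make it overridable), and the key-rewriting rule is a guess at the intent, not the poster's actual code:

    import org.apache.hadoop.fs.Path

    // Stand-in for the forked Presto S3 FileSystem code.
    class ForkedPrestoS3FileSystem {
      protected def keyFromPath(path: Path): String =
        path.toUri.getPath.stripPrefix("/")
    }

    class DirectWriteS3FileSystem extends ForkedPrestoS3FileSystem {
      // Strip any "_temporary/<attempt>" segment so writes land directly at
      // the final key, avoiding the slow copy-and-delete "rename" on S3.
      override protected def keyFromPath(path: Path): String =
        super.keyFromPath(path).replaceAll("/_temporary/[^/]+", "")
    }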

Custom Data Source for getting data from Rest based services

2017-11-21 Thread Sourav Mazumder
Hi All, Need your thoughts/inputs on a custom Data Source for accessing REST-based services in parallel using Spark. Many times, for business applications (batch oriented), one has to call a target REST service a large number of times (with different sets of values for parameters/KV pairs).
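Short of a full custom Data Source, the fan-out itself can be sketched with a plain RDD; the endpoint, parameter list, and partition count below are all placeholders:

    import org.apache.spark.sql.SparkSession
    import scala.io.Source

    object ParallelRestCalls {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rest-fanout").getOrCreate()
        import spark.implicits._

        val endpoint = "http://service.example.com/api" // hypothetical service
        val params = (1 to 1000).map(id => s"id=$id")

        // Each partition issues its own HTTP calls, so requests run in parallel.
        val responses = spark.sparkContext
          .parallelize(params, numSlices = 50)
          .map(qs => Source.fromURL(s"$endpoint?$qs").mkString)

        responses.toDF("body").write.json("hdfs:///data/rest-results")
        spark.stop()
      }
    }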

Re: Process large JSON file without causing OOM

2017-11-21 Thread Alec Swan
Pinging back to see if anybody could provide me with some pointers on how to stream/batch JSON-to-ORC conversion in Spark SQL, or why I get an OOM dump with such a small memory footprint? Thanks, Alec On Wed, Nov 15, 2017 at 11:03 AM, Alec Swan wrote: > Thanks Steve and Vadim
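One common way to keep the memory footprint of JSON-to-ORC conversion down is to supply an explicit schema (avoiding a full inference pass over the input) and repartition before writing; a sketch with made-up paths and schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object JsonToOrc {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("json-to-orc").getOrCreate()

        // Explicit schema: Spark skips the schema-inference pass.
        val schema = StructType(Seq(
          StructField("id", LongType),
          StructField("payload", StringType)
        ))

        spark.read
          .schema(schema)
          .json("hdfs:///data/input-json")
          .repartition(200) // more, smaller tasks => smaller per-task memory
          .write
          .mode("overwrite")
          .orc("hdfs:///data/output-orc")

        spark.stop()
      }
    }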

Re: Spark/Parquet/Statistics question

2017-11-21 Thread Rabin Banerjee
Spark 1.6.x is not adding any statistics metadata to Parquet files, so it scans all files when filtering. (1 to 30).map(i => (i, i.toString)).toDF("a", "b").sort("a").coalesce(1).write.format("parquet").saveAsTable("metrics") ./parquet-meta /user/hive/warehouse/metrics/*.parquet file:

Parquet Filter pushdown not working and statistics are not generating for any column with Spark 1.6 CDH 5.7

2017-11-21 Thread Rabin Banerjee
Hi All, I am using CDH 5.7, which comes with Spark version 1.6.0. I am saving my data set as Parquet data and then querying it. The query executes fine, but when I checked the files generated by Spark, I found that statistics (min/max) are missing for all the columns, and hence filters are not
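A Spark 1.6-era sketch of checking the pushdown setting. One hedged note: parquet-mr of that vintage was affected by PARQUET-251, which causes readers to distrust binary (string) min/max statistics, so even with pushdown enabled only non-string columns may benefit. The table path below is taken from the earlier message:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PushdownCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pushdown-check"))
        val sqlContext = new SQLContext(sc)

        // On by default in 1.6, but worth confirming explicitly.
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

        val df = sqlContext.read.parquet("/user/hive/warehouse/metrics")
        // With working statistics, this filter can skip whole row groups.
        df.filter(df("a") > 25).show()

        sc.stop()
      }
    }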

Re: Writing files to s3 with out temporary directory

2017-11-21 Thread Jim Carroll
It's not actually that tough. We already use a custom Hadoop FileSystem for S3, because when we started using Spark with S3 the native FileSystem was very unreliable. Ours is based on the code from Presto. (see
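For anyone trying the same route, a custom FileSystem is wired in through the standard Hadoop fs.<scheme>.impl property; the implementation class name here is hypothetical:

    import org.apache.spark.sql.SparkSession

    object CustomFsExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("custom-s3-fs")
          // Bind the s3:// scheme to the custom implementation.
          .config("spark.hadoop.fs.s3.impl", "com.example.fs.PrestoBasedS3FileSystem")
          .getOrCreate()

        spark.read.parquet("s3://bucket/path").show(1)
        spark.stop()
      }
    }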

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Aakash Basu
Yes, I did the same. It's working. Thanks! On 21-Nov-2017 4:04 PM, "Fernando Pereira" wrote: > Did you consider doing string processing to build the SQL expression, which > you can execute with spark.sql(...)? > Some examples: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Fernando Pereira
Did you consider doing string processing to build the SQL expression, which you can execute with spark.sql(...)? Some examples: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables Cheers On 21 November 2017 at 03:27, Aakash Basu wrote: > Hi all,
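A minimal sketch of that suggestion; the view name and column list are hypothetical:

    import org.apache.spark.sql.SparkSession

    object DynamicSql {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dynamic-sql").getOrCreate()

        spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

        // Build the SQL text from runtime inputs, then hand it to spark.sql.
        val columns = Seq("user_id", "event_type")
        val query = s"SELECT ${columns.mkString(", ")}, count(*) AS n " +
          s"FROM events GROUP BY ${columns.mkString(", ")}"

        spark.sql(query).show()
        spark.stop()
      }
    }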

Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-21 Thread Chetan Khatri
Hello Spark Users, I am getting the error below when I try to write a dataset to a Parquet location. I have enough disk space available. The last time I faced this kind of error, it was resolved by increasing the number of cores in the job parameters. Currently the result set data size is almost 400 GB
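"Disk quota exceeded" during a write often points at the scratch space used for shuffle and spill rather than at the output path itself; a sketch of redirecting it, where the scratch path is an assumption (and on YARN the node manager's local dirs take precedence over spark.local.dir):

    import org.apache.spark.sql.SparkSession

    object LocalDirExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("local-dir")
          // Point spill/shuffle scratch space at a volume with a larger quota.
          .config("spark.local.dir", "/mnt/large-scratch/spark-tmp")
          .getOrCreate()

        spark.read.parquet("hdfs:///data/in")
          .repartition(400) // smaller per-task spill files
          .write.mode("overwrite").parquet("hdfs:///data/out")

        spark.stop()
      }
    }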