I have been interested in finding out why I am getting strange behavior
when running a certain Spark job. The job will error out if I place an
action (a .show(1) call) either right after caching the DataFrame or
right before writing the DataFrame back to HDFS. There is a very similar
post to
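For context, a minimal sketch of the pattern being described, assuming Spark 2.x; the paths and the filter are placeholders, not the actual job from the thread:

import org.apache.spark.sql.SparkSession

// Hypothetical reproduction of the described structure: cache, then an action,
// then a write. Paths and transformations are made up for illustration.
val spark = SparkSession.builder().appName("cache-show-repro").getOrCreate()

val df = spark.read.parquet("hdfs:///data/input").filter("value IS NOT NULL")

df.cache()
df.show(1) // action right after caching -- the step reported to trigger the error

df.write.parquet("hdfs:///data/output")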
I don't think these will blow anyone's minds, but:
1) Row counts. Most of our jobs 'recompute the world' nightly, so we can
expect to see fairly predictable row variances (a rough sketch of this check
follows after the list).
2) Rolling snapshots. We can also expect that for some critical datasets
we can compute a rolling average for important
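A rough sketch of the row-count check in item 1, assuming an existing output table; the table name, the source of the previous count, and the 10% threshold are all assumptions:

import org.apache.spark.sql.SparkSession

// Compare today's row count against yesterday's and fail the job if the
// variance falls outside an expected band.
val spark = SparkSession.builder().appName("row-count-check").getOrCreate()

val currentCount = spark.table("nightly_output").count()
val previousCount = 1000000L // e.g. read from the previous run's metadata

val allowedVariance = 0.10 // accept +/- 10% day-over-day change
val ratio = currentCount.toDouble / previousCount
require(math.abs(ratio - 1.0) <= allowedVariance,
  s"Row count $currentCount is outside the expected range around $previousCount")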
Hi Folks,
I'm working on updating a talk and I was wondering if any folks in the
community wanted to share their best practices for validating your Spark
jobs? Are there any counters folks have found useful for
monitoring/validating your Spark jobs?
Cheers,
Holden :)
--
Twitter:
I got it working. It's much faster.
If someone else wants to try it, I:
1) was already using the code from the Presto S3 Hadoop FileSystem
implementation, modified to sever it from the rest of the Presto codebase;
2) extended it and overrode the method "keyFromPath" so that anytime the
Path
Hi All,
Need your thoughts/inputs on a custom Data Source for accessing REST-based
services in parallel using Spark.
Quite often, for business applications (batch oriented), one has to call a
target REST service a large number of times (with different sets of
parameter values/KV pairs).
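A hedged sketch of the "one REST call per parameter set" idea, distributed with plain RDD operations rather than a full custom Data Source; the endpoint, parameter strings, and response handling are hypothetical:

import org.apache.spark.sql.SparkSession
import scala.io.Source

// Fan the parameter sets out across executors and issue one HTTP GET per set.
val spark = SparkSession.builder().appName("rest-fanout").getOrCreate()

val paramSets = Seq("id=1", "id=2", "id=3") // the KV pairs to query with
val responses = spark.sparkContext
  .parallelize(paramSets, numSlices = 3)    // spread the calls over partitions
  .mapPartitions { params =>
    params.map { p =>
      // Each executor issues its own request; retries and error handling omitted.
      Source.fromURL(s"https://example.com/api?$p").mkString
    }
  }

responses.collect().foreach(println)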
Pinging back to see if anybody could provide me with some pointers on how
to stream/batch JSON-to-ORC conversion in Spark SQL, or why I get an OOM
dump with such a small memory footprint?
Thanks,
Alec
On Wed, Nov 15, 2017 at 11:03 AM, Alec Swan wrote:
> Thanks Steve and Vadim
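For reference, a minimal sketch of a batch JSON-to-ORC conversion; the paths and partition count are placeholders, and the schema is inferred here, which on large inputs is itself one possible source of memory pressure (supplying an explicit schema avoids the inference pass):

import org.apache.spark.sql.SparkSession

// Read JSON, repartition so the write is spread across more, smaller tasks,
// and write out as ORC.
val spark = SparkSession.builder().appName("json-to-orc").getOrCreate()

spark.read
  .json("hdfs:///data/input/*.json")
  .repartition(200)
  .write
  .orc("hdfs:///data/output_orc")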
Spark is not adding any STAT metadata to Parquet files in version 1.6.x,
so it is scanning all files for a filter.
(1 to 30).map(i => (i, i.toString)).toDF("a",
"b").sort("a").coalesce(1).write.format("parquet").saveAsTable("metrics")
./parquet-meta /user/hive/warehouse/metrics/*.parquet
file:
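A minimal read-side sketch, assuming the "metrics" table written above and a Spark 1.6 spark-shell sqlContext; without min/max statistics in the footers, a filter like this cannot skip any files or row groups:

// Read the table back and apply a filter that would normally benefit from
// the min/max statistics written per column.
val metrics = sqlContext.read.parquet("/user/hive/warehouse/metrics")
metrics.filter("a = 25").show()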
Hi All,
I am using CDH 5.7, which comes with Spark version 1.6.0. I am saving my
data set as Parquet data and then querying it. The query executes fine, but
when I checked the files generated by Spark, I found statistics (min/max) are
missing for all the columns, and hence filters are not
It's not actually that tough. We already use a custom Hadoop FileSystem for
S3 because when we started using Spark with S3 the native FileSystem was
very unreliable. Ours is based on the code from Presto. (see
Yes, I did the same. It's working. Thanks!
On 21-Nov-2017 4:04 PM, "Fernando Pereira" wrote:
> Did you consider doing string processing to build the SQL expression which
> you can execute with spark.sql(...)?
> Some examples: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Did you consider doing string processing to build the SQL expression which you
can execute with spark.sql(...)?
Some examples:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Cheers
On 21 November 2017 at 03:27, Aakash Basu
wrote:
> Hi all,
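A small sketch of the string-building approach suggested above; the table name, columns, and threshold are made up for illustration:

import org.apache.spark.sql.SparkSession

// Build the SQL text from program data, then hand it to spark.sql(...).
val spark = SparkSession.builder().appName("dynamic-sql").getOrCreate()

val columns = Seq("id", "name", "amount")
val threshold = 100
val query = s"SELECT ${columns.mkString(", ")} FROM transactions WHERE amount > $threshold"
spark.sql(query).show()

As with any string-built SQL, the interpolated values should be validated or escaped before being spliced into the query text.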
Hello Spark Users,
I am getting the below error when I am trying to write a dataset to a Parquet
location. I have enough disk space available. Last time I was facing the same
kind of error, which was resolved by increasing the number of cores in the hyper
parameters. Currently the result set data size is almost 400 GB.