Loading objects only once

2017-09-27 Thread Naveen Swamy
Hello all, I am new to Spark, so please bear with me if this has been discussed earlier. I am trying to run batch inference with Spark, using pre-trained models from DL frameworks. Basically, I want to download a model (which is usually ~500 MB) onto the workers and load the model and run inference

Re: pyspark histogram

2017-09-27 Thread Weichen Xu
If you want to avoid pulling values into Python, you can use the Hive function `histogram_numeric`. You need to enable Hive support with `SparkSession.builder.enableHiveSupport()`, but note that calling a Hive function in Spark will also slow down performance. Spark SQL hasn't implemented `histogram_numeric` yet. But I think it will

Re: Applying a Java script to many files: Java API or also Python API?

2017-09-27 Thread Weichen Xu
I think you have to use the Spark Java API; in PySpark, functions running on Spark executors (such as a map function) can only be written in Python.
On Thu, Sep 28, 2017 at 12:48 AM, Giuseppe Celano <cel...@informatik.uni-leipzig.de> wrote:
> Hi everyone,
>
> I would like to apply a java script to many

Re: CSV write to S3 failing silently with partial completion

2017-09-27 Thread Mcclintic, Abbi
Hi folks, We appear to have mitigated the issue by adding the following configurations to our jobs, with significant improvement in S3 consistency for CSV and JSON (which initially turned out to be worse than CSV): spark.speculation=false

Re: PySpark: Overusing allocated cores / too many processes

2017-09-27 Thread Fabian Böhnlein
It ended up being unintended multi-threading in NumPy, solved by `export MKL_NUM_THREADS=1`.
On Tue, 26 Sep 2017 at 09:05 Fabian Böhnlein wrote:
> Hi all,
>
> above topic has
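The same thread limit can also be applied from Python instead of the shell. The following is a minimal sketch, not the poster's actual setup: the helper name `limit_native_threads` is hypothetical, and limiting `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` in addition to `MKL_NUM_THREADS` is an assumption for NumPy builds that do not use MKL. The variables need to be set before NumPy is first imported in the process.

```python
import os

# Environment variables read by common native math backends.
# Only MKL_NUM_THREADS comes from the thread; the other two are assumptions.
THREAD_VARS = ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def limit_native_threads(num_threads=1):
    """Cap native library threads for this process and build the matching
    Spark settings that forward the same limit to executors."""
    value = str(num_threads)
    for name in THREAD_VARS:
        os.environ[name] = value  # takes effect for libraries imported later
    # spark.executorEnv.<NAME> sets an environment variable on executors.
    return {"spark.executorEnv." + name: value for name in THREAD_VARS}


executor_conf = limit_native_threads(1)

# Assumed usage (not executed here): pass each entry to spark-submit as
#   --conf spark.executorEnv.MKL_NUM_THREADS=1
```

Setting the variables only on the driver does not affect executors, which is why the sketch also returns the `spark.executorEnv.*` entries.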

Applying a Java script to many files: Java API or also Python API?

2017-09-27 Thread Giuseppe Celano
Hi everyone, I would like to apply a java script to many files in parallel. I am wondering whether I should definitely use the Spark Java API, or whether I could also run the script using the Python API (with which I am more familiar) without this affecting performance. Thanks. Giuseppe

pyspark histogram

2017-09-27 Thread Brian Wylie
Hi All, My Google/SO searching is somehow failing on this. I simply want to compute histograms for a column in a Spark dataframe. There are two SO hits on this question: - https://stackoverflow.com/questions/39154325/pyspark-show-histogram-of-a-data-frame-column -

Spark job taking 10s to allocate executors and memory before submitting job

2017-09-27 Thread navneet sharma
Hi, I am running a Spark job that takes 18s in total: 8s for the actual processing logic (business logic) and 10s for allocating executors and memory. How can I reduce this initial time? Any ideas on how to reduce the time before the Spark job reaches the submitted state? Thanks, Navneet Sharma

How to read LZO file in Spark?

2017-09-27 Thread 孫澤恩
Hi All, Currently, I am following this blog post http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ so that I can use `hdfs dfs -text` to read the LZO file. But I want to

Typed dataset from Avro generated classes?

2017-09-27 Thread Joaquin Tarraga
Hi all, I have an Avro generated class (e.g., AvroGeneratedClass) and I am using Encoders.bean to get a typed dataset (e.g., Dataset<AvroGeneratedClass>): Encoder<AvroGeneratedClass> encoder = Encoders.bean(AvroGeneratedClass.class); Dataset<AvroGeneratedClass> ds = sparkSession.read().parquet(filename).as(encoder); I am getting an exception from the