Re: Manually reading parquet files.

2019-03-21 Thread Long, Andrew
Thanks a ton for the help! Is there a standardized way of converting InternalRow to Row? I’ve tried this, but I’m getting an exception: val encoder = RowEncoder(df.schema) val rows = readFile(pFile).flatMap(_ match { case r: InternalRow => Seq(r) case b: ColumnarBatch =>
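
[A minimal sketch of one way to complete that conversion, assuming Spark 2.4: resolveAndBind() the encoder before using it, widen the iterator before matching (ColumnarBatch is final and does not extend InternalRow), flatten each batch with rowIterator(), then deserialize with the encoder's fromRow. readFile, pFile, and df are the poster's own names from the thread.]

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.vectorized.ColumnarBatch

    val encoder = RowEncoder(df.schema).resolveAndBind()
    // vectorized scans smuggle ColumnarBatch through an Iterator[InternalRow],
    // so widen to Any before pattern matching
    val rows: Iterator[Row] = readFile(pFile).asInstanceOf[Iterator[Any]].flatMap {
      case r: InternalRow   => Iterator.single(r)
      case b: ColumnarBatch => b.rowIterator().asScala
    }.map(encoder.fromRow) // fromRow was removed in Spark 3.x in favor of createDeserializer()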

Re: Manually reading parquet files.

2019-03-21 Thread Ryan Blue
You're getting InternalRow instances. They probably have the data you want, but the toString representation doesn't match the data for InternalRow. On Thu, Mar 21, 2019 at 3:28 PM Long, Andrew wrote: > Hello Friends, > > > > I’m working on a performance improvement that reads additional parquet

Manually reading parquet files.

2019-03-21 Thread Long, Andrew
Hello Friends, I’m working on a performance improvement that reads additional parquet files in the middle of a lambda, and I’m running into some issues. This is what I’d like to do: ds.mapPartitions(x => { // read parquet file in and perform an operation with x }) Here’s my current POC code but
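
[The preview cuts off before the POC, but the shape of the problem can be sketched. A SparkSession is not usable inside executor-side lambdas, so one hedged approach is to open the side file with the Parquet library directly; sideFilePath and combine are hypothetical names, not the poster's code.]

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetReader

    val result = ds.mapPartitions { iter =>
      // open the extra parquet file on the executor, not the driver;
      // note this re-reads the side file once per partition
      val reader = AvroParquetReader.builder[GenericRecord](new Path(sideFilePath)).build()
      val side = Iterator.continually(reader.read()).takeWhile(_ != null).toVector
      reader.close()
      iter.map(x => combine(x, side)) // hypothetical per-record operation
    }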

Re: Network statistics , network cost

2019-03-21 Thread asma zgolli
Hello, sparkMeasure is a great tool that is indeed helpful for me, but unfortunately it doesn’t measure network communication time/cost. It is stated as a limitation on the GitHub page: - The currently available Spark task metrics can give you precious quantitative information on

Re: [build system] jenkins wedged again, rebooting master node

2019-03-21 Thread shane knapp
I tweaked some Apache settings (MaxClients increased to fix an error I found buried in the logs, and added 'retry' and 'acquire' to the reverse proxy settings to hopefully combat the dreaded 502 response), restarted httpd, and things actually seem quite snappy right now! I'm not holding my breath,

Cross Join

2019-03-21 Thread asma zgolli
Hello, I need to cross my data, and I’m executing a cross join on two dataframes: C = A.crossJoin(B). A has 50 records and B has 5 records, but the result I’m getting with Spark 2.0 is a dataframe C having 50 records; only the first row from B was added to C. Is that a bug in Spark? Asma ZGOLLI PhD
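
[A minimal reproduction sketch, assuming Spark 2.1+ (where Dataset.crossJoin exists) and a spark-shell style SparkSession named spark: a 50-row by 5-row cross join should yield 250 rows.]

    val A = spark.range(50).toDF("a")
    val B = spark.range(5).toDF("b")
    val C = A.crossJoin(B)
    assert(C.count() == 250) // 50 x 5; getting 50 rows would indicate a problem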

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Tom Graves
While I agree with you that it would be ideal to have task-level resources and do a deeper redesign of the scheduler, I think that can be a separate enhancement, as was discussed earlier in the thread. That feature is useful without GPUs. I do realize that they overlap some, but I think

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Mark Hamstra
I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of

Re: Network statistics , network cost

2019-03-21 Thread Saikat Kanjilal
How about using this: https://github.com/LucaCanali/sparkMeasure On Mar 21, 2019, at 7:46 AM, asma zgolli <zgollia...@gmail.com> wrote: Hello, is there a way to get network statistics, server and distribution statistics from Spark? I’m looking for that
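
[For reference, a minimal sparkMeasure sketch along the lines of the project's README; the package coordinates and version are illustrative.]

    // spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13
    val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
    stageMetrics.runAndMeasure {
      spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    }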

Network statistics , network cost

2019-03-21 Thread asma zgolli
Hello, is there a way to get network statistics, server and distribution statistics from Spark? I’m looking for that information in order to work on network communication performance. Thank you very much for your help. Kind regards, Asma ZGOLLI PhD student in data engineering - computer

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Tom Graves
The proposal here is that all your resources are static and the GPU-per-task config is global per application, meaning you ask for a certain amount of memory, CPUs, and GPUs for every executor up front, just like you do today, and every executor you get is that size. This means that both static or
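
[A hedged sketch of that static, application-level shape; spark.task.accelerator.gpu.count is the name proposed in the SPIP discussion, not a released Spark config.]

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.instances", "10")        // fleet sized up front
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .config("spark.task.cpus", "1")
      .config("spark.task.accelerator.gpu.count", "1") // proposed name, per this thread
      .getOrCreate()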

Re: Introduce FORMAT clause to CAST with SQL:2016 datetime patterns

2019-03-21 Thread Gabor Kaszab
Thanks for the quick feedback, Maciej and Shawn! Maciej: The concern about confusing users by supporting multiple datetime patterns is a valid one. The cleanest way to introduce SQL:2016 patterns would be to drop the existing pattern support (SimpleDateFormat in the case of Impala) and replace it
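
[For concreteness, a sketch contrasting the proposed SQL:2016-style FORMAT clause with the existing pattern style; the FORMAT form is illustrative only and was not implemented in Spark at the time of this thread.]

    // Proposed SQL:2016 syntax (would not parse in Spark as of this thread):
    //   SELECT CAST('2019-03-21' AS DATE FORMAT 'YYYY-MM-DD')
    // Existing SimpleDateFormat-pattern style for comparison:
    spark.sql("SELECT to_date('2019-03-21', 'yyyy-MM-dd')").show()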

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Marco Gaido
Thanks for this SPIP. I cannot comment on the docs, but I just wanted to highlight one thing. On page 5 of the SPIP, where we talk about DRA, I see: "For instance, if each executor consists 4 CPUs and 2 GPUs, and each task requires 1 CPU and 1GPU, then we shall throw an error on application start
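
[The arithmetic behind that quoted example, as a toy calculation: concurrency is bounded by the scarcest resource, so two of the four CPUs would sit idle.]

    // Quoted example: executors with 4 CPUs and 2 GPUs; tasks need 1 CPU + 1 GPU
    val cpuSlots   = 4 / 1                          // 4 tasks fit by CPU
    val gpuSlots   = 2 / 1                          // 2 tasks fit by GPU
    val concurrent = math.min(cpuSlots, gpuSlots)   // 2, so 2 CPUs sit idle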