Re: cache table vs. parquet table performance

2019-04-17 Thread Bin Fan
Hi Tomas, one option is to cache your table as Parquet files in Alluxio (which can serve as an in-memory distributed caching layer for Spark in your case). The code in Spark would look like: > df.write.parquet("alluxio://master:19998/data.parquet") > df = >
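For context, a minimal sketch of that approach in Scala, assuming an Alluxio master reachable at master:19998 and the Alluxio client on the Spark classpath (paths and the table name are illustrative, not from the original mail):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("alluxio-parquet-cache").getOrCreate()

    // Write the hot table as Parquet into Alluxio, which keeps it in cluster
    // memory and serves subsequent reads from that cache.
    val events = spark.table("event_jan_01")
    events.write.mode("overwrite").parquet("alluxio://master:19998/data.parquet")

    // Later reads go against the Alluxio copy instead of the slower backing store.
    val cached = spark.read.parquet("alluxio://master:19998/data.parquet")
    cached.createOrReplaceTempView("event_jan_01_cached")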

Re: cache table vs. parquet table performance

2019-01-16 Thread Jörn Franke
I believe the in-memory solution misses the storage indexes that Parquet / ORC have. The in-memory solution is more suitable if you iterate over the whole set of data frequently. > On 15.01.2019 at 19:20, Tomas Bartalos wrote: > > Hello, > > I'm using spark-thrift server and I'm searching
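To illustrate the trade-off Jörn describes, a minimal Scala sketch (the path and column name are assumptions, not from the thread): reading Parquet directly lets Spark push selective filters down to the scan and skip row groups via their min/max statistics, while a cached table is scanned from Spark's in-memory columnar store.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("pushdown-vs-cache").getOrCreate()

    // Querying Parquet directly: the filter is pushed down to the scan, so row
    // groups whose min/max statistics exclude the predicate are skipped on disk.
    val events = spark.read.parquet("/data/events.parquet")
    events.filter(col("event_date") === "2019-01-01").explain()
    // the physical plan lists the predicate under PushedFilters

    // After cache(), queries scan the in-memory columnar partitions instead;
    // that wins when most of the data is touched repeatedly, but a selective
    // filter no longer benefits from Parquet's storage-level statistics.
    events.cache()
    events.filter(col("event_date") === "2019-01-01").explain()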

Re: cache table vs. parquet table performance

2019-01-16 Thread Todd Nist
Hi Tomas, Have you considered using something like https://www.alluxio.org/ for your cache? It seems like a possible solution for what you're trying to do. -Todd On Tue, Jan 15, 2019 at 11:24 PM 大啊 wrote: > Hi, Tomas. > Thanks for your question, it gave me some prompt. But the best way to use cache >

cache table vs. parquet table performance

2019-01-15 Thread Tomas Bartalos
Hello, I'm using spark-thrift server and I'm searching for the best performing solution to query a hot set of data. I'm processing records with a nested structure, containing subtypes and arrays; one record takes up several KB. I tried to make some improvements with cache table: cache table event_jan_01
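For reference, a minimal sketch of the CACHE TABLE approach in Scala; the source table, the filter column, and the CACHE TABLE ... AS SELECT form are assumptions beyond the event_jan_01 name above. The same statements can also be issued through the Thrift server, e.g. via beeline.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cache-table-example")
      .enableHiveSupport()
      .getOrCreate()

    // Materialize the hot slice into Spark's in-memory columnar cache.
    spark.sql(
      "CACHE TABLE event_jan_01 AS SELECT * FROM events WHERE event_date = '2019-01-01'")

    // Subsequent queries hit the cached columnar representation.
    spark.sql("SELECT count(*) FROM event_jan_01").show()

    // Release the memory when the hot set changes.
    spark.sql("UNCACHE TABLE event_jan_01")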