Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
We have built & compiled Spark 3.3 with parquet 1.10. > Regards, Pralabh Kumar > On Mon, Jul 24, 2023 at 3:04 PM Mich Talebzadeh wrote: >> Hi, >> Where is this limitation coming from (using 1.1.0)? That is a 2018 build. >> Have you tri

[Spark Core] Does Spark support parquet predicate pushdown for big lists?

2021-12-16 Thread Amin Borjian
Hello all, We use Apache Spark 3.2.0 and our data is stored on Apache Hadoop in parquet format. One of the advantages of the parquet format is the presence of the predicate pushdown filter feature, which allows only the necessary data to be read. This feature is well provided by Spark. For
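A minimal sketch of the kind of read being discussed, assuming Spark 3.2 with a running SparkSession `spark`; the path, column name, and list contents are hypothetical. For very large IN-lists, the handling of the pushed filter is governed by spark.sql.parquet.pushdown.inFilterThreshold, which is worth checking when the list is big.

    import org.apache.spark.sql.functions.col
    val ids: Seq[Long] = (1L to 5000L)                   // stand-in for a big list
    val df = spark.read.parquet("hdfs:///data/events")   // hypothetical path
      .filter(col("id").isin(ids: _*))
    df.explain(true)                                     // "PushedFilters" shows what reached the Parquet reader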

Spark read parquet with unnamed index

2018-04-20 Thread Lord, Jesse
When reading a parquet created from a pandas dataframe with an unnamed index spark creates a column named “__index_level_0__” since spark DataFrames do not support row indexing. This looks like it is probably a bug to me, since as a spark user I would expect unnamed index columns to be dropped
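A minimal sketch of a workaround (file path hypothetical): the pandas index column can simply be dropped after the read.

    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder.getOrCreate()
    val df = spark.read.parquet("hdfs:///data/from_pandas.parquet")
    val cleaned = df.drop("__index_level_0__")   // no-op if the column is not present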

Re: Spark querying parquet data partitioned in S3

2017-07-05 Thread Steve Loughran
> On 29 Jun 2017, at 17:44, fran wrote: > > We have got data stored in S3 partitioned by several columns. Let's say > following this hierarchy: > s3://bucket/data/column1=X/column2=Y/parquet-files > > We run a Spark job in a EMR cluster (1 master,3 slaves) and

Spark querying parquet data partitioned in S3

2017-06-30 Thread Francisco Blaya
We have got data stored in S3 partitioned by several columns. Let's say following this hierarchy: s3://bucket/data/column1=X/column2=Y/parquet-files We run a Spark job in a EMR cluster (1 master,3 slaves) and realised the following: A) - When we declare the initial dataframe to be the whole
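A minimal sketch of partition pruning for such a layout (bucket and column names hypothetical): filtering on the partition columns should let Spark prune S3 prefixes rather than scanning everything under the root path.

    import org.apache.spark.sql.functions.col
    val df = spark.read.parquet("s3://bucket/data")
    val subset = df.filter(col("column1") === "X" && col("column2") === "Y")
    subset.explain(true)   // "PartitionFilters" in the plan confirms the pruning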

Spark querying parquet data partitioned in S3

2017-06-29 Thread fran
The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-querying-parquet-data-partitioned-in-S3-tp28809.html Sent from the Apache Spark User List mailing list archive at

How Spark determines Parquet partition size

2016-11-08 Thread Selvam Raman
Hi, Can you please tell me how parquet partitions the data while saving the dataframe. I have a dataframe which contains 10 values like the following (flattened df.show() output): field_num = 139, 140, 40, 41, 148, 149, …
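One common explanation, shown as a minimal sketch (paths hypothetical, `df` being the DataFrame in question): for a non-partitioned write, the number of Parquet part-files mirrors the number of partitions of the DataFrame at write time, so repartition/coalesce before the write controls it.

    df.rdd.getNumPartitions                              // how many part-files a plain write would produce
    df.coalesce(1).write.parquet("hdfs:///out/single")   // one part-file
    df.repartition(4).write.parquet("hdfs:///out/four")  // four part-files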

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
Yes, I just tested it against the nightly build from 8/31. Looking at the PR, I'm happy the test added verifies my issue. Thanks. -Don On Wed, Aug 31, 2016 at 6:49 PM, Hyukjin Kwon wrote: > Hi Don, I guess this should be fixed from 2.0.1. > > Please refer this PR.

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon
Hi Don, I guess this should be fixed from 2.0.1. Please refer this PR. https://github.com/apache/spark/pull/14339 On 1 Sep 2016 2:48 a.m., "Don Drake" wrote: > I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark > 2.0 and have encountered some

Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0 and have encountered some interesting issues. First, it seems the SQL parsing is different, and I had to rewrite some SQL that was doing a mix of inner joins (using where syntax, not inner) and outer joins to get the SQL
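A minimal sketch of the dotted-column case (column name hypothetical, assuming a spark-shell session): names containing "." have to be quoted with backticks, otherwise Spark parses the dot as struct-field access.

    import spark.implicits._
    import org.apache.spark.sql.functions.col
    val df = Seq((1, "a")).toDF("id", "col.with.dots")
    df.select("`col.with.dots`").show()
    df.filter(col("`col.with.dots`") === "a").show()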

Re: Spark with Parquet

2016-08-23 Thread Mohit Jaggi
something like this should work…. val df = sparkSession.read.csv("myfile.csv") // you may have to provide a schema if the guessed schema is not accurate df.write.parquet("myfile.parquet") Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Apr 27, 2014, at 11:41 PM, Sai

Re: Spark with Parquet

2016-08-22 Thread shamu
) you can use this file path for your processing... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-Parquet-tp4923p27584.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-15 Thread Cheng Lian
by Hive 1.0.0 . It is because Hive 1.0.0 uses older parquet version than Spark 1.6.1 which is using 1.7.0 parquet . Is it possible that we can use older parquet version in Spark or newer parquet version in Hive ? I have tried adding parquet-hive-bundle : 1.7.0 to Hive but while reading it throws

Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-13 Thread mayankshete
Hello Team, I am facing an issue where output files generated by Spark 1.6.1 are not read by Hive 1.0.0. It is because Hive 1.0.0 uses an older parquet version than Spark 1.6.1, which is using parquet 1.7.0. Is it possible that we can use an older parquet version in Spark or a newer parquet version
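One possible mitigation to try, offered as an assumption rather than the thread's resolution: Spark 1.6 can write Parquet in the older "legacy" layout that pre-1.6 consumers expect for decimals and nested types, which sometimes helps older Hive readers (it does not change the Parquet library version itself). Path hypothetical.

    sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.parquet("hdfs:///warehouse/mytable_legacy")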

Re: Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Michael Armbrust
> > Each json file is of a single object and has the potential to have > variance in the schema. > How much variance are we talking? JSON->Parquet is going to do well with 100s of different columns, but at 10,000s many things will probably start breaking.

Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Anthony Andras
Hello there, I am trying to write a program in Spark that is attempting to load multiple json files (with undefined schemas) into a dataframe and then write it out to a parquet file. When doing so, I am running into a number of garbage collection issues as a result of my JVM running out of heap
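A minimal sketch (paths and schema hypothetical): supplying an explicit schema skips the full inference pass over every JSON file, which is often a large part of the memory pressure when the inputs are big and heterogeneous.

    import org.apache.spark.sql.types._
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("payload", StringType)))
    val df = sqlContext.read.schema(schema).json("hdfs:///in/json/*.json")
    df.write.parquet("hdfs:///out/parquet")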

SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Sadhan Sood
I noticed when querying struct data in spark sql, we are requesting the whole column from parquet files. Is this intended or is there some kind of config to control this behaviour? Wouldn't it be better to request just the struct field?

Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Michael Armbrust
Yeah, this is unfortunate. It would be good to fix this, but its a non-trivial change. Tracked here if you'd like to vote on the issue: https://issues.apache.org/jira/browse/SPARK-4502 On Thu, Oct 29, 2015 at 6:00 PM, Sadhan Sood wrote: > I noticed when querying struct
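For readers on much later versions: SPARK-4502 was eventually addressed by nested schema pruning. A minimal sketch, assuming Spark 2.4+ and a hypothetical nested column:

    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
    val names = spark.read.parquet("hdfs:///data/people").select("person.name")
    names.explain(true)   // ReadSchema should list only the requested struct field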

Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Sadhan Sood
Thanks Michael, I will upvote this. On Thu, Oct 29, 2015 at 10:29 AM, Michael Armbrust wrote: > Yeah, this is unfortunate. It would be good to fix this, but its a > non-trivial change. > > Tracked here if you'd like to vote on the issue: >

Spark 1.3 + Parquet: Skipping data using statistics

2015-08-12 Thread YaoPau
a Parquet file using Spark, Spark will then have access to min/max vals for each column. If a query asks for a value outside that range (like a timestamp), Spark will know to skip that file entirely. Michael says this feature is turned off by default in 1.3. How can I turn this on? I don't see much
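A minimal sketch of turning the feature on with the 1.3-era API (path and predicate hypothetical); the relevant flag is spark.sql.parquet.filterPushdown, which was disabled by default at the time.

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // or at submit time: --conf spark.sql.parquet.filterPushdown=true
    val recent = sqlContext.parquetFile("hdfs:///events").filter("ts > 1438387200")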

Re: Schema change on Spark Hive (Parquet file format) table not working

2015-08-08 Thread sim
issue for it. Let's vote for these issues and get them resolved. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Schema-change-on-Spark-Hive-Parquet-file-format-table-not-working-tp15360p24180.html Sent from the Apache Spark User List mailing list archive

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
Yeah, the benefit of `saveAsTable` is that you don't need to deal with schema explicitly, while the benefit of ALTER TABLE is you still have a standard vanilla Hive table. Cheng On 7/22/15 11:00 PM, Dean Wampler wrote: While it's not recommended to overwrite files Hive thinks it understands,

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
Since Hive doesn’t support schema evolution, you’ll have to update the schema stored in metastore somehow. For example, you can create a new external table with the merged schema. Say you have a Hive table t1: CREATE TABLE t1 (c0 INT, c1 DOUBLE); By default, this table is stored in HDFS

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Dean Wampler
While it's not recommended to overwrite files Hive thinks it understands, you can add the column to Hive's metastore with an ALTER TABLE command, either in the Hive shell or via HiveContext.sql(): ALTER TABLE mytable ADD COLUMNS (col_name data_type) See
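A minimal sketch of that approach through HiveContext (table and column names hypothetical):

    hiveContext.sql("ALTER TABLE mytable ADD COLUMNS (new_col STRING)")
    // New files written with the extra column can then be queried alongside old ones,
    // subject to the Parquet/Hive caveats discussed in this thread.
    hiveContext.sql("SELECT new_col FROM mytable").show()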

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Jerrick Hoang
Hi Lian, Sorry, I'm new to Spark so I did not express myself very clearly. I'm concerned about the situation when, let's say, I have a Parquet table with some partitions and I add a new column A to the parquet schema and write some data with the new schema to a new partition in the table. If I'm not

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Cheng Lian
Hey Jerrick, What do you mean by schema evolution with Hive metastore tables? Hive doesn't take schema evolution into account. Could you please give a concrete use case? Are you trying to write Parquet data with extra columns into an existing metastore Parquet table? Cheng On 7/21/15 1:04

Re: Spark-hive parquet schema evolution

2015-07-20 Thread Jerrick Hoang
I'm new to Spark, any ideas would be much appreciated! Thanks On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm aware of the support for schema evolution via DataFrame API. Just wondering what would be the best way to go about dealing with schema

Spark-hive parquet schema evolution

2015-07-18 Thread Jerrick Hoang
Hi all, I'm aware of the support for schema evolution via DataFrame API. Just wondering what would be the best way to go about dealing with schema evolution with Hive metastore tables. So, say I create a table via SparkSQL CLI, how would I deal with Parquet schema evolution? Thanks, J
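A minimal sketch of the DataFrame-side schema merging referred to above (path hypothetical); the Hive-metastore side of the question is what the rest of the thread is about.

    val merged = sqlContext.read.option("mergeSchema", "true").parquet("hdfs:///tables/t1")
    merged.printSchema()   // union of the schemas found across the table's files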

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-13 Thread Michael Armbrust
Here is the stack trace. The first part shows the log when the session is started in Tableau. It is using the init sql option on the data connection to create the TEMPORARY table myNodeTable. Ah, I see. Thanks for providing the error. The problem here is that temporary tables do not exist in

Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Todd Nist
In 1.2.1 I was persisting a set of parquet files as a table for use by spark-sql cli later on. There was a post here http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-tt16297.html#a16311 by Michael Armbrust that provided a nice little helper method for dealing

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Michael Armbrust
Hey Todd, In migrating to 1.3.x I see that the spark.sql.hive.convertMetastoreParquet is no longer public, so the above no longer works. This was probably just a typo, but to be clear, spark.sql.hive.convertMetastoreParquet is still a supported option and should work. You are correct that
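A minimal sketch of using the option (still honoured, as noted above), e.g. to fall back to the Hive SerDe instead of Spark SQL's native Parquet support for metastore tables:

    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
    // or in SQL: SET spark.sql.hive.convertMetastoreParquet=false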

Re: Spark SQL Parquet - data are reading very very slow

2015-02-01 Thread Mick Davies
Dictionary encoding of Strings from Parquet now added and will be in 1.3. This should reduce UTF to String decoding significantly https://issues.apache.org/jira/browse/SPARK-5309 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data

Re: Spark SQL Parquet - data are reading very very slow

2015-01-19 Thread Mick Davies
Added a JIRA to track https://issues.apache.org/jira/browse/SPARK-5309 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21229.html Sent from the Apache Spark User List mailing list archive

Re: Spark SQL Parquet - data are reading very very slow

2015-01-16 Thread Mick Davies
Binary-String conversion per column and this led to a 25% performance improvement for the queries I used. Admittedly this was over a data set with lots of runs of the same Strings in the queried columns. I haven't looked at the code to write Parquet files in Spark but I imagine similar duplicate String

Re: Spark SQL Parquet - data are reading very very slow

2015-01-12 Thread kaushal
Yes, I am also facing the same problem. Please, can anyone help to get faster execution? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21100.html Sent from the Apache Spark User List mailing list

Re: Schema change on Spark Hive (Parquet file format) table not working

2014-09-30 Thread barge.nilesh
); // Not able to read existing data and ArrayIndexOutOfBoundsException is thrown. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Schema-change-on-Spark-Hive-Parquet-file-format-table-not-working-tp15360p15415.html Sent from the Apache Spark User List mailing list

Schema change on Spark Hive (Parquet file format) table not working

2014-09-29 Thread barge.nilesh
is being thrown here } Any pointers towards the root cause, solution, or workaround are appreciated -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Schema-change-on-Spark-Hive-Parquet-file-format-table-not-working-tp15360.html Sent from

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-25 Thread Brandon Amos

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-23 Thread Brandon Amos

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Matei Zaharia

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
://github.com/adobe-research/spindle, and we welcome any feedback. Thanks! Regards, Brandon. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html Sent from

Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-15 Thread Brandon Amos
welcome any feedback. Thanks! Regards, Brandon. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html Sent from the Apache Spark User List mailing list

Re: Spark SQL, Parquet and Impala

2014-08-02 Thread Patrick McGloin
Hi Michael, Thanks for your reply. Is this the correct way to load data from Spark into Parquet? Somehow it doesn't feel right. When we followed the steps described for storing the data into Hive tables everything was smooth, we used HiveContext and the table is automatically recognised

Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala. We've tried to come up with a solution and it is working but it doesn't seem good. So I was wondering if you guys could tell us what is the correct way to do this. We are using Spark 1.0

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Sorry, sent early, wasn't finished typing. CREATE EXTERNAL TABLE Then we can select the data using Impala. But this is registered as an external table and must be refreshed if new data is inserted. Obviously this doesn't seem good and doesn't seem like the correct solution. How should we
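A minimal sketch of the flow being described, for the Spark 1.0-era API (paths, table and file names hypothetical, and the Impala DDL is an assumption about their setup rather than the thread's final answer):

    schemaRDD.saveAsParquetFile("hdfs:///warehouse/events_parquet")
    // Then, in impala-shell (not Spark):
    //   CREATE EXTERNAL TABLE events
    //     LIKE PARQUET 'hdfs:///warehouse/events_parquet/part-r-00000.parquet'
    //     STORED AS PARQUET
    //     LOCATION 'hdfs:///warehouse/events_parquet';
    //   REFRESH events;   -- needed again whenever new files are added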

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Michael Armbrust
So is the only issue that impala does not see changes until you refresh the table? This sounds like a configuration that needs to be changed on the impala side. On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin mcgloin.patr...@gmail.com wrote: Sorry, sent early, wasn't finished typing. CREATE

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Michael Armbrust
Not a stupid question! I would like to be able to do this. For now, you might try writing the data to tachyon http://tachyon-project.org/ instead of HDFS. This is untested though, please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Marek Wiewiorka
I was also thinking of using tachyon to store parquet files - maybe tomorrow I will give a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com: Not a stupid question! I would like to be able to do this. For now, you might try writing the data to tachyon

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
Is there a way to start tachyon on top of a yarn cluster? On Jun 7, 2014 2:11 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: I was also thinking of using tachyon to store parquet files - maybe tomorrow I will give a try as well. 2014-06-07 20:01 GMT+02:00 Michael Armbrust

cache spark sql parquet file in memory?

2014-06-06 Thread Xu (Simon) Chen
This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, and therefore making spark sql queries faster. Thanks. -Simon
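A minimal sketch of the usual alternative (path and table name hypothetical): register the Parquet data as a table and cache it in Spark SQL's in-memory columnar store, rather than keeping Parquet files themselves in memory.

    val data = sqlContext.parquetFile("hdfs:///out/intermediate.parquet")
    data.registerTempTable("intermediate")       // registerAsTable on the oldest 1.0.x releases
    sqlContext.cacheTable("intermediate")
    sqlContext.sql("SELECT count(*) FROM intermediate").collect()   // first access materialises the cache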

Spark with Parquet

2014-04-28 Thread Sai Prasanna
Hi All, I want to store a csv-text file in Parquet format in HDFS and then do some processing in Spark. Somehow my search to find the way to do was futile. More help was available for parquet with impala. Any guidance here? Thanks !!
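A minimal sketch for the 1.x-era API (column names, types and paths hypothetical): parse the CSV into an RDD of a case class, convert it to a SchemaRDD, and save as Parquet.

    case class Record(name: String, value: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    val records = sc.textFile("hdfs:///data/myfile.csv")
      .map(_.split(","))
      .map(p => Record(p(0), p(1).trim.toInt))
    records.saveAsParquetFile("hdfs:///data/myfile.parquet")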