[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-30 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647630#comment-14647630 ]

Liang-Chi Hsieh commented on SPARK-9347:


It will merge the different schemas if the Parquet schema-merging configuration
is enabled.
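
For reference, a minimal spark-shell sketch of enabling schema merging. The path is hypothetical, and the exact API surface varies across the 1.3/1.4-era releases discussed in this thread:

```scala
// Enable Parquet schema merging globally before loading (hypothetical path).
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
val df = sqlContext.parquetFile("hdfs:///data/events")

// On 1.4+ the same can be requested per read:
// sqlContext.read.option("mergeSchema", "true").parquet("hdfs:///data/events")
```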

 spark load of existing parquet files extremely slow if large number of files
 

 Key: SPARK-9347
 URL: https://issues.apache.org/jira/browse/SPARK-9347
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Samphel Norden

 When the Spark SQL shell is launched and we point it at a folder containing a 
 large number of Parquet files, the sqlContext.parquetFile() command takes a 
 very long time to load the tables. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-30 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647606#comment-14647606 ]

Samphel Norden commented on SPARK-9347:
---

One additional question: assuming the schema does evolve, and we have folder 1 
and folder 2, each with a different _common_metadata file representing that 
schema evolution, will Spark merge the two different _common_metadata files, or 
would this not work?




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-30 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647634#comment-14647634 ]

Samphel Norden commented on SPARK-9347:
---

I am trying to get Spark to look only at _common_metadata files for two 
different schemas.
But if the new option is turned on (respect.summarymetadata?), would it merge 
based on the different _common_metadata files, or would it have to be disabled 
so that we fall back to regular part-file merging?




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-30 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647673#comment-14647673 ]

Liang-Chi Hsieh commented on SPARK-9347:


Actually, the newly introduced configuration works only if the Parquet 
schema-merging configuration is enabled, so you need to turn both on.
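
As a sketch, with both settings turned on. The summary-files key below is an assumption based on the configuration proposed in SPARK-8838 / PR #7238; its exact name may differ in the merged version, and the path is hypothetical:

```scala
// Both must be enabled: the summary-file handling only takes effect
// when Parquet schema merging is on.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")
val df = sqlContext.parquetFile("hdfs:///data/events")
```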




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-29 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647126#comment-14647126 ]

Samphel Norden commented on SPARK-9347:
---

No, I haven't tried the latest. Assuming the distributed/parallel footer read is 
what's in the latest: (a) a distributed Spark job would still struggle to read 
the data, since there are tens of thousands of large Parquet files, and (b) 
correct me if I'm wrong, but as I understand HDFS, it is not really going to 
serve only the last block, which contains the Parquet footer; instead the part-
file will be transferred in its entirety to memory, which is another constraint 
to deal with.

The _common_metadata read should resolve the above issues comprehensively, if I 
understood SPARK-8838 correctly. The only thing I didn't get clarification on: 
given that I have a partitioned folder hierarchy, is it sufficient to place the 
_common_metadata file at the top level of the hierarchy and point Spark to load 
at the top level?




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-29 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647134#comment-14647134 ]

Liang-Chi Hsieh commented on SPARK-9347:


OK. The latest development is that we will skip all part-files when the 
configuration proposed in that PR is enabled. Thus only summary files will be 
read and merged to obtain the schema. I think that is much closer to what you 
expect and need.




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-29 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647141#comment-14647141 ]

Samphel Norden commented on SPARK-9347:
---

That will be ideal. 




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-29 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647136#comment-14647136 ]

Liang-Chi Hsieh commented on SPARK-9347:


Your concern should be addressed by the latest development of that PR: all 
part-files will now be skipped, so you only need to have the _common_metadata 
at the top-level path, as you said.




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-28 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644557#comment-14644557 ]

Liang-Chi Hsieh commented on SPARK-9347:


Besides the common metadata, I think there are also metadata files in the 
partition directories? If your table is not partitioned, then all part-files 
will be read.




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-28 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644583#comment-14644583 ]

Samphel Norden commented on SPARK-9347:
---

A _metadata file is also generated, but at the top level... nothing inside the 
partition directories.




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-28 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644518#comment-14644518 ]

Liang-Chi Hsieh commented on SPARK-9347:


Currently, as we discussed in the PR, we will read the summary files in each 
partition directory and skip the corresponding part-files. If there are no 
summary files in a partition directory, its part-files will be read.




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-28 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644571#comment-14644571 ]

Liang-Chi Hsieh commented on SPARK-9347:


Without any _metadata files in the partition directories?




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-28 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644525#comment-14644525 ]

Samphel Norden commented on SPARK-9347:
---

Understood. However, what I noticed is that the Parquet Hadoop job only outputs 
a _common_metadata file at the top-level directory. I assume that is the file 
that's read if I give the top-level directory as input to 
sqlContext.parquetFile(), i.e. it doesn't expect the _common_metadata to be 
present in the subdirectories. Or am I incorrect in that assumption?




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-28 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644558#comment-14644558 ]

Samphel Norden commented on SPARK-9347:
---

The table is partitioned using key=value folder names, e.g. 
root/parquet_date=20150715/parquet_hour_of_day=05/.
The _common_metadata is present at root/ but not in the subfolders.
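
For a layout like the one described, a load at the partition root would look roughly like this (a sketch; the root path is hypothetical):

```scala
// Point Spark at the partition root; parquet_date and parquet_hour_of_day are
// discovered as partition columns from the key=value directory names.
val df = sqlContext.parquetFile("hdfs:///root")
// Partition pruning then avoids touching irrelevant subfolders:
df.filter("parquet_date = 20150715 and parquet_hour_of_day = 5").count()
```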




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-26 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641911#comment-14641911 ]

Liang-Chi Hsieh commented on SPARK-9347:


I already tried to fix this in [this 
PR|https://github.com/apache/spark/pull/7238]. Once it is merged, I think it 
should improve the performance significantly.




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-26 Thread Liang-Chi Hsieh (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642030#comment-14642030 ]

Liang-Chi Hsieh commented on SPARK-9347:


I am not sure whether it will be backported to 1.3.1, but I guess it will not.

This fix will use the summary file as the schema for its part-files, assuming 
they have a consistent schema. So as long as the part-files in the same 
partition directory have a summary file, those part-files will not be loaded, 
and the Parquet file loading time can be reduced.

I am not sure whether this is what you want and whether it can benefit your use 
case.
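
If summary files are missing, they can be produced at write time; a sketch, assuming the parquet-mr Hadoop setting of that era (parquet.enable.summary-metadata) and a hypothetical output path:

```scala
// Ask parquet-mr to write _metadata / _common_metadata alongside the part-files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")
df.saveAsParquetFile("hdfs:///data/events_with_summaries")
```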




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-26 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642040#comment-14642040 ]

Samphel Norden commented on SPARK-9347:
---

It will certainly help. Is it sufficient to put the _metadata file at the 
top-level folder, or does every subfolder need the metadata? For example, I 
have a hierarchy root/parquet_date=/parquet_hour_of_day=/part*.snappy.parquet
Is it sufficient if _metadata files are present at root, or do they have to be 
in each parquet_hour_of_day= leaf folder?
Thanks




[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files

2015-07-26 Thread Samphel Norden (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642022#comment-14642022 ]

Samphel Norden commented on SPARK-9347:
---

Which branch will this be merged to? Would it be backported to 1.3.1? Thanks. 
And does the fix require the summary metadata to be generated along with the 
part-files, or can it just use a single part-file, assuming the schema is the 
same for all parts?
