[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-31 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache holds the FileStatus of every file, regardless of whether 
you specify partition columns. This causes long load times for queries that use 
only a couple of partitions, because Spark loads file paths from all partitions.

I partitioned the files by *report_date* and *type*, and I have a directory 
structure like this:
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

My query only needs to load the files of type *A*, which is just a couple of 
files. But Spark loads all 19K files from all partitions into 
SharedInMemoryCache, which takes about 60 seconds, and only after that discards 
the unused partitions.

 

This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *report_date* and *type* and i have directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache holds the FileStatus of every file, regardless of whether 
> you specify partition columns. This causes long load times for queries that 
> use only a couple of partitions, because Spark loads file paths from all 
> partitions.
> I partitioned the files by *report_date* and *type*, and I have a directory 
> structure like this:
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> My query only needs to load the files of type *A*, which is just a couple of 
> files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 seconds, and only after that 
> discards the unused partitions.
>  
> This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] 
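
A possible workaround, sketched under assumptions and not confirmed anywhere in 
this issue: point the reader directly at the one partition directory the query 
needs, and pass the table root through Spark's basePath option so that 
report_date and type are still discovered as partition columns. Listing should 
then be limited to that directory instead of every partition under 
report_date=2018-07-24. The object name PartitionPathRead and the helper 
partitionDir are hypothetical names introduced for illustration; the paths are 
the ones from the description. The sketch assumes it is launched through 
spark-submit, which supplies the master.

```scala
import org.apache.spark.sql.SparkSession

object PartitionPathRead {
  // Hypothetical helper: builds a Hive-style partition directory path
  // such as /custom_path/report_date=2018-07-24/type=A.
  def partitionDir(root: String, parts: Seq[(String, String)]): String =
    parts.foldLeft(root.stripSuffix("/")) { case (path, (key, value)) =>
      s"$path/$key=$value"
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-24974-workaround-sketch")
      .getOrCreate()

    val root = "/custom_path"
    val dir = partitionDir(root, Seq("report_date" -> "2018-07-24", "type" -> "A"))

    // basePath tells partition discovery where the table starts, so the
    // partition columns are derived from the directories below it even
    // though only one leaf directory is passed to parquet().
    val count = spark.read
      .option("basePath", root)
      .parquet(dir)
      .count()

    println(s"rows under $dir: $count")
    spark.stop()
  }
}
```

Because the type=A directory is selected by path, the filter on type is no 
longer needed for this particular count; it can be kept if the same code has to 
read several partition directories at once.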



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *report_date* and *type* and i have directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1.parquet
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *report_date* and *type* and i have directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by *report_date* and *type* and i have directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type *A* and it is just a couple of 
> files. But spark load all 19K of files from all partitions into 
> SharedInMemoryCache which takes about 60 secs and only after that throws 
> unused partitions. 






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by *type* and i has directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type *A* and it is just a couple of 
> files. But spark load all 19K of files from all partitions into 
> SharedInMemoryCache which takes about 60 secs and only after that throws 
> unused partitions. 






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *report_date* and *type* and i have directory structure 
like 
{code:java}
/custom_path/report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1
{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count
{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by *report_date* and *type* and i have directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type *A* and it is just a couple of 
> files. But spark load all 19K of files from all partitions into 
> SharedInMemoryCache which takes about 60 secs and only after that throws 
> unused partitions. 






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions. 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by *type* and i has directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type *A* and it is just a couple of 
> files. But spark load all 19K of files from all partitions into 
> SharedInMemoryCache which takes about 60 secs and only after that throws 
> unused partitions. 






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type *A* and it is just a couple of 
files. But spark load all 19K of files from all partitions into 
SharedInMemoryCache which takes about 60 secs and only after that throws unused 
partitions.

 

 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by *type* and i has directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type *A* and it is just a couple of 
> files. But spark load all 19K of files from all partitions into 
> SharedInMemoryCache which takes about 60 secs and only after that throws 
> unused partitions.
>  
>  






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by type and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by type and i has directory structure like 

{code}

{{report_date=2018-07-24/type=A/file_1}}

{code}

 

I am trying to execute 

{code}

{{val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count}}

{code}

 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by type and i has directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type A and it is just couple of 
> files. But spark load all 19K of files into SharedInMemoryCache which takes 
> about 60 secs and only after that throws unused partitions.
>  
>  






[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.

2018-07-30 Thread andrzej.stankev...@gmail.com (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

andrzej.stankev...@gmail.com updated SPARK-24974:
-
Description: 
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by *type* and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 

  was:
SharedInMemoryCache has all  filestatus no matter whether you specify partition 
columns or not. It causes long load time for queries that use only couple 
partitions because Spark loads file's paths for files from all partitions.

I partitioned files by type and i has directory structure like 
{code:java}
report_date=2018-07-24/type=A/file_1

{code}
 

I am trying to execute 
{code:java}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
"type == 'A'").count

{code}
 

In my query i need to load only files of type A and it is just couple of files. 
But spark load all 19K of files into SharedInMemoryCache which takes about 60 
secs and only after that throws unused partitions.

 

 


> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> ---
>
> Key: SPARK-24974
> URL: https://issues.apache.org/jira/browse/SPARK-24974
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: andrzej.stankev...@gmail.com
> Priority: Major
>
> SharedInMemoryCache has all  filestatus no matter whether you specify 
> partition columns or not. It causes long load time for queries that use only 
> couple partitions because Spark loads file's paths for files from all 
> partitions.
> I partitioned files by *type* and i has directory structure like 
> {code:java}
> report_date=2018-07-24/type=A/file_1
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( 
> "type == 'A'").count
> {code}
>  
> In my query i need to load only files of type A and it is just couple of 
> files. But spark load all 19K of files into SharedInMemoryCache which takes 
> about 60 secs and only after that throws unused partitions.
>  
>  


