[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Thu Kyaw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534582#comment-14534582
 ] 

Thu Kyaw edited comment on SPARK-3928 at 5/8/15 9:52 PM:
-

The new Parquet implementation does not support wildcards yet, but you 
can still use the old Parquet implementation to get wildcard support. 
Just turn off the SQL configuration spark.sql.parquet.useDataSourceApi 
(set it to false; it is true by default).
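
A minimal sketch of the setting described above, assuming a Spark 1.3.x 
spark-shell session where sqlContext is already defined. The path is a 
hypothetical placeholder, and whether the glob is honoured on the old code 
path rests on the comment above rather than on anything verified here.

```scala
// Assumption: spark-shell on Spark 1.3.x, so `sqlContext` already exists.
// Switch Parquet reads back to the old (pre data source API) implementation.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

// Read the value back to confirm the setting took effect.
println(sqlContext.getConf("spark.sql.parquet.useDataSourceApi"))

// Hypothetical path, for illustration only: a single glob passed as one path.
val df = sqlContext.parquetFile("/path/to/table/part-*")
```

The same call on a HiveContext should behave the same way, since HiveContext 
extends SQLContext.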


was (Author: tkyaw):
New parquet implementation does not contain wild card support yet, but you 
could still use old version parquet implementation to get wildcard support. 
Just turn off sql Configuration. spark.sql.parquet.useDataSourceApi ( to 
false by default it is true ).

 Support wildcard matches on Parquet files
 -

 Key: SPARK-3928
 URL: https://issues.apache.org/jira/browse/SPARK-3928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Nicholas Chammas
Assignee: Cheng Lian
Priority: Minor
 Fix For: 1.3.0


 {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
 {{2014-\?\?-\?\?}}. 
 It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Thu Kyaw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534582#comment-14534582
 ] 

Thu Kyaw edited comment on SPARK-3928 at 5/8/15 9:50 PM:
-

New parquet implementation does not contain wild card support yet, but you 
could still use old version parquet implementation to get wildcard support. 
Just turn off sql Configuration. spark.sql.parquet.useDataSourceApi ( to 
false by default it is true ).


was (Author: tkyaw):
Hello [~lian cheng] please let me know if you want me to work on adding back 
the glob support.



[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:16 PM:
--

Marius, are you saying that wildcards are not supported then? In my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works with the 
textFile method, btw), i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy; that's a huge list.
Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a 
HiveContext instance, not a DataFrame.

As a side note, I am trying to use this feature as a workaround for 
https://issues.apache.org/jira/browse/SPARK-6910. Michael A. suggested a 
workaround which takes way too long in our case, so I was hoping to be able to 
create a DF from a subset of partitions. But it would be a pain to build an 
explicit list. I would like to know for sure whether that's the deliberate 
design, though.
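
One way the explicit list could be built without typing it out is to let 
Hadoop expand the glob and hand the result to parquetFile. This is only a 
sketch, assuming a spark-shell session where sc (SparkContext) and hc 
(HiveContext) already exist; the glob is the one quoted in the comment above, 
and the final call is left as a comment because the varargs signature of 
parquetFile is not confirmed in this thread.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Assumption: spark-shell session, so `sc` already exists.
// The glob is copied as-is from the comment above.
val pattern = new Path("/r/warehouse/hive/pkey=-2015-04/*")

// Resolve the file system that owns the path, then let Hadoop expand the glob.
val fs = pattern.getFileSystem(sc.hadoopConfiguration)

// globStatus may return null when nothing matches, so guard against that.
val matches: Array[FileStatus] =
  Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])

// Fully qualified path strings, one per matching entry.
val paths: Seq[String] = matches.map(_.getPath.toString).toSeq

// Pass the expanded list to parquetFile. The exact varargs signature may
// differ between Spark releases, so check the SQLContext API docs first:
//   hc.parquetFile(paths.head, paths.tail: _*)  // if declared (path: String, paths: String*)
//   hc.parquetFile(paths: _*)                   // if declared (paths: String*)
```

The list is still enumerated, but by the file system rather than by hand, so 
the caller keeps passing a single pattern.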


was (Author: yanakad):
Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.

As a side note I am trying to use this feature as a workaround to 
https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a work 
around which takes way too long in our case -- I was hoping to be able to 
create a DF from a subset of partitions...But it would be a pain to build an 
explicit list. Would like to know for sure if that's the deliberate design 
though...




[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:48 PM:
--

I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1 [remainder of binary Parquet data omitted]

scala> hc.parquetFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet

{quote}


was (Author: yanakad):
I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1 [remainder of binary Parquet data omitted]

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet

{quote}




[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Marius Soutier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532663#comment-14532663
 ] 

Marius Soutier edited comment on SPARK-3928 at 5/7/15 1:54 PM:
---

DataFrames now expect varargs, i.e. 
df.parquetFile("/path/to/file/1", "/path/to/file/2").



was (Author: msoutier):
DataFrames now expect varagrs, i.e. 
df.parquetFile(/path/to/file/1,/path/to/file/2).





[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532680#comment-14532680
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:15 PM:
--

Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.

As a side note I am trying to use this feature as a workaround to 
https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a work 
around which takes way too long in our case -- I was hoping to be able to 
create a DF from a subset of partitions...But it would be a pain to build an 
explicit list. Would like to know for sure if that's the deliberate design 
though...


was (Author: yanakad):
Marius, are you saying that wildcards are not supported then? in my case, I 
would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ 
textFile method btw) -- i.e. pass a single path for all April 2015 partitions. 
Enumerating all paths underneath is pretty crazy, that's a huge list.
 Are you saying that is the only way? I thought the whole point of this bug is 
that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a 
HiveContext instance, not a dataframe.




[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-07 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532624#comment-14532624
 ] 

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:38 PM:
--

I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1 [remainder of binary Parquet data omitted]

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet

{quote}


was (Author: yanakad):
I am observing the same issue.

Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1 [remainder of binary Parquet data omitted]

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: 
hdfs://cdh4-21968-nn/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet

{quote}
