[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534582#comment-14534582 ]

Thu Kyaw edited comment on SPARK-3928 at 5/8/15 9:52 PM:
---------------------------------------------------------

The new Parquet implementation does not support wildcards yet, but you can still use the old Parquet implementation to get wildcard support. Just turn off the SQL configuration spark.sql.parquet.useDataSourceApi (set it to false; it is true by default).

was (Author: tkyaw):
New parquet implementation does not contain wild card support yet, but you could still use old version parquet implementation to get wildcard support. Just turn off sql Configuration. spark.sql.parquet.useDataSourceApi ( to false by default it is true ).

Support wildcard matches on Parquet files
-----------------------------------------

Key: SPARK-3928
URL: https://issues.apache.org/jira/browse/SPARK-3928
Project: Spark
Issue Type: Improvement
Components: Spark Core, SQL
Reporter: Nicholas Chammas
Assignee: Cheng Lian
Priority: Minor
Fix For: 1.3.0

{{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
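The workaround in Thu Kyaw's comment above could look roughly like the following. This is only a sketch under stated assumptions: it assumes a Spark 1.3.x application submitted with spark-submit, the object name and the path are hypothetical placeholders, and whether the glob actually resolves depends on the old Parquet code path behaving as the comment describes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical application name; assumes Spark 1.3.x on the classpath.
object LegacyParquetGlob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("legacy-parquet-glob"))
    val sqlContext = new SQLContext(sc)

    // Fall back to the old Parquet implementation, as suggested in the comment.
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

    // Hypothetical path; substitute a glob that matches your own files.
    val df = sqlContext.parquetFile("/path/to/table/part-*")
    df.printSchema()

    sc.stop()
  }
}
```

In an interactive spark-shell session, only the setConf and parquetFile lines should be needed, since a context is already provided there.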
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534582#comment-14534582 ]

Thu Kyaw edited comment on SPARK-3928 at 5/8/15 9:50 PM:
---------------------------------------------------------

The new Parquet implementation does not support wildcards yet, but you can still use the old Parquet implementation to get wildcard support. Just set the SQL configuration spark.sql.parquet.useDataSourceApi to false (it is true by default).

was (Author: tkyaw):
Hello [~lian cheng] please let me know if you want me to work on adding back the glob support.
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532680#comment-14532680 ]

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:16 PM:
--------------------------------------------------------------

Marius, are you saying that wildcards are not supported, then? In my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works with the textFile method, by the way) -- i.e. pass a single path for all April 2015 partitions. Enumerating all the paths underneath is pretty crazy; that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a HiveContext instance, not a DataFrame.

As a side note, I am trying to use this feature as a workaround for https://issues.apache.org/jira/browse/SPARK-6910 -- Michael A. suggested a workaround which takes way too long in our case -- I was hoping to be able to create a DF from a subset of partitions... But it would be a pain to build an explicit list. Would like to know for sure if that's the deliberate design, though...

was (Author: yanakad):
Marius, are you saying that wildcards are not supported then? in my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ textFile method btw) -- i.e. pass a single path for all April 2015 partitions. Enumerating all paths underneath is pretty crazy, that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a HiveContext instance, not a dataframe.

As a side note I am trying to use this feature as a workaround to https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a work around which takes way too long in our case -- I was hoping to be able to create a DF from a subset of partitions...But it would be a pain to build an explicit list. Would like to know for sure if that's the deliberate design though...
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532624#comment-14532624 ]

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:48 PM:
--------------------------------------------------------------

I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1? L??? ?p??? [remaining raw Parquet bytes omitted]

scala> hc.parquetFile("/r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: /r/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}

was (Author: yanakad):
I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1? L??? ?p??? [remaining raw Parquet bytes omitted]

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: /rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532663#comment-14532663 ]

Marius Soutier edited comment on SPARK-3928 at 5/7/15 1:54 PM:
---------------------------------------------------------------

DataFrames now expect varargs, i.e. df.parquetFile("/path/to/file/1", "/path/to/file/2").

was (Author: msoutier):
DataFrames now expect varagrs, i.e. df.parquetFile("/path/to/file/1", "/path/to/file/2").
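Given the varargs form Marius describes, one possible way to avoid enumerating paths by hand is to expand the glob yourself and pass the matches along. The sketch below is not from the thread: expandGlob and parquetGlob are hypothetical helper names, and it assumes the Spark 1.3.1 signature parquetFile(paths: String*) together with Hadoop's FileSystem.globStatus for the expansion.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper, not part of Spark: expands a glob pattern into
// concrete paths using Hadoop's FileSystem.globStatus.
def expandGlob(sc: SparkContext, pattern: String): Seq[String] = {
  val globPath = new Path(pattern)
  // Resolve the file system from the path so that scheme prefixes are honored.
  val fs = globPath.getFileSystem(sc.hadoopConfiguration)
  // globStatus may return null when nothing matches, so wrap it in Option.
  Option(fs.globStatus(globPath))
    .map(_.toSeq.map(_.getPath.toString))
    .getOrElse(Seq.empty)
}

// Hypothetical helper: loads every match of the pattern as one DataFrame.
// Assumes the Spark 1.3.1 varargs signature parquetFile(paths: String*).
def parquetGlob(sqlContext: SQLContext, pattern: String): DataFrame = {
  val paths = expandGlob(sqlContext.sparkContext, pattern)
  require(paths.nonEmpty, s"No files matched $pattern")
  sqlContext.parquetFile(paths: _*)
}
```

Since HiveContext extends SQLContext, a call such as parquetGlob(hc, pattern) should also type-check for an hc like the one in the report. Whether this performs acceptably for a very large number of matches is untested here.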
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532680#comment-14532680 ]

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 2:15 PM:
--------------------------------------------------------------

Marius, are you saying that wildcards are not supported, then? In my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works with the textFile method, by the way) -- i.e. pass a single path for all April 2015 partitions. Enumerating all the paths underneath is pretty crazy; that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also, in my case hc is a HiveContext instance, not a DataFrame.

As a side note, I am trying to use this feature as a workaround for https://issues.apache.org/jira/browse/SPARK-6910 -- Mike A. suggested a workaround which takes way too long in our case -- I was hoping to be able to create a DF from a subset of partitions... But it would be a pain to build an explicit list. Would like to know for sure if that's the deliberate design, though...

was (Author: yanakad):
Marius, are you saying that wildcards are not supported then? in my case, I would really like to do /r/warehouse/hive/pkey=-2015-04/* (which works w/ textFile method btw) -- i.e. pass a single path for all April 2015 partitions. Enumerating all paths underneath is pretty crazy, that's a huge list. Are you saying that is the only way? I thought the whole point of this bug is that we _don't_ have to enumerate the paths explicitly. Also in my case hc is a HiveContext instance, not a dataframe.
[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532624#comment-14532624 ]

Yana Kadiyska edited comment on SPARK-3928 at 5/7/15 1:38 PM:
--------------------------------------------------------------

I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1? L??? ?p??? [remaining raw Parquet bytes omitted]

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: /rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}

was (Author: yanakad):
I am observing the same issue. Downloaded a pre-built CDH4 1.3.1 distro.

{quote}
scala> sc.textFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet").first
res0: String = PAR1? L??? ?p??? [remaining raw Parquet bytes omitted]

scala> hc.parquetFile("/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet")
java.io.FileNotFoundException: File does not exist: hdfs://cdh4-21968-nn/rum/warehouse/hive/pkey=-2015-04/-2015-04-serialno-750/*.parquet
{quote}