[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034624#comment-15034624 ]

swetha k commented on SPARK-5968:
---

[~lian cheng] Following are the dependencies and versions that I am using. I want to know whether using a different version would help fix this. I see this error in my Spark batch job when I save the Parquet files to HDFS.

{code}
<properties>
  <sparkVersion>1.5.2</sparkVersion>
  <avro.version>1.7.7</avro.version>
  <!-- a third property with value 1.4.3; its name was lost in formatting -->
</properties>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${sparkVersion}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
</dependency>
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0rc7</version>
</dependency>
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.6.0rc7</version>
</dependency>
{code}

> Parquet warning in spark-shell
> ------------------------------
>
> Key: SPARK-5968
> URL: https://issues.apache.org/jira/browse/SPARK-5968
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Critical
> Fix For: 1.3.0
>
> This may happen in the case of schema evolution, namely appending new Parquet
> data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file
> for rankings
> parquet.io.ParquetEncodingException:
> file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet
> invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value metadata
> differ. Parquet doesn't know how to "merge" these opaque user-defined
> metadata entries, so it throws an exception and gives up writing summary files.
> Since the Parquet data source in Spark 1.3.0 supports schema merging, this is
> harmless. But it looks scary to the user. We should try to suppress
> this through the logger.
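On the "suppress this through the logger" remark in the quoted description: the warnings pasted later in this thread carry java.util.logging's timestamp format, so one possible workaround from application code (a hedged sketch, not the fix Spark itself shipped) is to raise that logger's level before saving:

```scala
import java.util.logging.{Level, Logger}

// Sketch: raise the logging level of the Parquet output committer so the
// "could not write summary file" WARNING is no longer emitted. The logger
// name matches the class shown in the pasted warning; whether the Parquet
// build in use logs through java.util.logging is an assumption.
val committerLogger = Logger.getLogger("parquet.hadoop.ParquetOutputCommitter")
committerLogger.setLevel(Level.SEVERE)
```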
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034650#comment-15034650 ]

swetha k commented on SPARK-11620:
---

[~hyukjin.kwon] I have the following code, which saves the Parquet files from my hourly batch to HDFS; it is based on the GitHub link at the end.

{code}
val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)

// Delete the output path if it already exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)
if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

// You need to pass the schema to AvroParquet when you are writing objects but not when you
// are reading them. The schema is saved in the Parquet file for future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a serializable object
val metricsToBeSaved = metrics.map(metricRecord =>
  (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

// coalesce returns a new RDD; its result must be used, or the call has no effect
val coalesced = metricsToBeSaved.coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
coalesced.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)
{code}

https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws
> parquet.io.ParquetEncodingException
> ---------------------------------------------------------
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: swetha k
[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034650#comment-15034650 ]

swetha k edited comment on SPARK-11620 at 12/1/15 9:50 PM:
---

[~hyukjin.kwon] I have the following code, which saves the Parquet files from my hourly batch to HDFS; it is based on the GitHub link at the end. The WARNING message that I get is shown in the previous comments. Any idea why this is happening?

{code}
val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)

// Delete the output path if it already exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)
if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

// You need to pass the schema to AvroParquet when you are writing objects but not when you
// are reading them. The schema is saved in the Parquet file for future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a serializable object
val metricsToBeSaved = metrics.map(metricRecord =>
  (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

// coalesce returns a new RDD; its result must be used, or the call has no effect
val coalesced = metricsToBeSaved.coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
coalesced.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)
{code}

https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

was (Author: swethakasireddy): [~hyukjin.kwon] I have the following code, which saves the Parquet files from my hourly batch to HDFS; it is based on the GitHub link at the end.
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018051#comment-15018051 ]

swetha k commented on SPARK-11620:
---

[~hyukjin.kwon] We use Spark 1.5.2 now and it still shows the same error. Which version of parquet-avro should be used with it?

Thanks,
Swetha
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019284#comment-15019284 ]

swetha k commented on SPARK-11620:
---

It is not an error. It is a WARNING, and I see the following:

{code}
Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
        at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
        at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
{code}
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019288#comment-15019288 ]

swetha k commented on SPARK-11620:
---

[~hyukjin.kwon] If I use ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[PreviousPVTracker]]) with Parquet 1.7.0, I see the following error. It looks like this class is not part of Parquet 1.7.0. My code is based on http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.

{code}
[ERROR] not found: type AvroReadSupport
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[PreviousPVTracker]])
{code}
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002384#comment-15002384 ]

swetha k commented on SPARK-11620:
---

[~hyukjin.kwon] We are using Spark 1.4.1 in one of our clusters. Which Parquet version should be used with 1.4.1?
[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001294#comment-15001294 ]

swetha k commented on SPARK-5968:
---

[~lian cheng] Is this just a logger issue, or could it have any impact on functionality?
[jira] [Created] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
swetha k created SPARK-11620:

Summary: parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
Key: SPARK-11620
URL: https://issues.apache.org/jira/browse/SPARK-11620
Project: Spark
Issue Type: Bug
Reporter: swetha k
[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998015#comment-14998015 ]

swetha k edited comment on SPARK-11620 at 11/10/15 6:07 AM:
---

I see the following warning message when I use parquet-avro in my Spark batch. The following is the dependency that I use:

{code}
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>
{code}

{code}
Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
        at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
        at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
{code}

was (Author: swethakasireddy): I see the following warning message when I use parquet-avro. The following is the dependency that I use.
[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
[ https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998015#comment-14998015 ]

swetha k commented on SPARK-11620:
---

I see the following warning message when I use parquet-avro. The following is the dependency that I use:

{code}
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>
{code}

{code}
Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
        at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
        at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
{code}
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997456#comment-14997456 ]

swetha k commented on SPARK-2365:
---

[~ankurd] What is the appropriate dependency to include for Spark IndexedRDD? I get a compilation error if I include version 0.3 as shown below:

{code}
<dependency>
  <groupId>amplab</groupId>
  <artifactId>spark-indexedrdd</artifactId>
  <version>0.3</version>
</dependency>
{code}

Thanks,
Swetha

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
> Issue Type: New Feature
> Components: GraphX, Spark Core
> Reporter: Ankur Dave
> Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This
> imposes minimal requirements on the storage layer, which only needs to
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient
> support for point lookups would enable serving data out of RDDs, but it
> currently requires iterating over an entire partition to find the desired
> element. Point updates similarly require copying an entire iterator. Joins
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key
> uniqueness and pre-indexing the entries for efficient joins and point
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2)
> maintaining a hash index within each partition, and (3) using purely
> functional (immutable and efficiently updatable) data structures to enable
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a
> limited form of this functionality in VertexRDD. We envision a variety of
> other uses for IndexedRDD, including streaming updates to RDDs, direct
> serving from RDDs, and as an execution strategy for Spark SQL.
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997681#comment-14997681 ]

swetha k commented on SPARK-2365:
---

It does not seem to work, as it is not available in the Maven Central repository. I specified a Maven remote repository, but it does not seem to be picked up. Is this available as a jar that I can include?
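For context: spark-indexedrdd was distributed through the spark-packages.org repository rather than Maven Central, so a repository entry is needed alongside the dependency. A minimal POM sketch; the repository URL is an assumption (the hosting has moved over the years), and the `amplab` groupId follows spark-packages' convention of using the GitHub organization name:

```xml
<!-- Sketch: add the spark-packages repository so the amplab:spark-indexedrdd
     artifact can be resolved; the URL below is an assumption. -->
<repositories>
  <repository>
    <id>spark-packages</id>
    <url>https://repos.spark-packages.org/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>amplab</groupId>
    <artifactId>spark-indexedrdd</artifactId>
    <version>0.3</version>
  </dependency>
</dependencies>
```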
[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996160#comment-14996160 ]

swetha k commented on SPARK-5968:
---

[~marmbrus] How was this issue resolved? I still see the following when I try to save my Parquet file:

{code}
Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
        at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
        at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
        at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
{code}
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978607#comment-14978607 ]

swetha k commented on SPARK-3655:
---

[~koert] The final output for this RDD is RDD[(String, List[(Long, String)])], but I call updateStateByKey on this RDD. Inside updateStateByKey, I process the list and put all the data in a single object, which gets merged with the old state for the session. After updateStateByKey, I return objects for the session that represent the current batch and the merged state.

> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.1.0, 1.2.0
> Reporter: koert kuipers
> Assignee: Koert Kuipers
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon?
> There are some use cases where getting a sorted iterator of values per key is
> helpful.
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978610#comment-14978610 ]

swetha k commented on SPARK-3655:
---

[~koert] If I don't keep the list as a materialized view in memory, what is the appropriate way to use spark-sorted to just group and sort the batch of JSONs by key (sessionId)?
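To make the grouping-and-sorting in question concrete, here is a small local sketch in plain Scala collections, with hypothetical sessionId/timestamp/payload fields. It only illustrates the semantics: in Spark, spark-sorted (or repartitionAndSortWithinPartitions) would push the sort into the shuffle so that a session's values are delivered as a sorted iterator instead of a materialized in-memory list.

```scala
// Hypothetical miniature of one batch: (sessionId, (timestamp, json)) pairs.
val events = Seq(
  ("s1", (3L, "c")), ("s2", (1L, "x")), ("s1", (1L, "a")), ("s1", (2L, "b"))
)

// Group by sessionId, then sort each session's values by timestamp.
// This local version materializes each group, which is exactly what the
// shuffle-based secondary sort avoids on a real cluster.
val grouped: Map[String, List[(Long, String)]] =
  events.groupBy(_._1).map { case (sessionId, vs) =>
    (sessionId, vs.map(_._2).sortBy(_._1).toList)
  }
```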
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14715841#comment-14715841 ]

swetha k commented on SPARK-3655:
---

[~koert] How do I include the dependency for this? Is this available as a jar somewhere?