[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-12-01 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034624#comment-15034624
 ] 

swetha k commented on SPARK-5968:
-

[~lian cheng]

Following are the dependencies and the versions that I am using. I want to know 
whether using a different version would help fix this. I see this error in my 
Spark batch job when I save the Parquet files to HDFS.

<properties>
    <sparkVersion>1.5.2</sparkVersion>
    <avro.version>1.7.7</avro.version>
    <!-- 1.4.3 -->
</properties>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${sparkVersion}</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>${avro.version}</version>
</dependency>

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.6.0rc7</version>
</dependency>

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.6.0rc7</version>
</dependency>


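One thing worth checking: Spark 1.5.x is built against Apache Parquet 1.7.0, 
which moved from the com.twitter to the org.apache.parquet coordinates, so 
pinning the 1.6.0rc7 artifacts above can put two different Parquet versions on 
the classpath. A possible alternative, assuming the code is updated for the 
renamed org.apache.parquet packages:

{code}
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.7.0</version>
</dependency>
{code}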

> Parquet warning in spark-shell
> --
>
> Key: SPARK-5968
> URL: https://issues.apache.org/jira/browse/SPARK-5968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.3.0
>
>
> This may happen in the case of schema evolution, namely appending new Parquet 
> data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file 
> for rankings
> parquet.io.ParquetEncodingException: 
> file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet 
> invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at 
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value 
> metadata differ. Parquet doesn't know how to "merge" this opaque user-defined 
> metadata, and just throws an exception and gives up writing summary files. 
> Since the Parquet data source in Spark 1.3.0 supports schema merging, this is 
> harmless, but it is kind of scary for the user. We should try to suppress it 
> through the logger.
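
A minimal sketch of the suppression the description suggests, assuming the 
message is routed through log4j (some Parquet builds log through 
java.util.logging instead, which would need a j.u.l. configuration):

{code}
# log4j.properties: drop the Parquet committer's summary-file warning
log4j.logger.parquet.hadoop.ParquetOutputCommitter=ERROR
{code}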



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-12-01 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650
 ] 

swetha k commented on SPARK-11620:
--

[~hyukjin.kwon]

I have the following code that saves the Parquet files in my hourly batch to 
HDFS; the code is based on the GitHub link at the end.

val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)
// Check if inputFile exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)

if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
// You need to pass the schema to AvroParquet when you are writing objects but
// not when you are reading them. The schema is saved in the Parquet file for
// future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a
// serializable object; coalesce returns a new RDD, so keep the result.
val metricsToBeSaved = metrics
  .map(metricRecord => (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))
  .coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
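
Not from the original comment, but relevant here: since parquet-mr 1.6 the 
summary files can be disabled outright, which skips the footer merge that 
raises the exception. A minimal sketch against the job configuration above:

{code}
// Sketch: turn off Parquet summary files so commitJob() never merges footers.
// "parquet.enable.summary-metadata" is ParquetOutputFormat.ENABLE_JOB_SUMMARY
// in parquet-mr 1.6/1.7.
job.getConfiguration.setBoolean("parquet.enable.summary-metadata", false)
{code}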

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-12-01 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034650#comment-15034650
 ] 

swetha k edited comment on SPARK-11620 at 12/1/15 9:50 PM:
---

[~hyukjin.kwon]

I have the following code that saves the Parquet files in my hourly batch to 
HDFS; the code is based on the GitHub link at the end. The WARNING message that 
I get is shown in the previous comments. Any idea why this is happening?

val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)
// Check if inputFile exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)

if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
// You need to pass the schema to AvroParquet when you are writing objects but
// not when you are reading them. The schema is saved in the Parquet file for
// future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a
// serializable object; coalesce returns a new RDD, so keep the result.
val metricsToBeSaved = metrics
  .map(metricRecord => (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))
  .coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala


was (Author: swethakasireddy):
[~hyukjin.kwon]

I have the following code that saves the Parquet files in my hourly batch to 
HDFS; the code is based on the GitHub link at the end.

val job = Job.getInstance()
var filePath = "path"
val metricsPath: Path = new Path(filePath)
// Check if inputFile exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)

if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
// You need to pass the schema to AvroParquet when you are writing objects but
// not when you are reading them. The schema is saved in the Parquet file for
// future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in a
// serializable object; coalesce returns a new RDD, so keep the result.
val metricsToBeSaved = metrics
  .map(metricRecord => (null, new SerializableMetrics(
    new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))
  .coalesce(1500)

// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-20 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018051#comment-15018051
 ] 

swetha k commented on SPARK-11620:
--

[~hyukjin.kwon]

We use Spark 1.5.2 now and it still shows the same error. Which version of 
Parquet-Avro should be used for that?

Thanks,
Swetha

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-20 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019284#comment-15019284
 ] 

swetha k commented on SPARK-11620:
--

It is not an error. It is a WARNING and I see the following.

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-20 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019288#comment-15019288
 ] 

swetha k commented on SPARK-11620:
--

[~hyukjin.kwon]

If I use ParquetInputFormat.setReadSupportClass(job, 
classOf[AvroReadSupport[PreviousPVTracker]]) with Parquet 1.7.0, I see the 
following error. It looks like it is not part of Parquet 1.7.0. My code is 
based on http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.

 not found: type AvroReadSupport
[ERROR]   ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[PreviousPVTracker]])
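
For context: Parquet 1.7.0 was the first release under the Apache coordinates, 
and the classes moved from the parquet.* to the org.apache.parquet.* packages, 
so AvroReadSupport still exists but under a new name. A sketch of the adjusted 
import and call, assuming the org.apache.parquet artifacts are on the 
classpath (PreviousPVTracker is the reporter's own Avro-generated class):

{code}
// Parquet 1.7.0 relocated parquet.avro.* to org.apache.parquet.avro.*
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat

ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[PreviousPVTracker]])
{code}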


> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-12 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002384#comment-15002384
 ] 

swetha k commented on SPARK-11620:
--

[~hyukjin.kwon]

We are using Spark 1.4.1 in one of our clusters. Which Parquet version should 
be used with 1.4.1?
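
To the best of my recollection (worth verifying against the Spark 1.4.1 pom), 
Spark 1.4.x was still built against the pre-Apache Parquet line, so the 
matching artifact would be something like:

{code}
<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.6.0rc3</version>
</dependency>
{code}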

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-11 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001294#comment-15001294
 ] 

swetha k commented on SPARK-5968:
-

[~lian cheng]

Is this just a logger issue or would it have any potential impact on the 
functionality?

> Parquet warning in spark-shell
> --
>
> Key: SPARK-5968
> URL: https://issues.apache.org/jira/browse/SPARK-5968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.3.0
>
>
> This may happen in the case of schema evolution, namely appending new Parquet 
> data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file 
> for rankings
> parquet.io.ParquetEncodingException: 
> file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet 
> invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at 
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value 
> metadata differ. Parquet doesn't know how to "merge" this opaque user-defined 
> metadata, and just throws an exception and gives up writing summary files. 
> Since the Parquet data source in Spark 1.3.0 supports schema merging, this is 
> harmless, but it is kind of scary for the user. We should try to suppress it 
> through the logger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)
swetha k created SPARK-11620:


 Summary: parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
parquet.io.ParquetEncodingException
 Key: SPARK-11620
 URL: https://issues.apache.org/jira/browse/SPARK-11620
 Project: Spark
  Issue Type: Bug
Reporter: swetha k






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998015#comment-14998015
 ] 

swetha k edited comment on SPARK-11620 at 11/10/15 6:07 AM:


I see the following WARNING message when I use parquet-avro in my Spark batch 
job. The following is the dependency that I use:


<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.6.0</version>
</dependency>


Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)


was (Author: swethakasireddy):
I see the following WARNING message when I use parquet-avro. The following is 
the dependency that I use:


<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.6.0</version>
</dependency>


Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11620) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998015#comment-14998015
 ] 

swetha k commented on SPARK-11620:
--

I see the following WARNING message when I use parquet-avro. The following is 
the dependency that I use:


<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.6.0</version>
</dependency>


Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> parquet.hadoop.ParquetOutputCommitter.commitJob() throws 
> parquet.io.ParquetEncodingException
> 
>
> Key: SPARK-11620
> URL: https://issues.apache.org/jira/browse/SPARK-11620
> Project: Spark
>  Issue Type: Bug
>Reporter: swetha k
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-11-09 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997456#comment-14997456
 ] 

swetha k commented on SPARK-2365:
-

[~ankurd]

What is the appropriate dependency to include for Spark IndexedRDD? I get a
compilation error if I include 0.3 as the version, as shown below:


<dependency>
    <groupId>amplab</groupId>
    <artifactId>spark-indexedrdd</artifactId>
    <version>0.3</version>
</dependency>


Thanks,
Swetha
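
spark-indexedrdd is published on Spark Packages rather than Maven Central, so 
the dependency above resolves only if the Spark Packages repository is added 
to the pom. A sketch, assuming the Bintray URL the project was hosted at:

{code}
<repositories>
    <repository>
        <id>spark-packages</id>
        <url>http://dl.bintray.com/spark-packages/maven/</url>
    </repository>
</repositories>
{code}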

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-11-09 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997681#comment-14997681
 ] 

swetha k commented on SPARK-2365:
-

It does not seem to be working, as it's not available in the Maven Central 
repo. I specified a Maven remote repository, but it does not seem to be picked 
up. Is this available in the form of a jar that I can include?

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-08 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996160#comment-14996160
 ] 

swetha k commented on SPARK-5968:
-

[~marmbrus]

How was this issue resolved? I still see the following warning when I try to 
save my Parquet file.

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter: could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException: maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid: all the files must be contained in the root active_sessions_current
    at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
    at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)

> Parquet warning in spark-shell
> --
>
> Key: SPARK-5968
> URL: https://issues.apache.org/jira/browse/SPARK-5968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.3.0
>
>
> This may happen in the case of schema evolution, namely appending new Parquet 
> data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file 
> for rankings
> parquet.io.ParquetEncodingException: 
> file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet 
> invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at 
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value 
> metadata differ. Parquet doesn't know how to "merge" this opaque user-defined 
> metadata, and just throws an exception and gives up writing summary files. 
> Since the Parquet data source in Spark 1.3.0 supports schema merging, this is 
> harmless, but it is kind of scary for the user. We should try to suppress it 
> through the logger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-10-28 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978607#comment-14978607
 ] 

swetha k commented on SPARK-3655:
-

[~koert]
The final output for this RDD is RDD[(String, List[(Long, String)])]. But I 
call updateStateByKey on this RDD. Inside updateStateByKey, I process this 
list and put all the data in a single object, which gets merged with the old 
state for this session. After the updateStateByKey, I return objects for the 
session that represent the current batch and the merged batch.
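
A minimal sketch of the shape described above, with a hypothetical 
SessionState class standing in for the real merged-session object:

{code}
// Hypothetical state type holding everything merged so far for one sessionId.
case class SessionState(events: List[(Long, String)])

// updateStateByKey folds each batch's List[(Long, String)] values into the
// previous state for the same session.
def updateSession(batchValues: Seq[List[(Long, String)]],
                  old: Option[SessionState]): Option[SessionState] = {
  val merged = old.getOrElse(SessionState(Nil)).events ++ batchValues.flatten
  Some(SessionState(merged))
}

// perBatch: DStream[(String, List[(Long, String)])] keyed by sessionId
// val sessions = perBatch.updateStateByKey(updateSession _)
{code}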

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-10-28 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978610#comment-14978610
 ] 

swetha k commented on SPARK-3655:
-

[~koert]

If I don't put the list as a materialized view in memory, what is the 
appropriate way to use Spark-Sorted to just group and sort the batch of JSONs 
based on the key (sessionId)?
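
One way to do this without materializing the list is plain Spark's 
repartitionAndSortWithinPartitions (available since 1.2) on a composite 
(sessionId, timestamp) key. A sketch with hypothetical names:

{code}
import org.apache.spark.{HashPartitioner, Partitioner}

// Partition by sessionId alone so all of a session's records land in one
// partition, while the implicit ordering on the composite key sorts them
// by (sessionId, timestamp) within each partition.
class SessionPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (sessionId: String, _) => delegate.getPartition(sessionId)
  }
}

// events: RDD[((String, Long), String)]  -- ((sessionId, timestamp), json)
// val sorted = events.repartitionAndSortWithinPartitions(new SessionPartitioner(200))
// sorted iterates each session's events in time order, no in-memory list.
{code}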

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-08-26 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715841#comment-14715841
 ] 

swetha k commented on SPARK-3655:
-

[~koert]

How do I include the dependency for this? Is this available as a jar somewhere?

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org