This is a bug in DataFrame caching. You can avoid caching or turn off
columnar compression. It is fixed in Spark 1.5.1.
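
For example, a minimal sketch of the workaround (assuming a SQLContext
named sqlContext; "df" is a placeholder for the DataFrame being cached):

    // Disable in-memory columnar compression before caching, so the
    // DictionaryEncoding code path that throws here is never taken.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")

    // Then cache as usual; column values are stored uncompressed.
    df.cache()

Alternatively, skip the cache() call entirely and recompute the DataFrame,
which also sidesteps the columnar compression path.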

On Sat, Oct 31, 2015 at 2:31 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> I don’t believe I have it on 1.5.1. Are you able to test the data locally
> to confirm, or is it too large?
>
> From: "Zhang, Jingyu" <jingyu.zh...@news.com.au>
> Date: Friday, October 30, 2015 at 7:31 PM
> To: Silvio Fiorito <silvio.fior...@granturing.com>
> Cc: Ted Yu <yuzhih...@gmail.com>, user <user@spark.apache.org>
>
> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>
> Thanks Silvio and Ted,
>
> Can you please let me know how to fix this intermittent issue? Should I
> wait for EMR to upgrade to Spark 1.5.1, or change my code from DataFrames
> to plain Spark map-reduce?
>
> Regards,
>
> Jingyu
>
> On 31 October 2015 at 09:40, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
>> It's caused by the columnar compression. I've seen similar intermittent
>> issues when caching DataFrames. "sportingpulse.com" is a value in one of
>> the columns of the DF.
>> ------------------------------
>> From: Ted Yu <yuzhih...@gmail.com>
>> Sent: 10/30/2015 6:33 PM
>> To: Zhang, Jingyu <jingyu.zh...@news.com.au>
>> Cc: user <user@spark.apache.org>
>> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>>
>> I searched for sportingpulse in *.scala and *.java files under the 1.5
>> branch. There were no hits.
>>
>> mvn dependency doesn't show sportingpulse either.
>>
>> Is it possible this is specific to EMR?
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 2:57 PM, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
>>
>>> There is no problem in Spark SQL 1.5.1, but the error "key not found:
>>> sportingpulse.com" shows up when I use 1.5.0.
>>>
>>> I have to use version 1.5.0 because that is the one AWS EMR supports.
>>> Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it?
>>>
>>> Thanks.
>>>
>>> Caused by: java.util.NoSuchElementException: key not found: sportingpulse.com
>>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>>> at scala.collection.AbstractMap.default(Map.scala:58)
>>> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>>> at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>>> at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>>> at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
>>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>>> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>
>>
>>
>>
>
