Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
Okay,

This seems to be either a code-version issue or a communication issue. It
works if I execute spark-shell from the master node; it doesn't work if I run
it from my laptop and connect to the master node.

I had already opened the ports for the web UI (8080) and the cluster manager
(7077) on the master node; otherwise it fails much sooner. Do I need to open
up the ports for the workers as well?
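
For what it's worth, my understanding (which could be wrong) is that traffic
has to flow in both directions: the executors on the workers connect back to
the driver, so when the driver runs on my laptop its host and port also have
to be reachable from the EC2 nodes, not just 7077/8080 on the master from my
side. A minimal sketch of what I mean, with placeholder host names and port:

    // Sketch only: driver running off-cluster against the standalone master.
    // The master URL, hostname, and port below are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-s3-connectivity-test")
      .setMaster("spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077")
      .set("spark.driver.host", "my-laptop-public-hostname") // must resolve from the workers
      .set("spark.driver.port", "51000")                     // open this port inbound on my side
    val sc = new SparkContext(conf)

    // quick end-to-end check that executors can actually reach the driver
    println(sc.parallelize(1 to 100).count())
    sc.stop()

    // Other driver-side services (file server, block manager) also listen on
    // ports; whether and how those are configurable depends on the Spark version.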

I used the spark-ec2 install script with --spark-version, first with 1.0.2
and then again with the git hash that corresponds to 1.1.0rc4
(2f9b2bd7844ee8393dc9c319f4fefedf95f5e460). In both cases I rebuilt from
source using the same codebase on my machine and moved the entire project
into /root/spark (since, to run spark-shell against the cluster, it needs to
match the same path as the install on EC2). Could I have missed something
here?

Thanks.
Jim


Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll

> Why I think it's the number of files: I believe all (or a large part) of
> those files are read when you run sqlContext.parquetFile(), and the time
> that takes on S3 is long enough that something internally is timing out.

I'll create the Parquet files with Drill instead of Spark, which will give me
(somewhat) better control over the slice sizes, and see what happens.
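
Before switching tools, here's a rough, untested sketch of the Spark-side
alternative I may also try: read the existing many-part file and rewrite it
with a bounded partition count. The paths and the partition count are
placeholders, and applySchema/schema are 1.1 APIs, so treat this as an
outline rather than something I've run:

    // Untested sketch: compact the ~4600-part Parquet dataset into fewer,
    // larger part files. Paths and the target partition count (64) are placeholders.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val wide = sqlContext.parquetFile("s3n://my-bucket/wide-dataset.parquet")

    // Coalesce the row RDD, then re-attach the schema before writing (Spark 1.1 API).
    val compacted = sqlContext.applySchema(wide.coalesce(64), wide.schema)
    compacted.saveAsParquetFile("s3n://my-bucket/wide-compacted.parquet")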

That said, this behavior seems wrong to me. First, exiting a job due to
inactivity seems like (perhaps?) the wrong fix for an earlier problem. Second,
there IS activity if it's reading the slice headers, yet the job exits anyway.
So if this fixes the problem, the measure of "activity" seems wrong.
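
If it really is an internal timeout, here's a sketch of the settings I'd
experiment with first. The key names are from the Spark 1.x configuration
docs; whether these particular timeouts govern whatever is giving up here is
an assumption on my part, not something I've verified:

    // Untested sketch: raise the Akka timeouts before creating the context.
    // Whether these cover the code path that's timing out is an assumption.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-s3-timeout-test")
      .setMaster("spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077")
      .set("spark.akka.timeout", "600")    // node-to-node communication timeout, seconds
      .set("spark.akka.askTimeout", "300") // ask/reply timeout, seconds
    val sc = new SparkContext(conf)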

Ian and Manu, thanks for your help. I'll post back and let you know if that
fixes it.

Jim


Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
My apologies to the list: I replied to Manu's question and it went directly
to him rather than to the list.

In case anyone else has this issue, here is my reply and Manu's reply to me.
This also answers Ian's question.

---

Hi Manu,

The dataset is 7.5 million rows and 500 columns. In Parquet form it's about
1.1 GB. It was created with Spark and copied up to S3. It has about 4600 parts
(which I'd also like to gain some control over). I can try a smaller dataset;
however, it works when I run it locally, even with the file out on S3. It just
takes a while.
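
For reference, a minimal sketch of the kind of query I'm running; the S3 path
and table name are placeholders rather than the real dataset, and on 1.0.x the
register call is registerAsTable instead of registerTempTable:

    // Minimal sketch, run from spark-shell (placeholder path and table name).
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val wide = sqlContext.parquetFile("s3n://my-bucket/wide-dataset.parquet")
    wide.registerTempTable("wide") // registerAsTable on Spark 1.0.x
    sqlContext.sql("SELECT COUNT(*) FROM wide").collect().foreach(println)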

I can try copying it to HDFS first, but that won't help longer term.

Thanks
Jim

-
Manu's response:
-

I am pretty sure it is due to the number of parts you have. I have a Parquet
data set that is 250M rows and 924 columns, and it is ~2500 files.

I recommend creating a table in Hive with that data set and doing an insert
overwrite so you can get a data set with more manageable files.

Why I think it's the number of files: I believe all (or a large part) of those
files are read when you run sqlContext.parquetFile(), and the time that takes
on S3 is long enough that something internally is timing out.

-Manu


Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Ian O'Connell
> [quoted text from the original post trimmed; the full message appears at the
> bottom of this thread]


Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Manu Mukerji
> [quoted text from the original post trimmed; the full message appears at the
> bottom of this thread]


Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Jim Carroll
...
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

As far as the "Initial job has not accepted any resources" I'm running the
spark-shell command with:

SPARK_MEM=2g ./spark-shell --master
spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077

According to the master web page each node has 6 Gig so I'm not sure why I'm
seeing that message either. If I run with less than 2g I get the following
in my spark-shell:

14/09/08 17:47:38 INFO Remoting: Remoting shut down
14/09/08 17:47:38 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
java.io.IOException: Error reading summaries
        at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:128)

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)

I'm not sure whether this exception is from the spark-shell JVM itself or was
forwarded from the master or from a worker via the master.
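
In case it matters, the way I'm setting memory may itself be part of the
problem: my understanding is that SPARK_MEM is deprecated and that driver and
executor memory are configured separately. A sketch of the SparkConf
equivalent for the executor side (the value is a placeholder, and whether the
OOM is on the driver or the executors is exactly what I'm unsure of):

    // Sketch: set executor memory explicitly instead of relying on SPARK_MEM.
    // Note this does NOT change the heap of the spark-shell/driver JVM itself.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-s3-memory-test")
      .setMaster("spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077")
      .set("spark.executor.memory", "4g") // per-executor heap on the workers (placeholder)
    val sc = new SparkContext(conf)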

Any help would be greatly appreciated.

Thanks
Jim