Re: Spark cluster multi tenancy

2015-08-26 Thread Jerrick Hoang
Would be interested to know the answer too.

On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Interestingly, if there is nothing running on the dev spark-shell, it recovers
 successfully and regains the lost executors. Attaching the log for that.
 Notice the 'Registering block manager ...' statements at the very end, after
 all executors were lost.

 On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Attaching the log for when the dev job gets stuck (once all its executors are
 lost due to preemption). This is a spark-shell job running in yarn-client
 mode.

 On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 We've set up our Spark cluster on AWS running on YARN (Hadoop 2.3) with
 fair scheduling and preemption turned on. The cluster is shared between
 prod and dev work, where prod runs with a higher fair share and can preempt
 dev jobs if there are not enough resources available for it.
 It appears that dev jobs which get preempted often become unstable after
 losing some executors, and the whole job either gets stuck (without making
 any progress) or ends up getting restarted (and hence loses all the work
 done). Has anyone encountered this before? Is the solution just to set
 spark.task.maxFailures to a really high value to recover from task failures
 in such scenarios? Are there other approaches that people have taken for
 Spark multi-tenancy that work better in this scenario?
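 A minimal sketch of the kind of configuration being asked about here; the
 config keys are standard Spark/YARN settings, but the values are purely
 illustrative and not a recommendation from this thread:

```
// Illustrative values only: tolerate more task and executor failures so a
// preempted dev job can keep going once YARN hands the executors back.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dev-adhoc-job")
  .set("spark.task.maxFailures", "32")            // default is 4
  .set("spark.yarn.max.executor.failures", "64")  // cap on executor losses
val sc = new SparkContext(conf)
```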

 Thanks,
 Sadhan








Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Jerrick Hoang
@Michael: would listStatus calls read the actual parquet footers within the
folders?

On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com
wrote:

 Can you please remove me from this distribution list?



 (Filling up my inbox too fast)



 *From:* Michael Armbrust [mailto:mich...@databricks.com]
 *Sent:* Monday, August 24, 2015 2:13 PM
 *To:* Philip Weaver philip.wea...@gmail.com
 *Cc:* Jerrick Hoang jerrickho...@gmail.com; Raghavendra Pandey 
 raghavendra.pan...@gmail.com; User user@spark.apache.org; Cheng, Hao 
 hao.ch...@intel.com

 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I think we are mostly bottlenecked at this point by how fast we can make
 listStatus calls to discover the folders.  That said, we are happy to
 accept suggestions or PRs to make this faster.  Perhaps you can describe
 how your home-grown partitioning works?



 On Sun, Aug 23, 2015 at 7:38 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 1 minute to discover 1000s of partitions -- yes, that is what I have
 observed. And I would assert that is very slow.



 On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust mich...@databricks.com
 wrote:

 We should not be actually scanning all of the data of all of the
 partitions, but we do need to at least list all of the available
 directories so that we can apply your predicates to the actual values that
 are present when we are deciding which files need to be read in a given
 spark job.  While this is a somewhat expensive operation, we do it in
 parallel and we cache this information when you access the same relation
 more than once.



 Can you provide a little more detail about how exactly you are accessing
 the parquet data (are you using sqlContext.read or creating persistent
 tables in the metastore?), and how long it is taking?  It would also be
 good to know how many partitions we are talking about and how much data is
 in each.  Finally, I'd like to see the stacktrace where it is hanging to
 make sure my above assertions are correct.
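 For reference, the two access paths distinguished above would look roughly
 like this in a 1.4/1.5 spark-shell (the path and table name below are
 placeholders, not taken from this thread):

```
// Direct read of the partitioned Parquet layout, then a predicate on the
// partition columns:
val df = sqlContext.read.parquet("s3n://some-bucket/test_table")
df.filter("date_prefix = '20150819' and hour = '00'").count()

// The same query against a table registered in the Hive metastore:
sqlContext.sql(
  "select count(*) from test_table where date_prefix = '20150819' and hour = '00'"
).show()
```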



 We have several tables internally that have 1000s of partitions and while
 it takes ~1 minute initially to discover the metadata, after that we are
 able to query the data interactively.







 On Sun, Aug 23, 2015 at 2:00 AM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Does anybody have any suggestions?



 On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Is there a workaround without updating Hadoop? I would really appreciate it if
 someone could explain what Spark is trying to do here and what an easy way to
 turn it off would be. Thanks all!



 On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:

 Did you try with Hadoop version 2.7.1? It is known that s3a, which is
 available in 2.7, works really well with Parquet; they fixed a lot of issues
 related to metadata reading there...
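 As a rough sketch of what switching to s3a looks like from the Spark side
 (the bucket name and credential lookup are placeholders, and this assumes the
 Hadoop 2.7.x hadoop-aws jars are on the classpath):

```
// Point the s3a connector at the credentials, then read with s3a:// paths.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = sqlContext.read.parquet("s3a://some-bucket/test_table")  // s3a, not s3n
df.registerTempTable("test_table")
```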

 On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote:

 @Cheng, Hao : Physical plans show that it got stuck on scanning S3!



 (table is partitioned by date_prefix and hour)

 explain select count(*) from test_table where date_prefix='20150819' and
 hour='00';



 TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]

  TungstenExchange SinglePartition

   TungstenAggregate(key=[],
 value=[(count(1),mode=Partial,isDistinct=false)]

Scan ParquetRelation[ .. about 1000 partition paths go here ]



 Why does Spark have to scan all partitions when the query only concerns 1
 partition? Doesn't that defeat the purpose of partitioning?



 Thanks!



 On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
 I couldn't find much information about it online. What does it mean exactly
 to disable it? Are there any negative consequences to disabling it?



 On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Can you do some more profiling? I am wondering if the driver is busy
 scanning HDFS / S3. For example, run jstack on the driver process.



 Also, it would be great if you could paste the physical plan for the
 simple query.



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 1:46 PM
 *To:* Cheng, Hao
 *Cc:* Philip Weaver; user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
 CLs trying to speed up Spark SQL with tables that have a huge number of
 partitions; I've made sure those CLs are included, but it's still very slow.



 On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
 false.
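 A minimal sketch of applying that flag in a session (the key is taken
 verbatim from the suggestion above; whether it exists and helps depends on
 the exact Spark build):

```
// The same setting, applied two ways from a spark-shell / SQL session.
sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")
sqlContext.sql("SET spark.sql.sources.partitionDiscovery.enabled=false")
```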



 BTW, which version are you using?



 Hao



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 12:16 PM
 *To:* Philip

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Jerrick Hoang
Does anybody have any suggestions?

On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:

 Is there a workaround without updating Hadoop? I would really appreciate it if
 someone could explain what Spark is trying to do here and what an easy way to
 turn it off would be. Thanks all!

 On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey 
 raghavendra.pan...@gmail.com wrote:

 Did you try with Hadoop version 2.7.1? It is known that s3a, which is
 available in 2.7, works really well with Parquet; they fixed a lot of issues
 related to metadata reading there...
 On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote:

 @Cheng, Hao : Physical plans show that it got stuck on scanning S3!

 (table is partitioned by date_prefix and hour)
 explain select count(*) from test_table where date_prefix='20150819' and
 hour='00';

 TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
  TungstenExchange SinglePartition
   TungstenAggregate(key=[],
 value=[(count(1),mode=Partial,isDistinct=false)]
Scan ParquetRelation[ .. about 1000 partition paths go here ]

 Why does Spark have to scan all partitions when the query only concerns 1
 partition? Doesn't that defeat the purpose of partitioning?

 Thanks!

 On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
 and I couldn't find much information about it online. What does it mean
 exactly to disable it? Are there any negative consequences to disabling it?

 On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao hao.ch...@intel.com
 wrote:

 Can you do some more profiling? I am wondering if the driver is busy
 scanning HDFS / S3. For example, run jstack on the driver process.



 Also, it would be great if you could paste the physical plan for the
 simple query.



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 1:46 PM
 *To:* Cheng, Hao
 *Cc:* Philip Weaver; user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
 CLs trying to speed up Spark SQL with tables that have a huge number of
 partitions; I've made sure those CLs are included, but it's still very slow.



 On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com
 wrote:

 Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
 false.



 BTW, which version are you using?



 Hao



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 12:16 PM
 *To:* Philip Weaver
 *Cc:* user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I guess the question is why does spark have to do partition discovery
 with all partitions when the query only needs to look at one partition? Is
 there a conf flag to turn this off?



 On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver 
 philip.wea...@gmail.com wrote:

 I've had the same problem. It turns out that Spark (specifically
 parquet) is very slow at partition discovery. It got better in 1.5 (not yet
 released), but was still unacceptably slow. Sadly, we ended up reading
 parquet files manually in Python (via C++) and had to abandon Spark SQL
 because of this problem.



 On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,



 I did a simple experiment with Spark SQL. I created a partitioned
 parquet table with only one partition (date=20140701). A simple `select
 count(*) from table where date=20140701` would run very fast (0.1 seconds).
 However, as I added more partitions the query takes longer and longer. When
 I added about 10,000 partitions, the query took way too long. I feel like
 querying for a single partition should not be affected by having more
 partitions. Is this a known behaviour? What does spark try to do here?



 Thanks,

 Jerrick













Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
@Cheng, Hao : Physical plans show that it got stuck on scanning S3!

(table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and
hour='00';

TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
   Scan ParquetRelation[ .. about 1000 partition paths go here ]

Why does Spark have to scan all partitions when the query only concerns 1
partition? Doesn't that defeat the purpose of partitioning?

Thanks!

On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver philip.wea...@gmail.com
wrote:

 I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
 I couldn't find much information about it online. What does it mean exactly
 to disable it? Are there any negative consequences to disabling it?

 On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Can you do some more profiling? I am wondering if the driver is busy
 scanning HDFS / S3. For example, run jstack on the driver process.



 Also, it would be great if you could paste the physical plan for the
 simple query.



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 1:46 PM
 *To:* Cheng, Hao
 *Cc:* Philip Weaver; user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
 CLs trying to speed up Spark SQL with tables that have a huge number of
 partitions; I've made sure those CLs are included, but it's still very slow.



 On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
 false.



 BTW, which version are you using?



 Hao



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 12:16 PM
 *To:* Philip Weaver
 *Cc:* user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I guess the question is why does spark have to do partition discovery
 with all partitions when the query only needs to look at one partition? Is
 there a conf flag to turn this off?



 On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 I've had the same problem. It turns out that Spark (specifically parquet)
 is very slow at partition discovery. It got better in 1.5 (not yet
 released), but was still unacceptably slow. Sadly, we ended up reading
 parquet files manually in Python (via C++) and had to abandon Spark SQL
 because of this problem.



 On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,



 I did a simple experiment with Spark SQL. I created a partitioned parquet
 table with only one partition (date=20140701). A simple `select count(*)
 from table where date=20140701` would run very fast (0.1 seconds). However,
 as I added more partitions the query takes longer and longer. When I added
 about 10,000 partitions, the query took way too long. I feel like querying
 for a single partition should not be affected by having more partitions. Is
 this a known behaviour? What does spark try to do here?



 Thanks,

 Jerrick











Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
Is there a workaround without updating Hadoop? I would really appreciate it if
someone could explain what Spark is trying to do here and what an easy way to
turn it off would be. Thanks all!

On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 Did you try with Hadoop version 2.7.1? It is known that s3a, which is
 available in 2.7, works really well with Parquet; they fixed a lot of issues
 related to metadata reading there...
 On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote:

 @Cheng, Hao : Physical plans show that it got stuck on scanning S3!

 (table is partitioned by date_prefix and hour)
 explain select count(*) from test_table where date_prefix='20150819' and
 hour='00';

 TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
  TungstenExchange SinglePartition
   TungstenAggregate(key=[],
 value=[(count(1),mode=Partial,isDistinct=false)]
Scan ParquetRelation[ .. about 1000 partition paths go here ]

 Why does Spark have to scan all partitions when the query only concerns 1
 partition? Doesn't that defeat the purpose of partitioning?

 Thanks!

 On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
 and I couldn't find much information about it online. What does it mean
 exactly to disable it? Are there any negative consequences to disabling it?

 On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao hao.ch...@intel.com
 wrote:

 Can you do some more profiling? I am wondering if the driver is busy
 scanning HDFS / S3. For example, run jstack on the driver process.



 Also, it would be great if you could paste the physical plan for the
 simple query.



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 1:46 PM
 *To:* Cheng, Hao
 *Cc:* Philip Weaver; user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
 CLs trying to speed up Spark SQL with tables that have a huge number of
 partitions; I've made sure those CLs are included, but it's still very slow.



 On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com
 wrote:

 Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
 false.



 BTW, which version are you using?



 Hao



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 12:16 PM
 *To:* Philip Weaver
 *Cc:* user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I guess the question is why does spark have to do partition discovery
 with all partitions when the query only needs to look at one partition? Is
 there a conf flag to turn this off?



 On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 I've had the same problem. It turns out that Spark (specifically
 parquet) is very slow at partition discovery. It got better in 1.5 (not yet
 released), but was still unacceptably slow. Sadly, we ended up reading
 parquet files manually in Python (via C++) and had to abandon Spark SQL
 because of this problem.



 On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,



 I did a simple experiment with Spark SQL. I created a partitioned
 parquet table with only one partition (date=20140701). A simple `select
 count(*) from table where date=20140701` would run very fast (0.1 seconds).
 However, as I added more partitions the query takes longer and longer. When
 I added about 10,000 partitions, the query took way too long. I feel like
 querying for a single partition should not be affected by having more
 partitions. Is this a known behaviour? What does spark try to do here?



 Thanks,

 Jerrick












Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of
CLs trying to speed up Spark SQL with tables that have a huge number of
partitions; I've made sure those CLs are included, but it's still very slow.

On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
 false.



 BTW, which version are you using?



 Hao



 *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
 *Sent:* Thursday, August 20, 2015 12:16 PM
 *To:* Philip Weaver
 *Cc:* user
 *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
 partitions



 I guess the question is why does spark have to do partition discovery with
 all partitions when the query only needs to look at one partition? Is there
 a conf flag to turn this off?



 On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com
 wrote:

 I've had the same problem. It turns out that Spark (specifically parquet)
 is very slow at partition discovery. It got better in 1.5 (not yet
 released), but was still unacceptably slow. Sadly, we ended up reading
 parquet files manually in Python (via C++) and had to abandon Spark SQL
 because of this problem.



 On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,



 I did a simple experiment with Spark SQL. I created a partitioned parquet
 table with only one partition (date=20140701). A simple `select count(*)
 from table where date=20140701` would run very fast (0.1 seconds). However,
 as I added more partitions the query takes longer and longer. When I added
 about 10,000 partitions, the query took way too long. I feel like querying
 for a single partition should not be affected by having more partitions. Is
 this a known behaviour? What does spark try to do here?



 Thanks,

 Jerrick







Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I guess the question is why does spark have to do partition discovery with
all partitions when the query only needs to look at one partition? Is there
a conf flag to turn this off?

On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com
wrote:

 I've had the same problem. It turns out that Spark (specifically parquet)
 is very slow at partition discovery. It got better in 1.5 (not yet
 released), but was still unacceptably slow. Sadly, we ended up reading
 parquet files manually in Python (via C++) and had to abandon Spark SQL
 because of this problem.

 On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,

 I did a simple experiment with Spark SQL. I created a partitioned parquet
 table with only one partition (date=20140701). A simple `select count(*)
 from table where date=20140701` would run very fast (0.1 seconds). However,
 as I added more partitions the query takes longer and longer. When I added
 about 10,000 partitions, the query took way too long. I feel like querying
 for a single partition should not be affected by having more partitions. Is
 this a known behaviour? What does spark try to do here?

 Thanks,
 Jerrick





Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
Hi all,

I did a simple experiment with Spark SQL. I created a partitioned parquet
table with only one partition (date=20140701). A simple `select count(*)
from table where date=20140701` would run very fast (0.1 seconds). However,
as I added more partitions the query takes longer and longer. When I added
about 10,000 partitions, the query took way too long. I feel like querying
for a single partition should not be affected by having more partitions. Is
this a known behaviour? What does spark try to do here?
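For concreteness, a hedged reconstruction of the experiment described above
(the path, table name, and data are illustrative, not the original setup):

```
// Write a small Parquet table partitioned by `date`, then count one partition.
import sqlContext.implicits._

val data = sc.parallelize(1 to 1000).map(i => (i, "20140701")).toDF("id", "date")
data.write.partitionBy("date").parquet("/tmp/test_table")

sqlContext.read.parquet("/tmp/test_table").registerTempTable("test_table")
sqlContext.sql("select count(*) from test_table where date = '20140701'").show()
```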

Thanks,
Jerrick


Re: Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Yes! I was being dumb; I should have caught that earlier. Thank you, Cheng Lian.

On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian lian.cs@gmail.com wrote:

 It doesn't seem to be Parquet 1.7.0 since the package name isn't under
 org.apache.parquet (1.7.0 is the first official Apache release of
 Parquet). The version you were using is probably Parquet 1.6.0rc3 according
 to the line number information:
 https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L249
 And you're hitting PARQUET-136, which has been fixed in (the real) Parquet
 1.7.0 https://issues.apache.org/jira/browse/PARQUET-136
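 A quick, generic way to double-check which Parquet build is actually on the
 classpath (not something from this thread, just a JVM trick) is to ask where
 the converter class was loaded from; pre-1.7.0 releases live under the bare
 `parquet` package:

```
// Prints the jar that provides the pre-1.7.0 `parquet.*` classes, if present.
val src = Class.forName("parquet.format.converter.ParquetMetadataConverter")
  .getProtectionDomain.getCodeSource.getLocation
println(src)
```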

 Cheng


 On 8/8/15 6:20 AM, Jerrick Hoang wrote:

 Hi all,

 I have a partitioned Parquet table (a very small table with only 2
 partitions). The Spark version is 1.4.1 and the Parquet version is 1.7.0. I
 applied this patch [SPARK-7743] to Spark, so I assume that Spark can read
 Parquet files normally; however, I'm getting this when trying to do a
 simple `select count(*) from table`:

 ```org.apache.spark.SparkException: Job aborted due to stage failure: Task
 29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
 stage 44.0: java.lang.NullPointerException
 at
 parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
 at
 parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
 at
 parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
 at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
 at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
 at
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
 at
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 at
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:153)
 at
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
 at
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)```

 Has anybody seen this before?

 Thanks





Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Hi all,

I have a partitioned Parquet table (a very small table with only 2
partitions). The Spark version is 1.4.1 and the Parquet version is 1.7.0. I
applied this patch [SPARK-7743] to Spark, so I assume that Spark can read
Parquet files normally; however, I'm getting this when trying to do a
simple `select count(*) from table`:

```org.apache.spark.SparkException: Job aborted due to stage failure: Task
29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in
stage 44.0: java.lang.NullPointerException
at
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
at
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
at
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
at
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.init(SqlNewHadoopRDD.scala:153)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
at
org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)```

Has anybody seen this before?

Thanks


Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Jerrick Hoang
How big is the dataset? How complicated is the query?

On Sun, Jul 26, 2015 at 12:47 AM Louis Hust louis.h...@gmail.com wrote:

 Hi, all,

 I am using a Spark DataFrame to fetch a small table from MySQL,
 and I found it costs much more than directly accessing MySQL using JDBC.

 The time cost for Spark is about 2033ms, while direct access takes about 16ms.

 Code can be found at:


 https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java

 So is my configuration for Spark wrong? How can I optimise Spark to
 achieve performance similar to direct access?

 Any idea will be appreciated!
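 For context, the two paths being compared look roughly like this (URL,
 table name, and credentials are placeholders); for a tiny table, much of the
 fixed cost on the Spark side is likely job planning and scheduling rather
 than the read itself:

```
import java.util.Properties
import java.sql.DriverManager

// Spark path: build a DataFrame over JDBC, run a job, collect the rows.
val props = new Properties()
props.setProperty("user", "test")
props.setProperty("password", "test")
val df = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/test", "small_table", props)
df.collect()

// Direct path: a single JDBC round trip, no job scheduling involved.
val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "test", "test")
val rs = conn.createStatement().executeQuery("select * from small_table")
```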




Re: Spark-hive parquet schema evolution

2015-07-21 Thread Jerrick Hoang
Hi Lian,

Sorry, I'm new to Spark so I did not express myself very clearly. I'm
concerned about the situation where, let's say, I have a Parquet table with
some partitions, and I add a new column A to the Parquet schema and write some
data with the new schema to a new partition in the table. If I'm not mistaken,
if I do a sqlContext.read.parquet(table_path).printSchema() it will print
the correct schema with the new column A. But if I do a 'describe table' from
the SparkSQL CLI I won't see the new column being added. I understand that this
is because Hive doesn't support schema evolution. So what is the best way
to support CLI queries in this situation? Do I need to manually alter the
table every time the underlying schema changes?
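To frame the question, the manual workaround being asked about would look
something like this (the table, path, and column names are hypothetical, and
this assumes the CLI passes the ALTER TABLE statement through to Hive):

```
// Reading the files directly picks up the evolved schema...
sqlContext.read.parquet("/warehouse/my_table").printSchema()  // shows column A

// ...but the metastore definition has to be updated by hand before
// 'describe table' in the CLI will show the new column.
sqlContext.sql("ALTER TABLE my_table ADD COLUMNS (A string)")
```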

Thanks

On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian lian.cs@gmail.com wrote:

  Hey Jerrick,

 What do you mean by schema evolution with Hive metastore tables? Hive
 doesn't take schema evolution into account. Could you please give a
 concrete use case? Are you trying to write Parquet data with extra columns
 into an existing metastore Parquet table?

 Cheng


 On 7/21/15 1:04 AM, Jerrick Hoang wrote:

 I'm new to Spark, any ideas would be much appreciated! Thanks

 On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,

  I'm aware of the support for schema evolution via DataFrame API. Just
 wondering what would be the best way to go about dealing with schema
 evolution with Hive metastore tables. So, say I create a table via SparkSQL
 CLI, how would I deal with Parquet schema evolution?

  Thanks,
 J






Re: Spark-hive parquet schema evolution

2015-07-20 Thread Jerrick Hoang
I'm new to Spark, any ideas would be much appreciated! Thanks

On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang jerrickho...@gmail.com
wrote:

 Hi all,

 I'm aware of the support for schema evolution via DataFrame API. Just
 wondering what would be the best way to go about dealing with schema
 evolution with Hive metastore tables. So, say I create a table via SparkSQL
 CLI, how would I deal with Parquet schema evolution?

 Thanks,
 J



Spark-hive parquet schema evolution

2015-07-18 Thread Jerrick Hoang
Hi all,

I'm aware of the support for schema evolution via DataFrame API. Just
wondering what would be the best way to go about dealing with schema
evolution with Hive metastore tables. So, say I create a table via SparkSQL
CLI, how would I deal with Parquet schema evolution?

Thanks,
J


Re: Getting not implemented by the TFS FileSystem implementation

2015-07-16 Thread Jerrick Hoang
So, this has to do with the fact that 1.4 has a new way of interacting with
the Hive metastore; still investigating. Would really appreciate it if anybody
has any insights :)

On Tue, Jul 14, 2015 at 4:28 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:

 Hi all,

 I'm upgrading from Spark 1.3 to Spark 1.4, and when trying to run the
 spark-sql CLI it gave a ```java.lang.UnsupportedOperationException: Not
 implemented by the TFS FileSystem implementation``` exception. I did not get
 this error with 1.3, and I don't use any TFS FileSystem. The full stack trace
 is:

 ```Exception in thread main java.lang.RuntimeException:
 java.lang.UnsupportedOperationException: Not implemented by the TFS
 FileSystem implementation
 at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
 at
 org.apache.spark.sql.hive.client.ClientWrapper.init(ClientWrapper.scala:105)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
 at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
 at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.init(IsolatedClientLoader.scala:166)
 at
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
 at
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
 at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:358)
 at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:205)
 at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:204)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at org.apache.spark.sql.SQLContext.init(SQLContext.scala:204)
 at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:71)
 at
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
 at
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.init(SparkSQLCLIDriver.scala:248)
 at
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:136)
 at
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.UnsupportedOperationException: Not implemented by the
 TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
 at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
 ... 31 more```

 Thanks all



Getting not implemented by the TFS FileSystem implementation

2015-07-14 Thread Jerrick Hoang
Hi all,

I'm upgrading from Spark 1.3 to Spark 1.4, and when trying to run the
spark-sql CLI it gave a ```java.lang.UnsupportedOperationException: Not
implemented by the TFS FileSystem implementation``` exception. I did not get
this error with 1.3, and I don't use any TFS FileSystem. The full stack trace
is:

```Exception in thread main java.lang.RuntimeException:
java.lang.UnsupportedOperationException: Not implemented by the TFS
FileSystem implementation
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at
org.apache.spark.sql.hive.client.ClientWrapper.init(ClientWrapper.scala:105)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:170)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.init(IsolatedClientLoader.scala:166)
at
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:212)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:358)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:205)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:204)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.init(SQLContext.scala:204)
at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:71)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.init(SparkSQLCLIDriver.scala:248)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:136)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.UnsupportedOperationException: Not implemented by the
TFS FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
... 31 more```

Thanks all


Re: Basic Spark SQL question

2015-07-13 Thread Jerrick Hoang
Well, for ad-hoc queries you can use the CLI.
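To make that concrete: the latency win comes from keeping one long-running
application (e.g. a spark-shell or CLI session in yarn-client mode) and reusing
its context, so each query is just a job inside an already-running application
rather than a fresh submission. A minimal sketch (the table name is a
placeholder):

```
// Inside an already-running spark-shell on YARN, `sc` already exists; build
// one HiveContext and reuse it for ad-hoc queries.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("select count(*) from some_table").show()
hiveContext.sql("select * from some_table limit 10").show()
```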

On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid
wrote:

 Hi,
   I have a question for Spark SQL. Is there a way to be able to use Spark
 SQL on YARN without having to submit a job?
   Bottom line here is I want to be able to reduce the latency of running
 queries as a job. I know that the spark sql default submission is like a
 job, but was wondering if it's possible to run queries like one would with
 a regular db like MySQL or Oracle.

 Thanks,
 Ron






hive-site.xml spark1.3

2015-07-13 Thread Jerrick Hoang
Hi all,

I have conf/hive-site.xml pointing to my Hive metastore, but the Spark SQL
CLI doesn't pick it up (copying the same conf/ files to Spark 1.4 and 1.2
works fine). Just wondering if someone has seen this before.
Thanks


Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Sorry all for not being clear. I'm using spark 1.4 and the table is a hive
table, and the table is partitioned.

On Sun, Jul 12, 2015 at 6:36 PM, Yin Huai yh...@databricks.com wrote:

 Jerrick,

 Let me ask a few clarification questions. What is the version of Spark? Is
 the table a hive table? What is the format of the table? Is the table
 partitioned?

 Thanks,

 Yin

 On Sun, Jul 12, 2015 at 6:01 PM, ayan guha guha.a...@gmail.com wrote:

 Describe computes statistics, so it will try to query the table. The one
 you are looking for is df.printSchema()
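 A one-liner version of that suggestion (the table name is a placeholder):

```
// printSchema() only touches the table's schema metadata, so it returns
// immediately instead of scanning records.
sqlContext.table("my_partitioned_table").printSchema()
```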

 On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,

 I'm new to Spark and this question may be trivial or may have already been
 answered, but when I do a 'describe table' from the SparkSQL CLI it seems to
 look at all records in the table (which takes a really long time for a big
 table) instead of just giving me the metadata of the table. I would
 appreciate it if someone could give me some pointers, thanks!




 --
 Best Regards,
 Ayan Guha





SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Hi all,

I'm new to Spark and this question may be trivial or may have already been
answered, but when I do a 'describe table' from the SparkSQL CLI it seems to
look at all records in the table (which takes a really long time for a big
table) instead of just giving me the metadata of the table. I would
appreciate it if someone could give me some pointers, thanks!