Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Sadhan Sood
…https://issues.apache.org/jira/browse/SPARK-4502 On Thu, Oct 29, 2015 at 6:00 PM, Sadhan Sood wrote: > I noticed when querying struct data in spark sql, we are requesting the whole column from parquet files. Is this intended or is there some kind of config to control this behaviour? …

SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Sadhan Sood
I noticed when querying struct data in spark sql, we are requesting the whole column from parquet files. Is this intended or is there some kind of config to control this behaviour? Wouldn't it be better to request just the struct field?
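
For illustration, a minimal sketch of the access pattern in question (1.x-era API; the path, table, and field names are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    // Register a Parquet-backed table whose payload column is a struct.
    sqlContext.parquetFile("/data/events").registerTempTable("events")
    // Only payload.user_id is referenced, yet the observation is that the
    // entire payload column is requested from the Parquet files.
    sqlContext.sql("SELECT payload.user_id FROM events").collect()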

Re: hive thriftserver and fair scheduling

2015-10-20 Thread Sadhan Sood
…http://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling You likely want to put each user in their own pool. On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood wrote: > Hi All, does anyone have fair scheduling working for them in a hive server? I have one…

hive thriftserver and fair scheduling

2015-10-20 Thread Sadhan Sood
Hi All, Does anyone have fair scheduling working for them in a hive server? I have one hive thriftserver running and multiple users trying to run queries at the same time on that server using a beeline client. I see that a big query is stopping all other queries from making any progress. Is this…
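
For a shared context, pools are assigned per thread or per session; a minimal sketch, assuming a fairscheduler.xml that defines the pool and a hypothetical pool name "userA":

    // Work submitted from this thread goes to the named fair-scheduler pool.
    sc.setLocalProperty("spark.scheduler.pool", "userA")
    // For the thriftserver, the SQL programming guide's scheduling section
    // describes the per-session equivalent issued from beeline:
    //   SET spark.sql.thriftserver.scheduler.pool=userA;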

[SPARK-SQL] Requested array size exceeds VM limit

2015-09-25 Thread Sadhan Sood
I am trying to run a query on a month of data. The volume of data is not much, but we have a partition per hour and per day. The table schema is heavily nested, with a total of 300 leaf fields. I am trying to run a simple select count(*) query on the table and running into this exception: SELECT …

SPARK-SQL parameter tuning for performance

2015-09-17 Thread Sadhan Sood
Hi Spark users, We are running Spark on Yarn and often query table partitions as big as 100~200 GB from hdfs. Hdfs is co-located on the same cluster on which Spark and Yarn run. I've noticed much higher I/O read rates when I increase the number of executor cores from 2 to 8 (most tasks run in…
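
For reference, the setting being varied can be pinned at submit time; a sketch (the memory value is hypothetical, and 8 is the core count from the comparison above):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "8")    // compared against "2" above
      .set("spark.executor.memory", "16g") // hypothetical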

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Interestingly, if there is nothing running on the dev spark-shell, it recovers successfully and regains the lost executors. Attaching the log for that. Notice the "Registering block manager ..." statements at the very end, after all executors were lost. On Wed, Aug 26, 2015 at 11:27 AM, …

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Attaching the log for when the dev job gets stuck (once all its executors are lost due to preemption). This is a spark-shell job running in yarn-client mode. On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood wrote: > Hi All, we've set up our spark cluster on aws running on yarn (run…

Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
…to set spark.task.maxFailures to a really high value to recover from task failures in such scenarios? Are there other approaches that people have taken for spark multi tenancy that work better in such a scenario? Thanks, Sadhan
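
A sketch of the workaround mentioned above (the value is illustrative, not a recommendation):

    import org.apache.spark.SparkConf

    // Tolerate many task failures so a job can survive waves of executor
    // preemption on a shared YARN cluster.
    val conf = new SparkConf().set("spark.task.maxFailures", "100")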

Re: Error when cache partitioned Parquet table

2015-01-26 Thread Sadhan Sood
Hi Xu-dong, That's probably because your table's partition paths don't look like hdfs://somepath/key=value/*.parquet. Spark tries to extract the partition key's value from the path while caching, and the exception is thrown when it can't find one. On Mon, Jan 26, 2015 at 10:45 AM, …

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-22 Thread Sadhan Sood
…are not supported. On 12/20/14 6:17 AM, Sadhan Sood wrote: > Hey Michael, thank you for clarifying that. Is tachyon the right way to get compressed data in memory, or should we explore the option of adding compression to cached data? This is because our uncomp…

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-19 Thread Sadhan Sood
Thanks Michael, that makes sense. On Fri, Dec 19, 2014 at 3:13 PM, Michael Armbrust wrote: > Yeah, tachyon does sound like a good option here. Especially if you have nested data, it's likely that parquet in tachyon will always be better supported. …

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-19 Thread Sadhan Sood
…There is only column-level encoding (run-length encoding, delta encoding, dictionary encoding) and no generic compression. On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood wrote: > Hi All, wondering if when caching a table backed by lzo compr…

does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Sadhan Sood
Hi All, Wondering if, when caching a table backed by lzo-compressed parquet data, spark also compresses it (using lzo/gzip/snappy) along with column-level encoding, or just does the column-level encoding when "spark.sql.inMemoryColumnarStorage.compressed" is set to true. This is because when I…
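
A minimal sketch of the setting in question (the table name is hypothetical; per the replies earlier in this thread, only column-level encodings are applied, with no generic compression on top):

    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.sql("CACHE TABLE my_table")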

SparkSQL - can we add new column(s) to parquet files

2014-11-21 Thread Sadhan Sood
We create the table definition by reading the parquet file for its schema and store it in the hive metastore. But if someone adds a new column to the schema, and we rescan the schema from the new parquet files and update the table definition, would queries on the table still work? So, old…

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
Thanks Michael, opened this: https://issues.apache.org/jira/browse/SPARK-4520 On Thu, Nov 20, 2014 at 2:59 PM, Michael Armbrust wrote: > Can you open a JIRA? > On Thu, Nov 20, 2014 at 10:39 AM, Sadhan Sood wrote: > I am running on master, pulled yesterday I believe…

Re: Adding partitions to parquet data

2014-11-20 Thread Sadhan Sood
Ah awesome, thanks!! On Thu, Nov 20, 2014 at 3:01 PM, Michael Armbrust wrote: > In 1.2 by default we use Spark parquet support instead of Hive when the SerDe contains the word "Parquet". This should work with hive partitioning. …

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
I am running on master, pulled yesterday I believe, but saw the same issue with 1.2.0. On Thu, Nov 20, 2014 at 1:37 PM, Michael Armbrust wrote: > Which version are you running on again? > On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood wrote: > Also attaching the parquet…

Adding partitions to parquet data

2014-11-20 Thread Sadhan Sood
We are loading parquet data as temp tables but wondering if there is a way to add a partition to the data without going through hive (we still want to use spark's parquet serde as opposed to hive's). The data looks like -> /date1/file1, /date1/file2, ..., /date2/file1, /date2/file2, ..., /daten/filem
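
One 1.x-era workaround sketch for the layout above (paths and table name hypothetical): read each date directory separately and union them before registering, re-registering when a new date arrives.

    val day1 = sqlContext.parquetFile("/data/date1")
    val day2 = sqlContext.parquetFile("/data/date2")
    day1.unionAll(day2).registerTempTable("events")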

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
Also attaching the parquet file if anyone wants to take a further look. On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood wrote: > So, I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching t…

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
…R:0 D:0 V:
value 2: R:0 D:0 V:
value 3: R:0 D:0 V:
value 4: R:0 D:0 V:
value 5: R:0 D:0 V:
value 6: R:0 D:0 V:
value 7: R:0 D:0 V:
value 8: R:0 D:0 V:
value 9: R:0 D:0 V:
I am happy to provide more information, but any help is appreciated. On Sun, Nov 16, 2014 at 7:40 PM, Sadhan Sood wrote: …

Re: Exception in spark sql when running a group by query

2014-11-18 Thread Sadhan Sood
Ah, makes sense - Thanks Michael! On Mon, Nov 17, 2014 at 6:08 PM, Michael Armbrust wrote: > You are perhaps hitting an issue that was fixed by #3248 <https://github.com/apache/spark/pull/3248>? > On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood wrote: …

Exception in spark sql when running a group by query

2014-11-17 Thread Sadhan Sood
While testing sparkSQL, we were running this group-by-with-expression query and got an exception. The same query worked fine on hive. SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), 'yyyy/MM/dd') as pst_date, count(*) as num_xyzs FROM all_matched_abc GROUP BY …
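
A sketch of the failing shape (identifiers are from the report; the table alias and the repeated GROUP BY expression are guesses at the truncated portion):

    sqlContext.sql("""
      SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), 'yyyy/MM/dd') AS pst_date,
             count(*) AS num_xyzs
      FROM all_matched_abc xyz
      GROUP BY from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), 'yyyy/MM/dd')
    """)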

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Sadhan Sood
…resolve the problem, I'll run it through a debugger and see if I can get more information on it in the meantime. Thanks, Sadhan On Sun, Nov 16, 2014 at 4:35 AM, Cheng Lian wrote: > (Forgot to cc user mail list) > On 11/16/14 4:59 PM, Cheng Lian wrote: > Hey Sadhan…

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread sadhan
Hi Cheng, Thanks for your response. Here is the stack trace from yarn logs: …

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Sadhan Sood
…in master and branch-1.2 is 10,000 rows per batch. On 11/14/14 1:27 AM, Sadhan Sood wrote: > Thanks Cheng, just one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again, or does that…

SparkSQL exception on cached parquet table

2014-11-14 Thread Sadhan Sood
While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error:

import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
…

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
…spark.sql.inMemoryColumnarStorage.compressed to true. This property is already set to true by default in the master branch and branch-1.2. On 11/13/14 7:16 AM, Sadhan Sood wrote: > We noticed while caching data from our hive tables, which contain data in compressed sequence file format, that it gets uncompresse…

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed, while caching data from our hive tables which contain data in compressed sequence file format, that it gets uncompressed in memory when cached. Is there a way to turn this off and cache the compressed data as is?

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
…output location for shuffle 0. The data is an lzo-compressed sequence file with compressed size ~26G. Is there a way to understand why the shuffle keeps failing for one partition? I believe we have enough memory to store the uncompressed data in memory. On Wed, Nov 12, 2014 at 2:50 PM, Sadhan Sood wrote…

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
I think you can provide -Pbigtop-dist to build the tar. On Wed, Nov 12, 2014 at 3:21 PM, Sean Owen wrote: > mvn package doesn't make tarballs. It creates artifacts that will generally appear in target/ and subdirectories, and likewise within modules. Look at make-distribution.sh …

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
Just making sure, but are you looking for the tar in the assembly/target dir? On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar wrote: > Hi, I just cloned spark from github and I'm trying to build it to generate a tarball. I'm doing: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Ds…

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
…SchedulerBackend (Logging.scala:logError(75)) - Asked to remove non-existent executor 372
2014-11-12 19:11:21,655 INFO scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Executor lost: 372 (epoch 3)
On Wed, Nov 12, 2014 at 12:31 PM, Sadhan Sood wrote: > We are running spark on yarn with combined mem…

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running spark on yarn with combined memory > 1TB, and when trying to cache a table partition (which is < 100G), we see a lot of failed collect stages in the UI and the caching never succeeds. Because of the failed collects, it seems like the mapPartitions keep getting resubmitted. We have more than enough…

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Sadhan Sood
Hi Cheng, I made sure the only hive server running on the machine is hivethriftserver2: /usr/lib/jvm/default-java/bin/java -cp /usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf -Xms512m…

thrift jdbc server probably running queries as hive query

2014-11-10 Thread Sadhan Sood
I was testing the spark thrift jdbc server by running a simple query in the beeline client. Spark itself is running on a yarn cluster. However, when I run a query in beeline, I see no running jobs in the spark UI (completely empty), and the yarn UI seems to indicate that the submitted query…

Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread sadhan
I ran into the same issue; reverting this commit seems to work: https://github.com/apache/spark/commit/bd86cb1738800a0aa4c88b9afdba2f97ac6cbf25

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
…On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood wrote: > Is there a way to cache certain (or the most recent) partitions of certain tables? > On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust wrote: > It does have support for c…

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
Is there a way to cache certain (or the most recent) partitions of certain tables? On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust wrote: > It does have support for caching using either CACHE TABLE or CACHE TABLE AS SELECT. > On Fri, Oct 24, 2014 at 1:05 AM, ankits wrote: > I want t…
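
A sketch of the suggestion above for caching just the latest slice (the table, column, and predicate are hypothetical):

    // CACHE TABLE ... AS SELECT materializes only the selected rows.
    sqlContext.sql("""
      CACHE TABLE recent_events AS
      SELECT * FROM events WHERE dt >= '2014-10-01'
    """)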

Re: Job cancelled because SparkContext was shut down - failures!

2014-10-24 Thread Sadhan Sood
These seem like s3 connection errors for the table data. Wondering, since we don't see that many failures on hive. I also set spark.task.maxFailures = 15. On Fri, Oct 24, 2014 at 12:15 PM, Sadhan Sood wrote: > Hi, trying to run a query on spark-sql but it keeps failing w…

Job cancelled because SparkContext was shut down - failures!

2014-10-24 Thread Sadhan Sood
Hi, Trying to run a query on spark-sql but it keeps failing with this error on the cli (we are running spark-sql on a yarn cluster):

org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$…

Re: Sharing spark context across multiple spark sql cli initializations

2014-10-23 Thread Sadhan Sood
Thanks Michael, you saved me a lot of time! On Wed, Oct 22, 2014 at 6:04 PM, Michael Armbrust wrote: > The JDBC server is what you are looking for: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server …

Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Sadhan Sood
We want to run multiple instances of the spark sql cli on our yarn cluster, with each instance of the cli used by a different user. This looks non-optimal if each user brings up a different cli, given how spark works on yarn: by running executor processes (and hence consuming resources) on worker nodes…

Re: spark sql not able to find classes with --jars option

2014-10-21 Thread sadhan
It was mainly because spark was setting the jar classes in a thread-local context classloader. The quick fix was to make our serde use the context classloader first.
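
A sketch of the described fix inside the serde (the class name is hypothetical): try the thread context classloader, which spark-sql populates for --jars, before falling back to the defining classloader.

    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    val cls = Class.forName("com.example.CustomSerDe", true, loader)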

spark ui redirecting to port 8100

2014-10-21 Thread sadhan
I set up the spark port to a different one and the connection seems successful, but I get a 302 to /proxy on port 8100? Nothing is listening on that port either.

spark sql not able to find classes with --jars option

2014-10-20 Thread sadhan
When I update the classpath in bin/spark-class by providing the dependency jar, everything works fine, but when I try to provide the same jar through the --jars option, it throws an error while running sql queries: it cannot find the relevant serde class files. I guess this is ok for standalone mode (u…

Re: persist table schema in spark-sql

2014-10-14 Thread sadhan
I realized my mistake of not using hiveContext. So that error is gone, but now I am getting this error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Unknown field info: binary

Re: persist table schema in spark-sql

2014-10-14 Thread sadhan
Thanks Michael. We are running 1.1 and I believe that is the latest release? I am getting this error when I tried doing what you suggested: org.apache.spark.sql.parquet.ParquetTypesConverter$ (ParquetTypes.scala) java.lang.RuntimeException: [2.1] failure: ``UNCACHE'' expected but identifier CREATE found…

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread Sadhan Sood
…to pass a comma-delimited list of paths. I've opened SPARK-3928: Support wildcard matches on Parquet files, to request this feature. Nick On Mon, Oct 13, 2014 at 12:21 PM, Sadhan Sood wrote: …

read all parquet files in a directory in spark-sql

2014-10-13 Thread Sadhan Sood
How can we read all parquet files in a directory in spark-sql? We are following this example, which shows a way to read one file:

// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val …
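
A sketch of the directory case, assuming all files in the directory share a schema (path and table name hypothetical); per the reply earlier in this thread, wildcard support was filed separately as SPARK-3928.

    // Pointing parquetFile at the directory reads its part files as one
    // dataset when they share a schema.
    val all = sqlContext.parquetFile("/data/parquet_dir")
    all.registerTempTable("all_rows")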

persist table schema in spark-sql

2014-10-13 Thread Sadhan Sood
We want to persist the table schema of a parquet file so as to use the spark-sql cli on that table later on. Is this possible, or is the spark-sql cli only good for tables in the hive metastore? We are reading parquet data using this example:

// Read in the parquet file created above. Parquet files are self-describi…
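
Per the replies earlier in these threads (a HiveContext backed by a metastore is needed), a minimal sketch; the column list and location are hypothetical, and the exact STORED AS clause depends on the Hive version bundled with Spark:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("""
      CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
      STORED AS PARQUET
      LOCATION '/data/events'
    """)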

Fwd: how to find the sources for spark-project

2014-10-11 Thread Sadhan Sood
---------- Forwarded message ----------
From: Sadhan Sood
Date: Sat, Oct 11, 2014 at 10:26 AM
Subject: Re: how to find the sources for spark-project
To: Stephen Boesch

Thanks, I still didn't find it - is it under some particular branch? More specifically, I am looking to modify the…

how to find the sources for spark-project

2014-10-10 Thread sadhan
We have our own customization on top of the parquet serde that we've been using for hive. In order to make it work with spark-sql, we need to be able to re-build spark with it. It'll be much easier to rebuild spark with this patch once I can find the sources for org.spark-project.hive. Not sure where…

spark-sql failing for some tables in hive

2014-10-09 Thread sadhan
We have a hive deployment on which we tried running spark-sql. When we try to run describe for some of the tables, spark-sql fails with this: … while it works for some of the other tables. Confused and not sure what's happening here. The same describe command works in hive. What's confusing is the…