Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-01 Thread chutium
Hi, we always run into issues when inserting or creating a table with the Amazon EMR Spark build. When inserting a result set of about 1 GB, the Spark SQL query never finishes; inserting a small result set (like 500 MB) works fine. *spark.sql.shuffle.partitions* is 200 by default, or *set
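A minimal sketch of the knob being discussed, assuming a HiveContext and made-up table names; fewer shuffle partitions means fewer part-files for the S3 rename phase to finish:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// fewer shuffle partitions -> fewer output files to rename on S3
hiveContext.sql("SET spark.sql.shuffle.partitions=20")
hiveContext.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table")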

RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-08 Thread chutium
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks it would be great if something like hive.exec.reducers.bytes.per.reducer could be implemented. One idea: get the total size of all target blocks, then set the number of partitions accordingly.
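A sketch of that idea only; bytesPerReducer, the warehouse path and the target table are made up here, and schemaRDD stands for the result to be inserted:

import org.apache.hadoop.fs.{FileSystem, Path}

// analogous to hive.exec.reducers.bytes.per.reducer
val bytesPerReducer = 256L * 1024 * 1024
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path("/user/hive/warehouse/src_table")).getLength
val numPartitions = math.max(1, (totalBytes / bytesPerReducer).toInt)

// coalesce before insertInto so the target table gets ~numPartitions files
schemaRDD.coalesce(numPartitions).insertInto("target_table")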

Re: Storage Handlers in Spark SQL

2014-08-26 Thread chutium
It seems he means querying an RDBMS or Cassandra using Spark SQL, i.e. multiple data sources for Spark SQL. I looked through the link he posted https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources using their

Re: sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-19 Thread chutium
It is definitely a bug; sqlContext.parquetFile should take both a directory and a single file as parameter. This if-check for isDir makes no sense after this commit: https://github.com/apache/spark/pull/1370/files#r14967550 I opened a ticket for this issue: https://issues.apache.org/jira/browse/SPARK-3138
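Until the fix lands, pointing at the containing directory is the workaround; a sketch with made-up paths:

// a directory works; a single part-file trips the isDir check
val ok = sqlContext.parquetFile("/data/events")
// val fails = sqlContext.parquetFile("/data/events/part-r-00001.parquet")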

Re: Multiple column families vs Multiple tables

2014-08-19 Thread chutium
ö_ö you should send this message to the HBase user list, not the Spark user list... but I can give you some personal advice about this: keep the number of column families as small as possible! At the least, using a prefix on the column qualifier could also be an idea, but read performance may be worse for your use case like
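For illustration, a sketch of the qualifier-prefix idea with the plain HBase client API; family and qualifier names are made up:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// one column family "d"; logical grouping lives in a qualifier prefix
// instead of in extra column families
val put = new Put(Bytes.toBytes("rowkey1"))
put.add(Bytes.toBytes("d"), Bytes.toBytes("addr:city"), Bytes.toBytes("Berlin"))
put.add(Bytes.toBytes("d"), Bytes.toBytes("addr:zip"), Bytes.toBytes("10115"))
put.add(Bytes.toBytes("d"), Bytes.toBytes("name:first"), Bytes.toBytes("Max"))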

Re: Is hive UDF are supported in HiveContext

2014-08-19 Thread chutium
There is no collect_list in Hive 0.12; try it after this ticket is done: https://issues.apache.org/jira/browse/SPARK-2706 I am also looking forward to this.
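In the meantime, a rough RDD-side stand-in for collect_list, assuming the key is column 0 and the value column 1 (both strings here):

import org.apache.spark.SparkContext._ // pair RDD functions in Spark 1.x

val grouped = schemaRDD
  .map(row => (row.getString(0), row.getString(1)))
  .groupByKey() // key -> all collected values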

Re: How to direct insert vaules into SparkSQL tables?

2014-08-14 Thread chutium
Oh, right, I meant within SQLContext alone: a SchemaRDD from a text file with a case class.
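That is, a minimal sketch; the file path and fields are made up:

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

val people = sc.textFile("/data/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")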

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
spark.speculation was not set; is there any speculative execution on the Tachyon side? In tachyon-env.sh only the following was changed:
export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
More interesting: if spark-shell is started on the master node (test01), then
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex")
14/08/12 11:42:06 INFO : initialize(tachyon://... ... ...
14/08/12 11:42:06 INFO : File does not exist:

Re: CDH5, HiveContext, Parquet

2014-08-11 Thread chutium
hive-thriftserver also does not work with Parquet tables in the Hive metastore. Will this PR fix that too? Is there no need to change any pom.xml?

Re: How to direct insert vaules into SparkSQL tables?

2014-08-11 Thread chutium
No, Spark SQL cannot insert into or update text files yet; it can only insert into Parquet files. But people.union(new_people).registerAsTable("people") could be an idea.
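Spelled out, a sketch of that idea (table names are made up; on a SchemaRDD, unionAll is the variant that keeps the schema):

// append new rows by re-registering the combined SchemaRDD
// under the same table name
val people = sqlContext.sql("SELECT * FROM people")
val new_people = sqlContext.sql("SELECT * FROM new_people_staging")
people.unionAll(new_people).registerAsTable("people")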

share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread chutium
Sharing/reusing RDDs is always useful for many use cases; is this possible by persisting an RDD on Tachyon? Such as off-heap persisting a named RDD into a given path (instead of /tmp_spark_tachyon/spark-xxx-xxx-xxx), or saveAsParquetFile on Tachyon. I tried to save a SchemaRDD on Tachyon, val
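Roughly what is being attempted, as a sketch (the Tachyon URL reuses the host from this thread, the path is an example; the Tachyon client must be on the classpath):

// write a SchemaRDD as Parquet onto Tachyon...
schemaRDD.saveAsParquetFile("tachyon://test01.zala:19998/shared/parquet_table1")

// ...then read it back from another SparkContext/SQLContext
val shared = sqlContext.parquetFile("tachyon://test01.zala:19998/shared/parquet_table1")
shared.registerAsTable("shared_table")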

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-08 Thread chutium
Try to add the following jars to the classpath:
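One way to wire jars in from code, instead of the spark-shell command line (spark-shell --jars a.jar,b.jar); the jar names below are placeholders, not the actual list from this thread:

val conf = new org.apache.spark.SparkConf()
  .setAppName("cassandra-test")
  .setJars(Seq("spark-cassandra-connector_2.10-1.0.0.jar", // placeholder
               "cassandra-driver-core-2.0.3.jar"))         // placeholder
val sc = new org.apache.spark.SparkContext(conf)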

Re: Spark with HBase

2014-08-07 Thread chutium
These two posts should be good for setting up a Spark+HBase environment and using the results of an HBase table scan as an RDD. Settings: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html Some samples: http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
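The usual pattern for this, as a sketch (the table name is an example):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// scan an HBase table into an RDD of (rowkey, Result) pairs
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")
val hbaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])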

Re: reduceByKey to get all associated values

2014-08-07 Thread chutium
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about performance (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/) that reduceByKey will be more efficient than groupByKey... he mentioned that groupByKey copies all data over the network.
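For getting all associated values, the reduceByKey pattern looks like this sketch, where pairs is a hypothetical RDD[(String, String)]:

import org.apache.spark.SparkContext._ // pair RDD functions in Spark 1.x

// collect all values per key via reduceByKey (map-side combine)
val collected = pairs
  .mapValues(v => List(v))
  .reduceByKey(_ ::: _) // key -> List of all associated values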

Re: Save an RDD to a SQL Database

2014-08-07 Thread chutium
Right, Spark acts more like an OLAP system; I believe no one will use Spark as an OLTP system, so there is always the question of how to share data between these two platforms efficiently. More important is that most enterprise BI tools rely on an RDBMS, or at least a JDBC/ODBC interface
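A common hand-rolled bridge, as a sketch: one JDBC connection and one batch per partition. The URL, credentials, table and the RDD's shape (rdd is an RDD[(String, Int)] here) are all placeholders:

import java.sql.DriverManager

rdd.foreachPartition { rows =>
  // one connection and one prepared statement per partition
  val conn = DriverManager.getConnection("jdbc:postgresql://db01/reports", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO kpi (name, value) VALUES (?, ?)")
  rows.foreach { case (name, value) =>
    stmt.setString(1, name)
    stmt.setInt(2, value)
    stmt.addBatch()
  }
  stmt.executeBatch()
  conn.close()
}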

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-23 Thread chutium
In Spark 1.1 it may not be as easy as in Spark 1.0. After this commit: https://issues.apache.org/jira/browse/SPARK-2446 only binary with the UTF8 annotation will be recognized as string, but Impala always writes strings without the UTF8 annotation.
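A sketch of the workaround flag in the 1.1 line for exactly this Impala case, assuming the spark.sql.parquet.binaryAsString option; the path is made up:

// treat un-annotated Parquet binary as string (Impala-written files)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val impalaTable = sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx001")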

why there is only getString(index) but no getString(columnName) in catalyst.expressions.Row.scala ?

2014-07-23 Thread chutium
I do not want to always use schemaRDD.map { case Row(xxx) => ... }; using case we must write out the table schema again. Is there any plan to implement this? Thanks
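A workaround sketch in the meantime: build a name -> index map from the known select order (the column names here are hypothetical):

val colIdx = Seq("name", "age", "city").zipWithIndex.toMap
val names = schemaRDD.map(row => row.getString(colIdx("name")))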

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread chutium
Hi, unfortunately it is not so straightforward. xxx_parquet.db is the folder of a managed database created by Hive/Impala, so every sub-element in it is a Hive/Impala table; they are folders in HDFS, each table has a different schema, and each folder holds one or more Parquet files.
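So loading means walking the database folder; one sketch (the path is the one from this thread, and each subfolder registers as its own table):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path("/user/hive/warehouse/xxx_parquet.db"))
  .filter(_.isDir)
  .foreach { st =>
    sqlContext.parquetFile(st.getPath.toString).registerAsTable(st.getPath.getName)
  }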

Re: gain access to persisted rdd

2014-07-21 Thread chutium
But at least if users want to access the persisted RDDs, they can use sc.getPersistentRDDs in the same context.
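For example, a sketch (the RDD name is made up):

rdd.setName("people_cached").cache()
// later, in the same SparkContext:
val cached = sc.getPersistentRDDs.values.find(_.name == "people_cached")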

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread chutium
No, something like this:
14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2 on 02.xxx: remote Akka client disassociated
... ...
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186)
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-20 Thread chutium
like this:
val sc = new SparkContext(new SparkConf().setAppName("SLA Filter"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val suffix = args(0)
sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx001_" +

Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-19 Thread chutium
160 GB of Parquet files (ca. 30 files, snappy compressed, made by Cloudera Impala), ca. 30 full table scans, taking 3-5 columns out, then some normal Scala operations like substring, groupBy, filter; at the end, saving as files in HDFS. yarn-client mode, 23 cores and 60 GB mem / node, but it always failed!

Re: Shark CDH5 Final Release

2014-04-10 Thread chutium
Hi, you can take a look here: http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html