Hi,
we always run into issues when inserting or creating tables with the Amazon
EMR Spark version: when inserting a resultset of about 1GB, the spark sql
query never finishes.
inserting a small resultset (like 500MB) works fine.
*spark.sql.shuffle.partitions* is 200 by default, see also:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks
it would be great if something like hive.exec.reducers.bytes.per.reducer
could be implemented.
one idea is: get the total size of all target blocks, then set the number of
partitions accordingly, as sketched below.
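until something like that exists, it can be approximated by hand. a minimal
sketch, assuming a Spark 1.1-style sqlContext; the table path and the
256MB-per-partition threshold are both made up:

import org.apache.hadoop.fs.{FileSystem, Path}

// size the shuffle by the input data, in the spirit of
// hive.exec.reducers.bytes.per.reducer
val bytesPerPartition = 256L * 1024 * 1024  // assumed threshold
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes =
  fs.getContentSummary(new Path("/user/hive/warehouse/some_table")).getLength
val numPartitions = math.max(1, (totalBytes / bytesPerPartition).toInt)
sqlContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)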
--
View this message in
it seems he means querying an RDBMS or Cassandra using Spark SQL, i.e.
multiple data sources for Spark SQL.
i looked through the link he posted
https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources
using their
it is definitely a bug; sqlContext.parquetFile should take both a dir and a
single file as parameter.
the if-check for isDir makes no sense after this commit:
https://github.com/apache/spark/pull/1370/files#r14967550
i opened a ticket for this issue
https://issues.apache.org/jira/browse/SPARK-3138
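to be concrete, both of these calls should be accepted once the isDir check
is fixed (the paths are made up):

val byDir  = sqlContext.parquetFile("/data/table_dir")
val byFile = sqlContext.parquetFile("/data/table_dir/part-00000.parquet")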
ö_ö you should send this message to the hbase user list, not the spark user
list...
but i can give you some personal advice about this: keep column families as
few as possible!
alternatively, using a prefix on the column qualifier could also be an idea,
but read performance may be worse for a use case like yours (see the sketch
below).
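what that looks like in practice; a minimal sketch against the classic HBase
client API (table, family and qualifier names are all made up):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// a single column family "d"; logical grouping is expressed as qualifier
// prefixes ("user_", "geo_") instead of extra column families
val table = new HTable(HBaseConfiguration.create(), "events")
val put = new Put(Bytes.toBytes("row-1"))
put.add(Bytes.toBytes("d"), Bytes.toBytes("user_name"), Bytes.toBytes("alice"))
put.add(Bytes.toBytes("d"), Bytes.toBytes("geo_city"), Bytes.toBytes("berlin"))
table.put(put)
table.close()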
there is no collect_list in hive 0.12.
try this again after this ticket is done:
https://issues.apache.org/jira/browse/SPARK-2706
i am also looking forward to this.
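once that lands, something like this should work through HiveContext;
collect_list is a hive 0.13 UDAF (the table and column names here are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// gather each group's values into an array
hiveContext.hql("SELECT user_id, collect_list(item) FROM orders GROUP BY user_id")
  .collect()
  .foreach(println)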
oh, right, i meant within SQLContext alone: a SchemaRDD from a text file with
a case class.
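roughly along these lines, in the spark 1.0 style (the file name and schema
are made up):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Person] -> SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

sqlContext.sql("SELECT name FROM people WHERE age > 20").collect().foreach(println)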
spark.speculation was not set; is there any speculative execution on the
tachyon side?
tachyon-env.sh only changed the following:
export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export
more interesting: if spark-shell is started on the master node (test01), then
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex")
14/08/12 11:42:06 INFO : initialize(tachyon://...
...
...
14/08/12 11:42:06 INFO : File does not exist:
hive-thriftserver does not work with parquet tables in the hive metastore
either; will this PR fix that too?
no need to change any pom.xml?
no, spark sql can not insert into or update a textfile yet; it can only
insert into parquet files.
but
people.union(new_people).registerAsTable("people")
could be an idea.
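spelled out a bit more; a minimal sketch assuming spark 1.0's SchemaRDD API
(all names are made up, and `people` is assumed to be a SchemaRDD already
registered as a table):

case class Person(name: String, age: Int)

// unionAll keeps the SchemaRDD type, while plain union degrades to RDD[Row]
val newPeople = sqlContext.createSchemaRDD(sc.parallelize(Seq(Person("bob", 30))))
people.unionAll(newPeople).registerAsTable("people")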
sharing/reusing RDDs is useful for many use cases; is this possible by
persisting an RDD on tachyon?
for example, off-heap persisting a named RDD into a given path (instead of
/tmp_spark_tachyon/spark-xxx-xxx-xxx),
or
saveAsParquetFile on tachyon
i tried to save a SchemaRDD on tachyon,
val
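a minimal sketch of what such an attempt could look like; the table path is
made up, the tachyon address is taken from the earlier messages:

// read an existing parquet table, then write it to a tachyon path
// so other contexts can pick it up
val parquetTable = sqlContext.parquetFile("/user/hive/warehouse/some_table")
parquetTable.saveAsParquetFile("tachyon://test01.zala:19998/shared/some_table")

// another application could then load the shared copy:
val shared = sqlContext.parquetFile("tachyon://test01.zala:19998/shared/some_table")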
try to add the following jars to the classpath.
these two posts should be good for setting up a spark+hbase environment and
using the results of an hbase table scan as an RDD.
settings:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
some samples:
http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
a long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about
performance
(http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
that reduceByKey is more efficient than groupByKey... he mentioned that
groupByKey copies all data over the network.
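the difference in a nutshell; a minimal word-count-style sketch (the file
name is made up):

val pairs = sc.textFile("events.txt").map(line => (line.split(",")(0), 1))

// groupByKey ships every single value across the network, then sums locally
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side before the shuffle, moving far less data
val countsFast = pairs.reduceByKey(_ + _)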
right, Spark is more likely to act as an OLAP engine; i believe no one will
use spark for OLTP, so there is always the question of how to share data
between these two platforms efficiently.
and more important is that most enterprise BI tools rely on an RDBMS, or at
least a JDBC/ODBC interface.
in spark 1.1 it is maybe not as easy as in spark 1.0, after this commit:
https://issues.apache.org/jira/browse/SPARK-2446
only binary with the UTF8 annotation will be recognized as string after this
commit, but impala always writes strings without the UTF8 annotation.
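a possible workaround, assuming the spark.sql.parquet.binaryAsString flag
that was added around that time; treat this as a sketch, and the table path
is made up:

// tell spark sql to read un-annotated parquet binary columns as strings,
// which is what impala-written files contain
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val t = sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/some_table")
t.registerAsTable("some_table")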
i do not always want to use
schemaRDD.map {
  case Row(xxx) =>
    ...
}
since with a case pattern we must write out the table schema again.
is there any plan to implement this?
Thanks
Hi,
unfortunately it is not so straightforward:
xxx_parquet.db
is the folder of a managed database created by hive/impala, so every
sub-element in it is a hive/impala table; they are folders in HDFS, each
table has a different schema, and each folder contains one or more parquet
files.
but at least if users want to access the persisted RDDs, they can use
sc.getPersistentRDDs in the same context.
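for example, a minimal sketch (the name "people_cached" is made up; it would
have been set with rdd.setName before caching):

// look up a previously cached RDD by name within the same SparkContext
val maybeRdd = sc.getPersistentRDDs.values.find(_.name == "people_cached")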
no, something like this:
14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2
on 02.xxx: remote Akka client disassociated
...
...
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186)
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was
like this:
val sc = new SparkContext(new SparkConf().setAppName("SLA Filter"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val suffix = args(0)
sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx001_" +
160G of parquet files (ca. 30 files, snappy compressed, made by cloudera impala)
ca. 30 full table scans, taking 3-5 columns out, then some normal scala
operations like substring, groupBy, filter; at the end, saving as a file in HDFS
yarn-client mode, 23 cores and 60G mem / node
but it always failed!
hi, you can take a look here:
http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html