Hi Joe,
You need to figure out which RDD is used most frequently. In your case, rdd2
and rdd3 are filtered results of rdd1, so they are usually much smaller
than rdd1, and it would be more reasonable to cache rdd2 and/or rdd3
if rdd1 is not referenced elsewhere.
Say rdd1 takes 10G and rdd2 takes 1G.
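A minimal sketch of the idea (the data source and filter predicates below are just placeholders):

val rdd1 = sc.textFile("hdfs:///logs")                 // large (~10G), not referenced elsewhere
val rdd2 = rdd1.filter(_.contains("ERROR")).cache()    // small (~1G), reused several times
val rdd3 = rdd1.filter(_.contains("WARN")).cache()

rdd2.count()  // first action materializes rdd2 and caches it
rdd2.count()  // served from the cache; rdd1 is not rescanned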
Your Spark solution first reduces partial results into a single partition,
computes the final result, and then collects it to the driver side. This
involves a shuffle and two waves of network traffic. Instead, you can
collect the partial results directly to the driver and then compute the
final result there.
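For example, a minimal sketch assuming an RDD[Long] named rdd and a simple sum as the final computation:

val partials = rdd.mapPartitions(it => Iterator(it.sum)).collect()  // one small value per partition
val finalResult = partials.sum                                      // finish on the driver, no shuffle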
This JIRA issue probably solves your problem:
https://spark-project.atlassian.net/browse/SPARK-1006
When running with a large number of iterations, the lineage DAG of ALS
becomes very deep, and both the DAGScheduler and the Java serializer may
overflow because they are implemented recursively. You may resort
A tip: using println is only convenient when you are working in local
mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output
of println goes to executor stdout.
On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 noty...@gmail.com wrote:
Yeah, I got it!
Using println to debug is great. Is there a better way to debug?
On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian lian.cs@gmail.com wrote:
A tip: using println is only convenient when you are working in local
mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output
of println goes to executor stdout.
Without caching, an RDD will be evaluated multiple times if referenced
multiple times by other RDDs. A silly example:
val text = sc.textFile("input.log")
val r1 = text.filter(_.startsWith("ERROR"))
val r2 = text.map(_.split(" "))
val r3 = (r1.collect(), r2.collect())
Here the input file will be scanned twice.
Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
question :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.com wrote:
Without
elements are printed only once.
On Wed, Apr 23, 2014 at 4:35 PM, Cheng Lian lian.cs@gmail.com wrote:
Good question :)
Although the RDD DAG is lazily evaluated, it's not exactly the same as a Scala
lazy val. For a Scala lazy val, the evaluated value is automatically cached,
while evaluated RDD elements
You may try this:
val lastOption = sc.textFile("input").mapPartitions { iterator =>
  if (iterator.isEmpty) {
    iterator
  } else {
    Iterator
      .continually((iterator.next(), iterator.hasNext))
      .collect { case (value, false) => value }
      .take(1)
  }
}.collect().lastOption
Have you tried Broadcast.unpersist()?
On Mon, May 5, 2014 at 6:34 PM, Earthson earthson...@gmail.com wrote:
RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast
cleaning. Maybe it could be removed automatically when there are no dependencies.
Hi Ajay, would you mind synthesizing a minimal code snippet that can
reproduce this issue and pasting it here?
On Wed, Jun 4, 2014 at 8:32 PM, Ajay Srivastava a_k_srivast...@yahoo.com
wrote:
Hi,
I am doing a join of two RDDs which is giving different results (counting
the number of records) each
What's the format of the file header? Is it possible to filter them out by
prefix string matching or regex?
On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO raofeng...@gmail.com wrote:
It will certainly cause bad performance, since it reads the whole content
of a large file into one value,
Hey Zhun,
Thanks for the detailed problem description. Please see my comments inline
below.
On Thu, Aug 7, 2014 at 6:18 PM, Zhun Shen shenzhunal...@gmail.com wrote:
Caused by: java.lang.IllegalAccessError: tried to access method
You may use groupByKey in this case.
On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com wrote:
Hi there,
I'm interested in whether it is possible to get the same behavior as the reduce
function from the MR framework. I mean, for each key K, get the list of associated
The point is that in many cases the operation passed to reduceByKey aggregates
data into a much smaller size, e.g. + and * for integers. String concatenation
doesn't actually "shrink" data, so in your case rdd.reduceByKey(_ ++ _) and
rdd.groupByKey suffer from a similar performance issue. In general,
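For instance, a small sketch of the contrast, assuming an RDD[(String, Int)] named counts:

val summed  = counts.reduceByKey(_ + _)  // partial sums are combined map-side; little data is shuffled
val grouped = counts.groupByKey()        // every value crosses the network before grouping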
Maybe a little off topic, but would you mind sharing your motivation for saving
the RDD into an SQL DB?
If you’re just trying to do further transformations/queries with SQL for
convenience, then you may just use Spark SQL directly within your Spark
application without saving them into DB:
Things have changed a bit in the master branch, and the SQL programming
guide in master branch actually doesn’t apply to branch-1.0-jdbc.
In branch-1.0-jdbc, Hive Thrift server and Spark SQL CLI are included in
the hive profile and are thus not enabled by default. You need to either
- pass
Currently the SQL dialect provided by Spark SQL only supports a set of the most
frequently used constructs and doesn't support DDL and DML operations. In
the long run, we'd like to replace it with a full featured SQL-92
implementation.
On Sat, Aug 9, 2014 at 8:11 AM, Sathish Kumaran Vairavelu
Hi Jenny, does this issue only happen when running Spark SQL with YARN in
your environment?
On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao linlin200...@gmail.com wrote:
Hi,
I am able to run my hql query on yarn cluster mode when connecting to the
default hive metastore defined in
Since you were using hql(...), it's probably not related to the JDBC driver.
But I failed to reproduce this issue locally with a single-node
pseudo-distributed YARN cluster. Would you mind elaborating on the steps to
reproduce this bug? Thanks!
On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian
Hi John, the JDBC Thrift server resides in its own build profile and needs
to be enabled explicitly by ./sbt/sbt -Phive-thriftserver assembly.
On Tue, Aug 5, 2014 at 4:54 AM, John Omernik j...@omernik.com wrote:
I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with
the
)
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
... 32 more
Any suggestions?
On Tue, Aug 12, 2014 at 12:47 AM, Cheng Lian lian.cs@gmail.com
wrote:
Hi John, the JDBC Thrift server resides in its own build profile and
needs to be enabled
Maybe irrelevant, but this closely resembles the S3 Parquet file issue we've
met before. It takes a dozen minutes to read the metadata because
ParquetInputFormat tries to call getFileStatus for all part-files
sequentially.
Just checked SequenceFileInputFormat, and found that a MapFile may share
Hi John,
I tried to follow your description but failed to reproduce this issue.
Would you mind providing some more details? Especially:
- The exact Git commit hash of the snapshot version you were using
  (mine: e0f946265b9ea5bc48849cf7794c2c03d5e29fba)
Hey Matt, if you want to access existing Hive data, you still need to run
a Hive metastore service and provide a proper hive-site.xml (just drop it
in $SPARK_HOME/conf).
Could you provide the error log you saw?
On Wed, Aug 27, 2014 at 12:09 PM, Michael Armbrust mich...@databricks.com
You can use the Thrift server to access Hive tables that are located in a legacy
Hive warehouse and/or those generated by Spark SQL. Simba provides a Spark
SQL ODBC driver that enables applications like Tableau. But right now I'm
not 100% sure whether the driver has been officially released yet.
On
You can always use sqlContext.uncacheTable to uncache the old table.
On Fri, Sep 12, 2014 at 10:33 AM, pankaj.arora pankajarora.n...@gmail.com
wrote:
Hi Patrick,
What if all the data has to be kept in cache all the time? If applying union
results in a new RDD, then caching this would result in
Ah, I see. So basically what you need is something like cache write-through
support, which exists in Shark but is not implemented in Spark SQL yet. In
Shark, when inserting data into a table that has already been cached, the
newly inserted data will be automatically cached and "union"-ed with the
Hi Helene,
Thanks for the report. In Spark 1.1, we use a special exit code to
indicate that |SparkSubmit| failed because a class was not found. But
unfortunately I chose a not-so-special exit code: 1. So whenever the
process exits with 1 as the exit code, the |-Phive| error message is shown. A
PR that
Hi Twinkle,
The failure is caused by case sensitivity. The temp table actually
stores the original un-analyzed logical plan, so field names remain
capitalized (F1, F2, etc.). I believe this issue has already been fixed by
PR #2382 https://github.com/apache/spark/pull/2382. As a workaround,
you
You can avoid installing Spark on each node by uploading the Spark distribution
tarball to HDFS and setting |spark.executor.uri| to the HDFS location.
In this way, Mesos will download the tarball before launching
containers. Please refer to this Spark documentation page
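For example, a minimal sketch assuming the tarball has already been uploaded to a hypothetical HDFS path:

import org.apache.spark.{SparkConf, SparkContext}

// Mesos master and HDFS path are placeholders; upload the Spark tarball to HDFS first
val conf = new SparkConf()
  .setAppName("example")
  .setMaster("mesos://master-host:5050")
  .set("spark.executor.uri", "hdfs://namenode:8020/dist/spark-1.1.0-bin-hadoop2.4.tgz")
val sc = new SparkContext(conf)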
This is reasonable, since the actual constructor that gets called is
|Driver()| rather than |Driver(HiveConf)|. The former initializes the
|conf| field by:
|conf = SessionState.get().getConf()
|
And |SessionState.get()| reads a thread-local (TSS) value. Thus executing SQL queries
within another thread causes
If the size of each file is small, you may try
|SparkContext.wholeTextFiles|. Otherwise you can try something like this:
|val filenames: Seq[String] = ...
val combined: RDD[(String, String)] = filenames.map { name =>
  sc.textFile(name).map(line => name -> line)
}.reduce(_ ++ _)
|
On 9/26/14
Yes it is. The in-memory storage used with |SchemaRDD| also uses
|RDD.cache()| under the hood.
On 9/26/14 4:04 PM, Haopu Wang wrote:
Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.
The
Would you mind providing the DDL of this partitioned table together
with the query you tried? The stacktrace suggests that the query was
trying to cast a map into something else, which is not supported in
Spark SQL. And I doubt whether Hive supports casting a complex type to
some other type.
This fix is reasonable, since the actual constructor that gets called is
|Driver()| rather than |Driver(HiveConf)|. The former initializes the
|conf| field by:
|conf = SessionState.get().getConf()
|
And |SessionState.get()| reads a thread-local (TSS) value. Thus executing SQL queries
within another thread
[ works ]
queryResult1withSchema = hiveContext.applySchema(Queryresult1, Queryresult1.schema)
registerTempTable(queryResult1withSchema)
Queryresult2 = Query2 using queryResult1withSchema [ works ]
On Fri, Sep 26, 2014 at 5:13 PM, Cheng Lian lian.cs@gmail.com wrote:
In your case, the table has only one row, whose content is "data",
which is an array. You need something like |SELECT data[0].name FROM
json_table| to access the |name| field.
On 9/29/14 11:08 PM, vdiwakar.malladi wrote:
Hello,
I'm exploring SparkSQL and I'm facing an issue while using the
CACHE TABLE works with partitioned tables.
I guess you're experimenting with a default local metastore and the
metastore_db directory doesn't exist in the first place. In this case,
none of the metastore tables/views exist at first, and the error
message you saw is thrown when the |PARTITIONS|
Also make sure to call |hiveContext.sql| within the same thread where
|hiveContext| is created, because Hive uses a thread-local variable to
initialize |Driver.conf|.
On 10/3/14 4:52 PM, Michael Armbrust wrote:
Are you running master? There was briefly a regression here that is
hopefully
The build command should be correct. What exact error did you encounter
when trying Spark 1.1 + Hive 0.12 + Hadoop 2.5.0?
On 10/7/14 2:21 PM, Li HM wrote:
Thanks for the reply.
Please refer to my other post entitled "How to make ./bin/spark-sql
work with hive". It has all the
Hi Mohammed,
Would you mind sharing the DDL of the table |x| and the complete
stacktrace of the exception you got? A full Spark shell session history
would be more than helpful. PR #2084 was merged into master in August,
and the timestamp type is supported in 1.1.
I tried the following
Hi Poiuytrez, what version of Spark are you using? Exception details
like stacktrace are really needed to investigate this issue. You can
find them in the executor logs, or just browse the application
stderr/stdout link from Spark Web UI.
On 10/9/14 9:37 PM, poiuytrez wrote:
Hello,
I have a
Which version are you using? Also |.saveAsTable()| saves the table to
Hive metastore, so you need to make sure your Spark application points
to the same Hive metastore instance as the JDBC Thrift server. For
example, put |hive-site.xml| under |$SPARK_HOME/conf|, and run
|spark-shell| and
Hmm, there is a "T" in the timestamp string, which makes it an invalid
timestamp string representation. Internally Spark SQL uses
|java.sql.Timestamp.valueOf| to cast a string to a timestamp.
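For example (this is plain JDK behavior, not anything Spark-specific):

import java.sql.Timestamp

Timestamp.valueOf("2014-10-11 02:08:00")  // OK: matches "yyyy-[m]m-[d]d hh:mm:ss[.f...]"
Timestamp.valueOf("2014-10-11T02:08:00")  // throws IllegalArgumentException because of the "T"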
On 10/11/14 2:08 AM, Mohammed Guller wrote:
scala> rdd.registerTempTable("x")
scala> val sRdd
How was the table created? Would you mind sharing the related code? It
seems that the underlying type of the |customer_id| field is actually
long, but the schema says it's integer; basically it's a type mismatch
error.
The first query succeeds because |SchemaRDD.count()| is translated to
Hmm, the details of the error didn't show in your mail...
On 10/10/14 12:25 AM, sadhan wrote:
We have a Hive deployment on which we tried running spark-sql. When we try
to do DESCRIBE table_name for some of the tables, spark-sql fails with
this:
while it works for some of the other tables.
Currently Spark SQL doesn’t support reading SQL specific configurations
via system properties. But for |HiveContext|, you can put them in
|hive-site.xml|.
On 10/13/14 4:28 PM, Kevin Paul wrote:
Hi all, I tried to set the configuration
spark.sql.inMemoryColumnarStorage.compressed, and
Denny Lee wrote an awesome article on how to connect to Tableau to Spark
SQL recently:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
On 10/14/14 6:10 PM, Neeraj Garg02 wrote:
Hi Everybody,
I’m looking for information on possible Thrift JDBC/ODBC clients and
Thrift JDBC/ODBC
On 10/14/14 7:31 PM, Neeraj Garg02 wrote:
Hi All,
I’ve downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop
2.4.
Now, I want to test two features of Spark:
1. YARN deployment: As per my understanding, I need to modify the
"spark-defaults.conf" file with the settings mentioned
The warehouse location needs to be specified before |HiveContext|
initialization; you can set it via:
|./bin/spark-sql --hiveconf
hive.metastore.warehouse.dir=/home/spark/hive/warehouse
|
On 10/15/14 8:55 PM, Hao Ren wrote:
Hi,
The following query in sparkSQL 1.1.0 CLI doesn't work.
On 10/16/14 12:44 PM, neeraj wrote:
I would like to reiterate that I don't have Hive installed on the Hadoop
cluster.
I have some questions about the following comment from Cheng Lian-2:
The Thrift server is used to interact with existing Hive data, and thus
needs Hive Metastore to access Hive catalog
Why do you need to convert a JavaSchemaRDD to SchemaRDD? Are you trying
to use some API that doesn't exist in JavaSchemaRDD?
On 10/15/14 5:50 PM, Earthson wrote:
I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found
that DataTypeConversions is protected[sql].
Finally I
On 10/16/14 10:48 PM, neeraj wrote:
1. I'm trying to use Spark SQL as data source.. is it possible?
Unfortunately Spark SQL ODBC/JDBC support is based on the Thrift
server, so at least you need HDFS and a working Hive Metastore instance
(used to persist catalogs) to make things work.
2.
Hi Neeraj,
The Spark Summit 2014 tutorial uses Spark 1.0. I guess you're using
Spark 1.1? Parquet support got polished quite a bit since then, and
changed the string representation of the query plan, but this output
should be OK :)
Cheng
On 10/16/14 10:45 PM, neeraj wrote:
Hi,
I'm
Hi Michael,
I'm not sure I fully understood your question, but I think RDD.aggregate
can be helpful in your case. You can see it as a more general version of
fold.
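For instance, a small sketch showing aggregate as a generalized fold (unlike fold, the result type may differ from the element type):

val words = sc.parallelize(Seq("a", "bb", "ccc"))
// seqOp folds each element into the per-partition accumulator; combOp merges accumulators
val totalLength = words.aggregate(0)((acc, s) => acc + s.length, _ + _)  // 6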
Cheng
On 10/16/14 11:15 PM, Michael Misiewicz wrote:
Hi,
I'm working on a problem where I'd like to sum items in an RDD in
I guess you're referring to the simple SQL dialect recognized by the
SqlParser component.
Spark SQL supports most DDL and DML of Hive. But the simple SQL dialect
is still very limited. Usually it's used together with some Spark
application written in Java/Scala/Python. Within a Spark
You can first union them into a single RDD and then call |foreach|. In
Scala:
|rddList.reduce(_.union(_)).foreach(myFunc)
|
For the serialization issue, I don’t have any clue unless more code can
be shared.
On 10/16/14 11:39 PM, soumya wrote:
Hi, my programming model requires me to
, 2014 at 11:46 AM, Cheng Lian lian.cs@gmail.com wrote:
Hi Michael,
I'm not sure I fully understood your question, but I think
RDD.aggregate can be helpful in your case. You can see it as a
more general version of fold.
Cheng
On 10/16/14
Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT
INTO ... VALUES ... syntax.
On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
Hi,
When trying to insert records into Hive, I got an error.
My Spark is 1.1.0 and Hive is 0.12.0.
Any idea what could be wrong?
Regards
Arthur
Hi Jenny, how did you configure the classpath and start the Thrift
server (YARN client/YARN cluster/standalone/...)?
On 10/18/14 4:14 AM, Jenny Zhao wrote:
Hi,
If the Spark Thrift JDBC server is started in non-secure mode, it
works fine. With secured mode, in the case of pluggable
In your JSON snippet, 111 and 222 are quoted, i.e. they are strings.
Thus they are automatically inferred as string rather than tinyint by
|jsonRDD|. Try this in the Spark shell:
|val sparkContext = sc
import org.apache.spark.sql._
import sparkContext._
val sqlContext = new
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL
Server. Currently Spark SQL can't run queries against SQL Server. The
foreign data source API planned for Spark 1.2 can make this possible.
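A rough sketch of the JdbcRDD approach (the JDBC URL, query, bounds, and column names below are all placeholders):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val people = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:sqlserver://host;databaseName=mydb;user=u;password=p"),
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",  // the two '?' mark the partition bounds
  1, 1000000, 10,                                           // lowerBound, upperBound, numPartitions
  rs => (rs.getInt("id"), rs.getString("name")))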
On 10/21/14 6:26 PM, Ashic Mahtab wrote:
Hi,
Is there a simple way to run spark
You can resort to |SQLContext.jsonFile(path: String, samplingRate:
Double)| and set |samplingRate| to 1.0, so that all the columns can be
inferred.
You can also use |SQLContext.applySchema| to specify your own schema
(which is a |StructType|).
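A minimal sketch of both options (file paths and field names are just examples):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// infer the schema from every record instead of a sample
val inferred = sqlContext.jsonFile("hdfs:///data/people.json", 1.0)

// or declare the schema explicitly and apply it to an RDD[Row]
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rows = sc.textFile("hdfs:///data/people.csv").map(_.split(",")).map(a => Row(a(0), a(1).trim.toInt))
val people = sqlContext.applySchema(rows, schema)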
On 10/22/14 5:56 AM, Harivardan Jayaraman
Would you mind sharing the DDLs of all involved tables? What format are
these tables stored in? Is this issue specific to this query? I guess
Hive, Shark and Spark SQL all read from the same HDFS dataset?
On 10/27/14 3:45 PM, lyf刘钰帆 wrote:
Hi,
I am using SparkSQL 1.1.0 with cdh 4.6.0
LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE tblB;
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: October 27, 2014 16:48
To: lyf刘钰帆; user@spark.apache.org
Subject: Re: SparkSQL display wrong result
Would you mind sharing the DDLs of all involved tables? What format
I have never tried this yet, but maybe you can use an in-memory Derby
database as the metastore:
https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html
I'll investigate this when I'm free; I guess we can use this for Spark SQL
Hive support testing.
On 10/27/14 4:38 PM, Jianshi Huang
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-EmbeddedMetastore
Cheers
On Oct 27, 2014, at 6:20 AM, Cheng Lian lian.cs@gmail.com wrote:
I have never tried this yet, but maybe you can use an in-memory Derby
Which version of Spark and Hadoop are you using? Could you please provide
the full stack trace of the exception?
On Tue, Oct 28, 2014 at 5:48 AM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
I was trying to set up Spark SQL on a private cluster. I configured a
hive-site.xml under
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0,
and related PRs are already merged or being merged to the master branch.
On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote:
Hi,
My Hive is 0.13.1. How can I make Spark 1.1.0 run on Hive 0.13? Please advise.
Or, any news
Hi Jean,
Thanks for reporting this. This is indeed a bug: for some column types (Binary,
Array, Map and Struct, and unfortunately, for some reason, Boolean), a
NoopColumnStats is used to collect column statistics, which causes this
issue. Filed SPARK-4182 to track it; will fix this ASAP.
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059
On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Great! Thanks.
Sent from my iPad
On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote:
Hi Jean,
Thanks for reporting
You may use |RDD.zipWithIndex|.
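For example, a minimal sketch assuming an existing RowMatrix named rowMatrix:

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

// zipWithIndex pairs each row vector with its position, which becomes the row index
val indexed = new IndexedRowMatrix(
  rowMatrix.rows.zipWithIndex.map { case (row, idx) => IndexedRow(idx, row) })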
On 11/10/14 10:03 PM, Lijun Wang wrote:
Hi,
I need a matrix with each row having an index, e.g., index = 0 for the first
row, index = 1 for the second row. Could someone tell me how to generate such
an IndexedRowMatrix from a RowMatrix?
Besides, is there anyone
On 11/6/14 1:39 AM, Hao Ren wrote:
Hi,
I would like to understand the pipeline of Spark's operations (transformations
and actions) and some details on block storage.
Let's consider the following code:
val rdd1 = sc.textFile("hdfs://...")
rdd1.map(func1).map(func2).count()
For example, we
Hey Sadhan,
I really don't think this is a Spark log... Unlike Shark, Spark SQL
doesn't even provide a Hive mode to let you execute queries against
Hive. Would you please check whether there is an existing HiveServer2
running there? Spark SQL HiveThriftServer2 is just a Spark port of
Currently there’s no way to cache the compressed sequence file directly.
Spark SQL uses in-memory columnar format while caching table rows, so we
must read all the raw data and convert them into columnar format.
However, you can enable in-memory columnar compression by setting
HTTP is not supported yet, and I don't think there's an JIRA ticket for it.
On 11/14/14 8:21 AM, vs wrote:
Does Spark JDBC thrift server allow connections over HTTP?
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server
doesn't seem to indicate this.
one more question - does that mean that we still
need enough memory in the cluster to uncompress the data before it can
be compressed again or does that just read the raw data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote:
If you're looking for executor-side setup and cleanup functions, there
aren't any yet, but you can achieve the same semantics via
|RDD.mapPartitions|.
Please check the "setup() and cleanup()" section of this blog post from
Cloudera for details:
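A minimal sketch of the idea, assuming hypothetical setup(), process() and cleanup() functions:

rdd.mapPartitions { iter =>
  setup()                                 // e.g. open a connection, once per partition
  val results = iter.map(process).toList  // materialize before cleanup so processing happens first
  cleanup()                               // e.g. close the connection
  results.iterator
}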
Can I write it like this?
rdd.mapPartitions { i => setup(); i }.map(...).mapPartitions { i => cleanup(); i }
So I don't need to mess up the logic and can still use map, filter and
other transformations on the RDD.
Jianshi
On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian lian.cs@gmail.com wrote:
Which version are you using? You probably hit this bug
https://issues.apache.org/jira/browse/SPARK-3421 if some field name in
the JSON contains characters other than [a-zA-Z0-9_].
This has been fixed in https://github.com/apache/spark/pull/2563
On 11/14/14 6:35 PM, vdiwakar.malladi wrote:
Hm, I'm not sure whether this is the official way to upgrade CDH Spark.
Maybe you can check out https://github.com/cloudera/spark, apply the required
patches, and then compile your own version.
On 11/14/14 8:46 PM, vdiwakar.malladi wrote:
Thanks for your response. I'm using Spark 1.1.0
Currently
13, 2014 at 10:50 PM, Cheng Lian lian.cs@gmail.com wrote:
No, the columnar buffer is built in a small-batch manner; the
batch size is controlled by the
|spark.sql.inMemoryColumnarStorage.batchSize| property. The
default value for this in master
Hi Sadhan,
Could you please provide the stack trace of the
|ArrayIndexOutOfBoundsException| (if any)? The reason why the first
query succeeds is that Spark SQL doesn’t bother reading all data from
the table to give |COUNT(*)|. In the second case, however, the whole
table is asked to be
(Forgot to cc user mail list)
On 11/16/14 4:59 PM, Cheng Lian wrote:
Hey Sadhan,
Thanks for the additional information, this is helpful. Seems that
some Parquet internal contract was broken, but I'm not sure whether
it's caused by Spark SQL or Parquet, or even maybe the Parquet file
itself
|SQLContext.jsonFile| assumes one JSON record per line. Although I
haven’t tried yet, it seems that this |JsonInputFormat| [1] can be
helpful. You may read your original data set with
|SparkContext.hadoopFile| and |JsonInputFormat|, then transform the
resulting |RDD[String]| into a |JsonRDD|
Hey Hao,
Which commit are you using? Just tried 64c6b9b with exactly the same
command line flags, couldn't reproduce this issue.
Cheng
On 11/17/14 10:02 PM, Hao Ren wrote:
Hi,
I am building spark on the most recent master branch.
I checked this page:
Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :)
On 11/18/14 1:50 AM, Ted Yu wrote:
Looks like this was where you got that commandline:
http://search-hadoop.com/m/JW1q5RlPrl
Cheers
On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com
mailto:inv...@gmail.com
A not-so-efficient way would be this:
|val r0: RDD[OriginalRow] = ...
val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
val r2 = r1.keys.distinct().zipWithIndex()
val r3 = r2.join(r1).values
|
On 11/18/14 8:54 PM, shahab wrote:
Hi,
In my spark application, I am loading some
When a field of an object is referenced in a closure, the object itself is
also captured automatically, so the object needs to be serializable.
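A small sketch of the usual workaround (the class and field names here are hypothetical): copy the field into a local val so the closure no longer captures the whole object:

class Job(val threshold: Int) {  // not Serializable
  def run(rdd: org.apache.spark.rdd.RDD[Int]) = {
    val t = threshold            // copy the field into a local val...
    rdd.filter(_ > t)            // ...so the closure captures only `t`, not the whole Job instance
  }
}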
On 11/19/14 6:39 PM, Hao Ren wrote:
Hi,
When reading through ALS code, I find that:
class ALS private (
private var numUserBlocks: Int,
Hi Judy, could you please provide the commit SHA1 of the version you're
using? Thanks!
On 11/22/14 11:05 AM, Judy Nash wrote:
Hi,
The Thrift server is failing to start for me on the latest Spark 1.2 branch.
I got the error below when I start the Thrift server.
Exception in thread "main"
You're probably hitting this issue
https://issues.apache.org/jira/browse/SPARK-4532
Patrick made a fix for this https://github.com/apache/spark/pull/3398
On 11/22/14 10:39 AM, tridib wrote:
After taking today's build from the master branch, I started getting this error
when running spark-sql:
Class
You may try |EXPLAIN EXTENDED sql| to see the logical plan, analyzed
logical plan, optimized logical plan and physical plan. Also,
|SchemaRDD.toDebugString| shows storage-related debugging information.
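For example, a rough sketch assuming a table registered as "logs" (exact syntax support may vary by version):

sqlContext.sql("EXPLAIN EXTENDED SELECT COUNT(*) FROM logs").collect().foreach(println)

val result = sqlContext.sql("SELECT * FROM logs WHERE level = 'ERROR'")
result.cache()
println(result.toDebugString)  // lineage and storage information for the underlying RDD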
On 11/21/14 4:11 AM, Gordon Benjamin wrote:
hey,
Can anyone tell me how to debug a sql
This thread might be helpful
http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html
On 11/20/14 4:11 AM, Mohammed Guller wrote:
Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server
with Cassandra. It would be great if you could share
SparkContext unsuccessfully.
Let me know if you need anything else.
*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* Friday, November 21, 2014 8:02 PM
*To:* Judy Nash; u...@spark.incubator.apache.org
*Subject:* Re: latest Spark 1.2 thrift server fail with
NoClassDefFoundError on Guava
Hi
For the "never register a table" part, you actually can use Spark SQL
without registering a table, via its DSL. Say you're going to extract an
|Int| field named |key| from the table and double it:
|import org.apache.spark.sql.catalyst.dsl._
val data = sqc.parquetFile(path)
val double =
Spark SQL supports complex types, but casting doesn't work for complex
types right now.
On 11/25/14 4:04 PM, critikaled wrote:
https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Doesn't
Which version are you using? Or if you are using the most recent master
or branch-1.2, which commit are you using?
On 11/25/14 4:08 PM, david wrote:
Hi,
I have 2 files which come from a CSV import of 2 Oracle tables.
F1 has 46730613 rows
F2 has 3386740 rows
I build 2 tables with
:37 GMT+09:00 Cheng Lian lian.cs@gmail.com:
Spark SQL supports complex types, but casting doesn't work for complex types
right now.
On 11/25/14 4:04 PM, critikaled wrote:
https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache