You can create a partitioned hive table using Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
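For example, something along these lines should work (a rough sketch; the table name, columns, and the staging table are made up for illustration):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc is an existing SparkContext

// Create a table partitioned by date and store it as Parquet.
hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET""")

// Write into a single partition from a (hypothetical) staging table.
hiveContext.sql(
  "INSERT OVERWRITE TABLE events PARTITION (dt='2015-01-26') SELECT id, payload FROM staging")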
On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote:
Hi,
I've got a bunch of data stored in S3 under directories like this:
It seems likely that there is some sort of bug related to the reuse of
array objects that are returned by UDFs. Can you open a JIRA?
I'll also note that the sql method on HiveContext does run HiveQL
(configured by spark.sql.dialect) and the hql method has been deprecated
since 1.1 (and will
I'm not actually using Hive at the moment - in fact, I'm trying to avoid
it if I can. I'm just wondering whether Spark has anything similar I can
leverage?
Let me clarify: you do not need to have Hive installed, and what I'm
suggesting is completely self-contained in Spark SQL. We support
I'm aiming for 1.3.
On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Thanks Michael. I am sure there have been many requests for this support.
Any release targeted for this?
Thanks,
On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com
. So, the HQL dialect provided by HiveContext: does it use the
Catalyst optimizer? I thought HiveContext was only related to Hive
integration in Spark!
I would be grateful if you could clarify this
cheers
On Sun, Jan 25, 2015 at 1:23 AM, Michael Armbrust mich...@databricks.com
wrote:
I generally recommend people use the HQL dialect provided by the
HiveContext when possible:
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
I'll also note that this is distinct from the Hive on Spark project, which
is based on the Hive query optimizer / execution
I have never used Hive, so I'll have to investigate further.
To clarify, I wasn't recommending you use Apache Hive, but instead the
HiveContext
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
provided by Spark SQL. This will allow you to create views in a hive
Those annotations actually don't work because the timestamp type in SQL has
optional nanosecond precision.
However, there is a PR to add support using Parquet's INT96 type:
https://github.com/apache/spark/pull/3820
On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
You need to use lateral view explode:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
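Roughly (a sketch; the table and the array column names here are invented for illustration):

hiveContext.sql("""
  SELECT name, address
  FROM people
  LATERAL VIEW explode(addresses) addrTable AS address""")

Each row of people is repeated once per element of its addresses array.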
On Fri, Jan 23, 2015 at 7:02 AM, matthes mdiekst...@sensenetworks.com
wrote:
I'm trying to work with nested Parquet data. Reading and writing the Parquet
file is actually working now, but
1) The fields in the SELECT clause are not pushed down to the predicate
pushdown API. I have many optimizations that allow fields to be filtered
out before the resulting object is serialized on the Accumulo tablet
server. How can I get the selection information from the execution plan?
I'm a
, Jan 17, 2015 at 3:38 PM, Michael Armbrust mich...@databricks.com
wrote:
1) The fields in the SELECT clause are not pushed down to the predicate
pushdown API. I have many optimizations that allow fields to be filtered
out before the resulting object is serialized on the Accumulo tablet
server
: Monday, 12 January 2015 1:21 am
To: Nathan nathan.mccar...@quantium.com.au, Michael Armbrust
mich...@databricks.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: SparkSQL schemaRDD MapPartitions calls - performance
issues - columnar formats?
On 1/11/15 1:40 PM, Nathan McCarthy
I'd open an issue on GitHub to ask us to allow you to use Hadoop's glob
file pattern for the path.
On Thu, Jan 15, 2015 at 4:57 AM, David Jones letsnumsperi...@gmail.com
wrote:
I've tried this now. Spark can load multiple avro files from the same
directory by passing a path to a directory.
This is a little confusing, but that code path is actually going through
Hive, so the Spark SQL configuration does not help.
Perhaps try:
set parquet.compression=GZIP;
On Fri, Jan 9, 2015 at 2:41 AM, Ayoub benali.ayoub.i...@gmail.com wrote:
Hello,
I tried to save a table created via the
The other thing to note here is that Spark SQL defensively copies rows when
we switch into user code. This probably explains the difference between 1
and 2.
The difference between 1 and 3 is likely the cost of decompressing the column
buffers vs. accessing a bunch of uncompressed primitive objects.
Have you looked at Spark SQL
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables?
It supports HiveQL, can read from the Hive metastore, and does not require
Hadoop.
On Wed, Jan 7, 2015 at 8:27 AM, jamborta jambo...@gmail.com wrote:
Hi all,
We have been building a system
The cache command caches the entire table, with each column stored in its
own byte buffer. When querying the data, only the columns that you are
asking for are scanned in memory. I'm not sure what mechanism Spark is
using to report the amount of data read.
If you want to read only the data that
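As a rough illustration of the column pruning described above (table and column names are made up):

sqlContext.cacheTable("logs")                      // builds the in-memory column buffers
sqlContext.sql("SELECT userId FROM logs").count()  // only the userId buffers are scanned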
I want to support this but we don't yet. Here is the JIRA:
https://issues.apache.org/jira/browse/SPARK-3851
On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore dragoncu...@gmail.com wrote:
Anyone got any further thoughts on this? I saw the _metadata file seems
to store the schema of every single
The types expected by applySchema are documented in the type reference
section:
http://spark.apache.org/docs/latest/sql-programming-guide.html#spark-sql-datatype-reference
I'd certainly accept a PR to improve the docs and add a link to this from
the applySchema section :)
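For example, a minimal sketch using the documented types (names and values are illustrative):

import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Each Row must hold values of the matching Scala types (String, Int, ...).
val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)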
Can you explain why you
Oh sorry, I'm rereading your email more carefully. Its only because you
have some setup code that you want to amortize?
On Mon, Jan 5, 2015 at 10:40 PM, Michael Armbrust mich...@databricks.com
wrote:
The types expected by applySchema are documented in the type reference
section:
http
Did you follow the link on that page?
THIS REPO HAS BEEN MOVED
https://github.com/marmbrus/sql-avro#please-go-to-the-version-hosted-by-databricks
Please go to the version hosted by Databricks:
https://github.com/databricks/spark-avro
On Mon, Jan 5, 2015 at 1:12 PM, yanenli2 yane...@gmail.com
Predicate push down into the input format is turned off by default because
there is a bug in the current Parquet library that throws null pointer
exceptions when there are full row groups that are null.
https://issues.apache.org/jira/browse/SPARK-4258
You can turn it on if you want:
I'll add that there is a JDBC connector for the Spark SQL data sources API
in the works, and this will work with Python (through the standard SchemaRDD
type conversions).
On Mon, Jan 5, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org wrote:
JavaDataBaseConnectivity is, as far as I know, JVM
I think you are missing something:
$ javap -cp ~/Downloads/spark-sql_2.10-1.2.0.jar org.apache.spark.sql.SchemaRDD | grep toJSON
public org.apache.spark.rdd.RDD<java.lang.String> toJSON();
On Mon, Jan 5, 2015 at 3:11 AM, bchazalet bchaza...@companywatch.net
wrote:
Hi everyone,
I have just
-protobuf-2.5.jar is not generated from Spark source code,
right ?
What would be done after the JIRA is opened ?
Cheers
On Wed, Dec 31, 2014 at 12:16 PM, Michael Armbrust mich...@databricks.com
wrote:
This was not intended, can you open a JIRA?
On Tue, Dec 30, 2014 at 8:40 PM, Ted Yu
Anytime you see java.lang.NoSuchMethodError it means that you have
multiple conflicting versions of a library on the classpath, or you are
trying to run code that was compiled against the wrong version of a library.
On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh sachin.sha...@gmail.com
wrote:
I
Yeah, this looks like a regression in the API due to the addition of
arbitrary decimal support. Can you open a JIRA?
On Sun, Dec 28, 2014 at 12:23 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi Zigen,
Looks like they missed it.
Thanks
Best Regards
On Sat, Dec 27, 2014 at 12:43 PM,
-dev +user
In general you cannot create new RDDs inside closures that run on the
executors (which is what sql inside of a foreach is doing).
I think what you want here is something like:
sqlContext.parquetFile("Data\\Test\\Parquet\\2").registerTempTable("temp2")
sql("SELECT col1, col2 FROM
You can't do this now without writing a bunch of custom logic (see here for
an example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
)
I would like to make this easier as part of improvements to the data sources
API that we are
I would expect this to work. Can you run the standard spark-shell?
On Mon, Dec 29, 2014 at 2:34 AM, critikaled isasmani@gmail.com wrote:
How do I make the spark-ec2 script install Hive and Spark SQL on EC2? When I
run the spark-ec2 script and go to bin and run ./spark-sql and execute
You might also try the following, which I think is equivalent:
schemaRDD.map(_.mkString(","))
On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
I want to convert a schemaRDD into RDD
No, there is not. Can you open a JIRA?
On Tue, Dec 23, 2014 at 6:33 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I am trying to load a Parquet file which has a comma in its name. Yes,
this is a valid file name in HDFS. However, sqlContext.parquetFile
interprets this as a
The various spark contexts generally aren't serializable because you can't
use them on the executors anyway. We made SQLContext serializable just
because it gets pulled into scope more often due to the implicit
conversions it contains. You should try marking the variable that holds
the context
Each JSON object needs to be on a single line since this is the boundary
the TextFileInputFormat uses when splitting up large files.
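In other words, a file like the following (contents are illustrative) parses fine, while a single object pretty-printed across several lines does not:

{"name": "alice", "age": 30}
{"name": "bob", "age": 25}

val people = sqlContext.jsonFile("people.json")  // path is illustrative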
On Wed, Dec 24, 2014 at 12:34 PM, elliott cordo elliottco...@gmail.com
wrote:
I have generally been impressed with the way jsonFile eats just about
any json data
I would expect that killing a stage would kill the whole job. Are you not
seeing that happen?
On Mon, Dec 22, 2014 at 5:09 AM, Xiaoyu Wang wangxy...@gmail.com wrote:
Hello everyone!
Like the title.
I start the Spark SQL 1.2.0 Thrift server and use Beeline to connect to the
server to execute SQL.
With JDBC you often need to load the class so it can register the driver at
the beginning of your program. Usually this is something like:
Class.forName("com.mysql.jdbc.Driver");
On Fri, Dec 19, 2014 at 3:47 PM, durga durgak...@gmail.com wrote:
Hi I am facing an issue with mysql jars with
This is experimental, but you can start the JDBC server from within your
own programs
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45
by
passing it the HiveContext.
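Roughly (a sketch; treat the API as experimental, and the table setup and path are illustrative):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
// Register whatever (temporary) tables you want to expose first, e.g.:
hiveContext.parquetFile("/data/events.parquet").registerTempTable("events")
HiveThriftServer2.startWithContext(hiveContext)  // JDBC clients can now query "events"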
On Fri, Dec 19, 2014 at 6:04
a separate table
for caching some partitions like 'cache table table_cached as select * from
table where date = 201412XX' - the way we are doing right now.
On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com
wrote:
There is only column level encoding (run length
with
partition firstly?
On Thursday, December 18, 2014 2:28 AM, Michael Armbrust
mich...@databricks.com wrote:
- Dev list
Have you looked at partitioned table support? That would only scan data
where the predicate matches the partition. Depending on the cardinality of
the customerId
There is only column level encoding (run length encoding, delta encoding,
dictionary encoding) and no generic compression.
On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com wrote:
Hi All,
Wondering, when caching a table backed by LZO-compressed Parquet data,
whether Spark also
You can create an RDD[String] using whatever method and pass that to
jsonRDD.
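For example, something like this (a sketch; it assumes the hadoop-lzo input format is on the classpath, and the path is illustrative):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val lines = sc.newAPIHadoopFile("s3n://bucket/events.json.lzo",
    classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)

val events = sqlContext.jsonRDD(lines)
events.registerTempTable("events")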
On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Ted,
Thanks for your help.
I'm able to read lzo files using sparkContext.newAPIHadoopFile but I
couldn't do the same for sqlContext because
- Dev list
Have you looked at partitioned table support? That would only scan data
where the predicate matches the partition. Depending on the cardinality of
the customerId column that could be a good option for you.
On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao xuelin...@yahoo.com.invalid
because it is saving one stage. Did I
do something wrong?
Best Regards,
Jerry
On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com
wrote:
You can create an RDD[String] using whatever method and pass that to
jsonRDD.
On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling
To be a little more clear, jsonRDD and jsonFile use the same implementation
underneath. jsonFile is just a convenience method that does
jsonRDD(sc.textFile(...))
On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust mich...@databricks.com
wrote:
The first pass is inferring the schema of the JSON
Underneath the covers, jsonFile uses TextInputFormat, which will split
files correctly based on new lines. Thus, there is no fixed maximum size
for a json object (other than the fact that it must fit into memory on the
executors).
On Mon, Dec 15, 2014 at 7:22 AM, Madabhattula Rajesh Kumar
Can you add this information to the JIRA?
On Mon, Dec 15, 2014 at 10:54 AM, shenghua wansheng...@gmail.com wrote:
Hello,
I hit a problem when using the Spark SQL CLI. A custom UDTF with lateral view
throws a ClassNotFoundException. I did a couple of experiments in the same
environment (Spark version
Is it possible that you are starting more than one SparkContext in a single
JVM without stopping previous ones? I'd try testing with Spark 1.2, which
will throw an exception in this case.
On Mon, Dec 15, 2014 at 8:48 AM, Marius Soutier mps@gmail.com wrote:
Hi,
I’m seeing strange, random
I'll add that there is an experimental method that allows you to start the
JDBC server with an existing HiveContext (which might have registered
temporary tables).
I'm happy to discuss what it would take to make sure we can propagate this
information correctly. Please open a JIRA (and mention me in it).
Regarding including it in 1.2.1, it depends on how invasive the change ends
up being, but it is certainly possible.
On Thu, Dec 11, 2014 at 3:55 AM, nitin
BTW, I cannot use Spark SQL / case classes right now because my table has 200
columns (and I'm on Scala 2.10.3).
You can still apply the schema programmatically:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
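For example, a sketch of building a wide schema without a case class (column names and input parsing are illustrative):

import org.apache.spark.sql._

val columnNames = (1 to 200).map(i => s"c$i")
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

// rawLines is an RDD[String] of comma-separated records (illustrative).
val rowRDD = rawLines.map(_.split(",", -1)).map(fields => Row(fields: _*))
val table = sqlContext.applySchema(rowRDD, schema)
table.registerTempTable("wide_table")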
a little better why the cache call trips this scenario
On Wed, Dec 10, 2014 at 3:50 PM, Michael Armbrust mich...@databricks.com
wrote:
Have you checked to make sure the schema in the metastore matches the
schema in the parquet file? One way to test would be to just use
can take this up as a task for myself if you
want (since this is very crucial for our release).
Thanks
-Nitin
On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich...@databricks.com
wrote:
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
existingSchemaRDD.schema)
Yep, sc.textFile only guarantees that lines are preserved across splits;
that is its semantic. It would be possible to write a custom input format,
but that hasn't been done yet. From the documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Have you checked to make sure the schema in the metastore matches the
schema in the parquet file? One way to test would be to just use
sqlContext.parquetFile(...) which infers the schema from the file instead
of using the metastore.
On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska
Not yet unfortunately. You could cast the timestamp to a long if you don't
need nanosecond precision.
Here is a related JIRA: https://issues.apache.org/jira/browse/SPARK-4768
On Mon, Dec 8, 2014 at 11:27 PM, ZHENG, Xu-dong dong...@gmail.com wrote:
I meet the same issue. Any solution?
On
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
existingSchemaRDD.schema)
This line is throwing away the logical information about existingSchemaRDD
and thus Spark SQL can't know how to push down projections or predicates
past this operator.
Can you describe more the problems
is to use a subquery to add a bunch of column
aliases. I'll try it later.
Thanks,
Jianshi
On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com
wrote:
This is by hive's design. From the Hive documentation:
The column change command will only modify Hive's metadata
That is correct. The HiveContext will create an embedded metastore in
the current directory if you have not configured Hive.
On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
From 1.1.1 documentation, it seems one can use HiveContext instead of
SQLContext without
This is by hive's design. From the Hive documentation:
The column change command will only modify Hive's metadata, and will not
modify data. Users should make sure the actual data layout of the
table/partition conforms with the metadata definition.
On Sat, Dec 6, 2014 at 8:28 PM, Jianshi
You can call .schema on SchemaRDDs. For example:
results.schema.fields.map(_.name)
On Sun, Dec 7, 2014 at 11:36 PM, abhishek reachabhishe...@gmail.com wrote:
Hi,
I have iplRDD, which is JSON, and I do the steps below and query through
HiveContext. I get the results, but without columns
On Sat, Dec 6, 2014 at 5:53 AM, spark.dubovsky.ja...@seznam.cz wrote:
Bonus question: Should the class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of the assembly?
Because it is not in the jar now.
No, these jars cannot be put into the assembly because they have extra
metadata files
Not sure about maven, but you can run that test with sbt:
sbt/sbt "sql/test-only org.apache.spark.sql.api.java.JavaAPISuite"
On Sat, Dec 6, 2014 at 9:59 PM, Ted Yu yuzhih...@gmail.com wrote:
I tried to run tests for core but there were failures. e.g. :
ExternalAppendOnlyMapSuite:
The command runs fine for me on master. Note that Hive does print an
exception in the logs, but that exception does not propagate to user code.
On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got exception saying Hive: NoSuchObjectException(message:table
It does not appear that the in-memory caching currently preserves the
information about the partitioning of the data so this optimization will
probably not work.
On Thu, Dec 4, 2014 at 8:42 PM, nitin nitin2go...@gmail.com wrote:
With some quick googling, I learned that we can provide
All values in Hive are always nullable, though you should still not be
seeing this error.
It should be addressed by this patch:
https://github.com/apache/spark/pull/3150
On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren inv...@gmail.com wrote:
Hi,
I am using SparkSQL on 1.1.0 branch.
The following
I'll add that some of our data formats will actually infer this sort of
useful information automatically. Both Parquet and cached in-memory tables
keep statistics on the min/max value for each column. When you have
predicates over these sorted columns, partitions will be eliminated if they
can't
You need to import sqlContext._
On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou timchou@gmail.com wrote:
I have tried to use function where and filter in SchemaRDD.
I have build class for tuple/record in the table like this:
case class Region(num:Int, str1:String, str2:String)
I also
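For reference, a minimal sketch of what the import in the reply above enables (the file path and parsing are illustrative):

import sqlContext._  // brings createSchemaRDD into scope, so an RDD of case classes becomes a SchemaRDD

case class Region(num: Int, str1: String, str2: String)

val regions = sc.textFile("regions.csv")
  .map(_.split(","))
  .map(r => Region(r(0).trim.toInt, r(1), r(2)))

regions.registerTempTable("regions")  // uses the implicit conversion to SchemaRDD
sqlContext.sql("SELECT * FROM regions WHERE num > 10").collect()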
It won't work until this is merged:
https://github.com/apache/spark/pull/3407
On Wed, Dec 3, 2014 at 9:25 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks,
I'm wondering if someone has successfully used wildcards with a
parquetFile call?
I saw this thread and it makes me think no?
There is an experimental method that allows you to start the JDBC server
with an existing HiveContext (which might have registered temporary tables).
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
A little bit about how to read this output. Resolution occurs from the
bottom up and when you see a tick (') it means that a field is unresolved.
So here it looks like Year_2011_Month_0_Week_0_Site is missing from
your RDD. (We are working on more obvious error messages.)
Michael
On Tue,
, Nov 29, 2014 at 12:57 AM, Michael Armbrust mich...@databricks.com
wrote:
You probably don't need to create a new kind of SchemaRDD. Instead I'd
suggest taking a look at the data sources API that we are adding in Spark
1.2. There is not a ton of documentation, but the test cases show how
You probably don't need to create a new kind of SchemaRDD. Instead I'd
suggest taking a look at the data sources API that we are adding in Spark
1.2. There is not a ton of documentation, but the test cases show how to
implement the various interfaces
Exactly how the query is executed actually depends on a couple of factors
as we do a bunch of optimizations based on the top physical operator and
the final RDD operation that is performed. In general the compute function
is only used when you are doing SQL followed by other RDD operations (map,
In the past I have worked around this problem by avoiding sc.textFile().
Instead I read the data directly inside of a Spark job. Basically, you
start with an RDD where each entry is a file in S3 and then flatMap that
with something that reads the files and returns the lines.
Here's an example:
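A rough sketch of that approach using the Hadoop FileSystem API (the bucket and paths are made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val files = Seq("s3n://my-bucket/logs/part-00000", "s3n://my-bucket/logs/part-00001")

val lines = sc.parallelize(files).flatMap { file =>
  // Hadoop objects are created inside the task because they are not serializable.
  val path = new Path(file)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  val content = scala.io.Source.fromInputStream(in).getLines().toVector
  in.close()
  content
}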
This has been fixed in Spark 1.1.1 and Spark 1.2
https://issues.apache.org/jira/browse/SPARK-3704
On Wed, Nov 26, 2014 at 7:10 PM, 诺铁 noty...@gmail.com wrote:
hi,
don't know whether this question should be asked here, if not, please
point me out, thanks.
we are currently using hive on
Probably the easiest/closest way to do this would be with a UDF, something
like:
registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")
Although this does not modify a column, but instead appends a new column.
Another more
repartition and coalesce should both allow you to achieve what you
describe. Can you maybe share the code that is not working?
On Mon, Nov 24, 2014 at 8:24 PM, tridib tridib.sama...@live.com wrote:
Hello,
I am reading around 1000 input files from disk in an RDD and generating
parquet. It
:30 PM, Michael Armbrust mich...@databricks.com
wrote:
Parquet does a lot of serial metadata operations on the driver which
makes it really slow when you have a very large number of files (especially
if you are reading from something like S3). This is something we are aware
of and that I'd
?
Thanks again!
Daniel
On 25 Nov 2014, at 19:43, Michael Armbrust mich...@databricks.com
wrote:
Probably the easiest/closest way to do this would be with a UDF, something
like:
registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
sql("SELECT *, makeString(c8) AS newC8 FROM
We don't support native UDAs at the moment in Spark SQL. You can write a
UDA using Hive's API and use that within Spark SQL
On Tue, Nov 25, 2014 at 10:10 AM, Barua, Seemanto
seemanto.ba...@jpmchase.com.invalid wrote:
Hi,
I am looking for some resources/tutorials that will help me achive
RDDs are immutable, so calling coalesce doesn't actually change the RDD but
instead returns a new RDD that has fewer partitions. You need to save that
to a variable and call saveAsParquetFile on the new RDD.
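For example (a sketch; the partition count and output path are illustrative):

val combined = schemaRDD.coalesce(10)          // coalesce returns a new SchemaRDD
combined.saveAsParquetFile("/out/combined.parquet")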
On Tue, Nov 25, 2014 at 10:07 AM, tridib tridib.sama...@live.com wrote:
public
I believe coalesce(..., true) and repartition are the same. If the input
files are of similar sizes, then coalesce will be cheaper as it introduces a
narrow dependency
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf,
meaning there won't be a shuffle. However, if there
You are pretty close. The QueryExecution is what drives the phases from
parsing to execution. Once we have a final SparkPlan (the physical plan),
toRdd just calls execute() which recursively calls execute() on children
until we hit a leaf operator. This gives us an RDD[Row] that will compute
Akshat is correct about the benefits of Parquet as a columnar format, but
I'll add that some of this is lost if you just use a lambda function to
process the data. Since your lambda function is a black box, Spark SQL does
not know which columns it is going to use and thus will do a full table
scan.
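As a rough illustration (the table and column names are invented):

sqlContext.sql("SELECT userId FROM events").count()           // Spark SQL can prune down to one column
sqlContext.sql("SELECT * FROM events").map(r => r(0)).count() // opaque lambda: full table scan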
I have to try to battle-test spark-avro.
Thanks
On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust mich...@databricks.com
wrote:
One option (starting with Spark 1.2, which is currently in preview) is
to use the Avro library for Spark SQL. This is very new, but we would love
to get feedback
Can you give the full stack trace. You might be hitting:
https://issues.apache.org/jira/browse/SPARK-4293
On Sun, Nov 23, 2014 at 3:00 PM, critikaled isasmani@gmail.com wrote:
Hi,
I am trying to insert a particular set of data from an RDD into a Hive table.
I have Map[String,Map[String,Int]] in
Parquet does a lot of serial metadata operations on the driver which makes
it really slow when you have a very large number of files (especially if
you are reading from something like S3). This is something we are aware of
and that I'd really like to improve in 1.3.
You might try the (brand new
One option (starting with Spark 1.2, which is currently in preview) is to
use the Avro library for Spark SQL. This is very new, but we would love to
get feedback: https://github.com/databricks/spark-avro
On Thu, Nov 20, 2014 at 10:19 AM, al b beanb...@googlemail.com wrote:
I've read several
Which version are you running on again?
On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
Also attaching the parquet file if anyone wants to take a further look.
On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood sadhan.s...@gmail.com
wrote:
So, I am seeing this issue
If you run master or the 1.2 preview release then it should automatically
skip lines that fail to parse. The corrupted text will be in the column
_corrupt_record and the other columns will be null.
On Thu, Nov 20, 2014 at 7:34 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi Guys,
I really
am wrong.
Thanks
On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust mich...@databricks.com
wrote:
I would use just textFile unless you actually need a guarantee that you
will be seeing a whole file at a time (textFile splits on new lines).
RDDs are immutable, so you cannot add data to them
With SchemaRDD you can save out case classes to Parquet (or JSON in Spark
1.2) automatically, and when you read it back in the structure will be
preserved. However, you won't get case classes when it's loaded back;
instead you'll get rows that you can query.
There is some experimental support for
Looks like IntelliJ might be trying to load the wrong version of Spark?
On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
hey guys
I am at AmpCamp 2014 at UCB right now :-)
Funny Issue...
This code works in Spark-Shell but throws a funny
This is not by design. Can you please file a JIRA?
On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi all, I am running HiveThriftServer2 and noticed that the process stays
up even though there is no driver connected to the Spark master.
I started the server
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf nightwolf...@gmail.com wrote:
Is there a better way to mock this out and test Hive/metastore with
SparkSQL?
I would use TestHive which creates a fresh metastore each time it is
invoked.
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com
wrote:
Another problem I have is that I get a lot of small JSON files and, as a
result, a lot of small Parquet files. I'd like to merge the JSON files into
a few Parquet files. How do I do that?
You can use `coalesce` on any
I am not very familiar with the JSONSerDe for Hive, but in general you
should not need to manually create a schema for data that is loaded from
Hive. You should just be able to call saveAsParquetFile on any SchemaRDD
that is returned from hctx.sql(...).
I'd also suggest you check out the
The whole stack trace / exception would be helpful. Hive is an optional
dependency of Spark SQL, but you will need to include it if you are
planning to use the Thrift server to connect to Tableau. You can enable it
by adding -Phive when you build Spark.
You might also try asking on the cassandra
That error can mean a whole bunch of things (and we've been working
recently to make it more descriptive). Often the actual cause is in the
executor logs.
On Wed, Nov 19, 2014 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:
Has anyone else received this type of error? We are not sure
You can override the schema inference by passing a schema as the second
argument to jsonRDD; however, that's not a super elegant solution. We are
considering one option to make this easier here:
https://issues.apache.org/jira/browse/SPARK-4476
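For example, a sketch of overriding the inferred type of one field (the field names are illustrative):

import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),       // force string instead of the inferred type
  StructField("payload", StringType, nullable = true)))

val events = sqlContext.jsonRDD(jsonLines, schema)  // jsonLines: RDD[String]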
On Tue, Nov 18, 2014 at 11:06 PM, Akhil Das