To reiterate, it's very important for Spark's workers to have the same
memory available. Think about Spark uniformly chopping up your data and
distributing the work to the nodes. The algorithm is not designed to
consider that a worker has less memory available than some other worker.
On Thu, Dec
I ran into something similar before. 19/20 partitions would complete very
quickly, and 1 would take the bulk of the time and shuffle reads/writes. This was
because the majority of partitions were empty, and 1 had all the data. Perhaps
something similar is going on here - I would suggest taking a
Using
<dependency>
  <groupId>com.esotericsoftware</groupId>
  <artifactId>kryo-shaded</artifactId>
  <version>3.0.0</version>
</dependency>
Instead of
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo</artifactId>
Good point, Ankit.
Steve - You can click on the link for '27' in the first column to get a
breakdown of how much data is in each of those 116 cached partitions. But
really, you also want to understand how much data is in the 4 non-cached
partitions, as they may be huge. One thing you can try
In general, most use cases don't need the RDD to be replicated in memory
multiple times. It would be a rare exception to do this. If it's really
expensive (time consuming) to recompute a lost partition, or if the use
case is extremely time sensitive, then maybe you could replicate it in
memory.
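If you did need it, a replicated storage level is how you would ask for it - a minimal sketch (the rdd here is just a placeholder):
import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(1 to 100)
rdd.persist(StorageLevel.MEMORY_ONLY_2)   // keep two in-memory copies of each partition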
Are you running Spark in Local or Standalone mode? In either mode, you
should be able to hit port 4040 (to see the Spark
Jobs/Stages/Storage/Executors UI) on the machine where the driver is
running. However, in local mode, you won't have a Spark Master UI on 7080
or a Worker UI on 7081.
You can
Hi,
I have a RDD like below:
(1, (10, 20))
(2, (30, 40, 10))
(3, (30))
…
Is there any way to map it to this:
(10,1)
(20,1)
(30,2)
(40,2)
(10,2)
(30,3)
…
Generally, each element might be mapped to multiple pairs.
Thanks in advance!
Best,
Yifan LI
Hi,
You can access your logs in the /spark_home_directory/logs/ directory.
cat the files there and you will get the logs.
Thanks.
On Thu, Dec 4, 2014 at 2:27 PM, FFeng [via Apache Spark User List]
ml-node+s1001560n20344...@n3.nabble.com wrote:
I have written data to the Spark log.
I get it
Hi Sourabh,
I came across the same problem as you. One workable solution for me was to
serialize the parts of the model that can be used again to recreate it. I
serialize the RDDs in my model using saveAsObjectFile, with a timestamp
attached, in HDFS. My other Spark application reads from the latest
Fantastic, thanks for the quick fix!
On 3 December 2014 at 22:11, Andrew Or and...@databricks.com wrote:
This should be fixed now. Thanks for bringing this to our attention.
2014-12-03 13:31 GMT-08:00 Andrew Or and...@databricks.com:
Yeah this is currently broken for 1.1.1. I will submit a
Hi All,
I want to hash-partition (and then cache) a SchemaRDD in such a way that
partitions are based on the hash of the values of a column (the ID column in my
case).
e.g. if my table has ID column with values as 1,2,3,4,5,6,7,8,9 and
spark.sql.shuffle.partitions is configured as 3, then there should be 3
Hi,
I have a graph and I want to create RDDs equal in number to the nodes in
the graph. How can I do that?
If I have 10 nodes then I want to create 10 rdds. Is that possible in
GraphX?
Like in the C language we have arrays of pointers. Do we have arrays of RDDs in
Spark?
Can we create such an array and
At 2014-12-04 02:08:45 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
I have a graph and I want to create RDDs equal in number to the nodes in
the graph. How can I do that?
If I have 10 nodes then I want to create 10 rdds. Is that possible in
GraphX?
This is possible: you can collect
Hi Sameer,
Your model recreation should be:
val model = new LinearRegressionModel(weights, intercept)
As you have already got the weights for the linear regression model using stochastic
gradient descent, you just have to use LinearRegressionModel to construct the
new model. Another point to note is that
There's no built-in support for doing this, so the best option is to copy and
modify Pregel to check the accumulator at the end of each iteration. This is
robust and shouldn't be too hard, since the Pregel code is short and only uses
public GraphX APIs.
Ankur
At 2014-12-03 09:37:01 -0800, Jay
Hi,
rdd.flatMap(e => e._2.map(i => (i, e._1)))
Should work, but I didn't test it so maybe I'm missing something.
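For reference, a runnable version of that idea on the sample data from the question (assuming the values are Seqs of Ints):
val rdd = sc.parallelize(Seq((1, Seq(10, 20)), (2, Seq(30, 40, 10)), (3, Seq(30))))
val inverted = rdd.flatMap { case (k, vs) => vs.map(v => (v, k)) }   // emit one (value, key) pair per element
inverted.collect().foreach(println)   // (10,1), (20,1), (30,2), (40,2), (10,2), (30,3)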
Paolo
Sent from my Windows Phone
From: Yifan LI <mailto:iamyifa...@gmail.com>
Sent: 04/12/2014 09:27
To:
Hi!
does anybody have a useful example of the StreamingListener interface? When
and how can we use this interface to stop streaming once one batch of data
has been processed?
Thanks a lot
Hi Kal El!
Have you managed to stop streaming after the first iteration? If yes, can you share
example code?
Thanks
That was it, Thanks. (Posting here so people know it's the right answer in
case they have the same need :) ).
sowen wrote
Probabilities won't sum to 1 since this expression doesn't incorporate
the probability of the evidence, I imagine? It's constant across
classes so is usually excluded. It
Thanks Akhil
You are very helpful.
Thanks, Paolo and Mark. :)
On 04 Dec 2014, at 11:58, Paolo Platter paolo.plat...@agilelab.it wrote:
Hi,
rdd.flatMap(e => e._2.map(i => (i, e._1)))
Should work, but I didn't test it so maybe I'm missing something.
Paolo
Sent from my Windows Phone
From: Yifan LI
Hi Davies,
Thanks for the reply
The problem is I have empty dictionaries in my field3 as well. It gives me
an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema
    schema = _infer_schema(first)
Thanks for the reply!
To be honest, I was expecting Spark to have some sort of indexing for keys,
which would help it locate the keys efficiently.
I wasn't using Spark SQL here, but if it helps perform this efficiently, I
can try it out. Can you please elaborate on how it will be helpful in this
I'm not sure sample is what I was looking for.
As mentioned in another post above, this is what I'm looking for:
1) My RDD contains this structure: Tuple2<CustomTuple, Double>.
2) Each CustomTuple is a combination of string IDs, e.g.
CustomTuple.dimensionOne=AE232323
Hi,
According to my knowledge of the current Spark Streaming Kafka connector, I think
there's no way for the application user to detect that kind of failure; it will either
be handled by the Kafka consumer with the ZK coordinator, or by the ReceiverTracker in
Spark Streaming, so I think you don't need to take care
Did anyone get a chance to look at this?
Please provide some help.
Thanks
Nikhil
Hello everyone,
I was wondering what is the most efficient way for retrieving the top K
values per key in a (key, value) RDD.
The simplest way I can think of is to do a groupByKey, sort the iterables
and then take the top K
elements for every key.
But reduceByKey is an operation that can be
Hello Folks:
I'd like to do market basket analysis using Spark; what are my options?
Thanks,
Rohit Pujari
Solutions Architect, Hortonworks
You probably want to use combineByKey, and create an empty min queue
for each key. Merge values into the queue if its size is < K. If >= K,
only merge the value if it exceeds the smallest element; if so add it
and remove the smallest element.
This gives you an RDD of keys mapped to collections of
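A minimal sketch of this approach, simplified with a sorted list standing in for the bounded min queue (K, the data, and the list-based combiner are my assumptions; a real heap would be more efficient):
val K = 3
val pairs = sc.parallelize(Seq(("a", 5.0), ("a", 1.0), ("a", 7.0), ("a", 2.0), ("b", 3.0)))
val topK = pairs.combineByKey(
  (v: Double) => List(v),                                                                  // create combiner
  (top: List[Double], v: Double) => (v :: top).sorted(Ordering[Double].reverse).take(K),   // merge one value
  (t1: List[Double], t2: List[Double]) => (t1 ++ t2).sorted(Ordering[Double].reverse).take(K))  // merge combiners
topK.collect().foreach(println)   // (a,List(7.0, 5.0, 2.0)), (b,List(3.0))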
You can use Datastax's Cassandra connector.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
Thanks
Best Regards
On Thu, Dec 4, 2014 at 8:21 PM, m.sar...@accenture.com wrote:
Hi,
I have written the code below which is streaming data from kafka, and
I am running a large job using 4000 partitions - after running for four
hours on a 16-node cluster it fails with the following message.
The errors are in Spark code and seem to point to unreliability at the level of
the disk -
Has anyone seen this, and does anyone know what is going on and how to fix it?
Exception
I guess he's already doing so, given the 'saveToCassandra' usage.
What I don't understand is the question how do I specify a batch. That
doesn't make much sense to me. Could you explain further?
-kr, Gerard.
On Thu, Dec 4, 2014 at 5:36 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can
Thanks - I found the same thing -
calling
boolean forceShuffle = true;
myRDD = myRDD.coalesce(120, forceShuffle);
worked - there were 120 partitions but forcing a shuffle distributes the
work
I believe there is a bug in my code causing memory to accumulate as
partitions grow in
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java
has a Java implementation of TaskContext with a very useful method:
/**
 * Return the currently active TaskContext. This can be called inside of
 * user functions to access contextual information about
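That method is TaskContext.get(); assuming Spark 1.2+, a minimal sketch of using it inside a user function:
import org.apache.spark.TaskContext
val tagged = sc.parallelize(1 to 100, 4).mapPartitions { iter =>
  val ctx = TaskContext.get()     // the currently active TaskContext for this task
  val pid = ctx.partitionId()
  iter.map(x => (pid, x))         // tag each element with its partition id
}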
Is it possible to have some state across multiple calls to mapPartitions on
each partition, for instance, if I want to keep a database connection open?
I'll add that some of our data formats will actually infer this sort of
useful information automatically. Both Parquet and cached in-memory tables
keep statistics on the min/max value for each column. When you have
predicates over these sorted columns, partitions will be eliminated if they
can't
Which version of Spark are you using? inferSchema() is improved to
support empty dict in 1.2+, could you try the 1.2-RC1?
Also, you can use applySchema():
from pyspark.sql import *
fields = [StructField('field1', IntegerType(), True),
StructField('field2', StringType(), True),
I used the command below because I'm using Spark 1.0.2 built with SBT and it
worked.
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_GANGLIA_LGPL=true sbt/sbt
assembly
Regarding: Can we create such an array and then parallelize it?
Parallelizing an array of RDDs - i.e. RDD[RDD[x]] is not possible.
RDD is not serializable.
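To illustrate the distinction, a made-up sketch: a plain driver-side collection of RDDs is fine, but an RDD of RDDs is not.
val rdds: Array[org.apache.spark.rdd.RDD[Int]] = (1 to 10).map(i => sc.parallelize(Seq(i))).toArray   // OK: lives on the driver
// sc.parallelize(rdds)   // not supported - this would be an RDD[RDD[Int]], and RDDs are not serializable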
From: Deep Pradhan [mailto:pradhandeep1...@gmail.com]
Sent: 04 December 2014 15:39
To: user@spark.apache.org
Subject: Determination of
Hello Samudrala,
Did you solve this issue about viewing metrics in Ganglia?
Because I have the same problem.
Thanks.
This did not work for me. that is, rdd.coalesce(200, forceShuffle) . Does
anyone have ideas on how to distribute your data evenly and co-locate
partitions of interest?
Is that Spark SQL? I'm wondering if it's possible without Spark SQL.
On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:
You may do this:
table("users").groupBy('zip)('zip, count('user), countDistinct('user))
On 12/4/14 8:47 AM, Arun Luthra wrote:
I'm wondering how to
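Outside of Spark SQL, one way to get a per-zip count and distinct-user count with plain RDD operations (made-up data; at scale an aggregateByKey with sets would avoid groupByKey's memory cost):
val pairs = sc.parallelize(Seq(("94107", "alice"), ("94107", "bob"), ("94107", "alice")))
val counts = pairs.groupByKey().mapValues(users => (users.size, users.toSet.size))   // (count, distinct count)
counts.collect().foreach(println)   // (94107,(3,2))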
Hi Cheng,
Thank you very much for taking your time and providing a detailed
explanation.
I tried a few things you suggested and some more things.
The ContactDetail table (8 GB) is the fact table and DAgents is the Dim
table (500 KB), reverse of what you are assuming, but your ideas still
apply.
Hi Guys:
I have successfully installed apache-spark on Amazon ec2 using the spark-ec2
command and I could login to the master node.
Here is the installation message:
RSYNC'ing /etc/ganglia to slaves...
ec2-54-148-197-89.us-west-2.compute.amazonaws.com
Shutting down GANGLIA gmond:
[FAILED]
I think it is related to my previous questions, but I separate them. In my
previous question, I could not connect to WebUI even though I could log
into the cluster without any problem.
Also, I tried lynx localhost:8080 and I could get the information about the
cluster;
I could also user
I have tried to use the functions where and filter on a SchemaRDD.
I have built a class for the tuples/records in the table like this:
case class Region(num:Int, str1:String, str2:String)
I also successfully create a SchemaRDD.
scala> val results = sqlContext.sql("select * from region")
results:
In my project I defined a new RDD type that wraps another RDD and some
metadata. The code I use is similar to the FilteredRDD implementation:
case class PageRowRDD(
self: RDD[PageRow],
@transient keys: ListSet[KeyLike] = ListSet()
){
override def getPartitions:
You need to import sqlContext._
On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou timchou@gmail.com wrote:
I have tried to use the functions where and filter on a SchemaRDD.
I have built a class for the tuples/records in the table like this:
case class Region(num:Int, str1:String, str2:String)
I also
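For anyone else hitting this, a minimal sketch of what the DSL usage might look like after the import (the column name and predicate are made up, based on the Region case class above):
import sqlContext._                              // brings in the implicit conversions for the query DSL
val results = sqlContext.sql("select * from region")
val filtered = results.where('num === 1)         // Symbol-based column references now compile
filtered.collect().foreach(println)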
...
Thank you! I'm so stupid... This is the only thing I missed in the
tutorial... orz
Thanks,
Tim
2014-12-04 16:49 GMT-06:00 Michael Armbrust mich...@databricks.com:
You need to import sqlContext._
On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou timchou@gmail.com wrote:
I have tried to use
I am trying to load a large HBase table into a Spark RDD to run a Spark SQL
query on the entity. For an entity with about 6 million rows, it will take
about 35 seconds to load it to RDD. Is it expected? Is there any way to
shorten the loading process? I have been getting some tips from
Disclaimer: I am new at Spark.
I did something similar in a prototype which works, but that I did not test
at scale yet.
val agg = users.mapValues(_ => 1).aggregateByKey(new
CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp)
class CustomAggregation() extends
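The poster's CustomAggregation isn't shown, so as a stand-in, here is a minimal aggregateByKey sketch that collects the distinct values per key (data and aggregation are made up):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val agg = pairs.aggregateByKey(Set.empty[Int])(
  (acc, v) => acc + v,     // sequence op: fold one value into the per-partition accumulator
  (s1, s2) => s1 ++ s2)    // combine op: merge accumulators across partitions
agg.collect().foreach(println)   // (a,Set(1, 2)), (b,Set(3))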
Yes , It is working with this in spark-env.sh
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export
Basic question:
What is the best way to loop through one of these and print their components?
Convert them to an array?
Thanks
Deb
Hi,
What is your cluster setup? How much memory do you have? How much space
does one row only consisting of the 3 columns consume? Do you run other
stuff in the background?
Best regards
On 04.12.2014 at 23:57, bonnahu bonn...@gmail.com wrote:
I am trying to load a large Hbase table into SPARK
Hi,
I am a new Spark and Scala user. I was trying to use JdbcRDD to query a MySQL
table. It needs a lowerbound and upperbound as parameters, but I want to get
all the records from the table in a single query. Is there a way I can do
that?
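Not an authoritative answer, but for context, a JdbcRDD sketch (table, columns, credentials and bounds are made up): the SQL must contain two '?' placeholders, and the lower/upper bounds are only used to split the query into partitions, so covering the full range of the key (e.g. 1 to SELECT MAX(id)) effectively reads the whole table.
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password"),
  "SELECT id, name FROM mytable WHERE ? <= id AND id <= ?",
  1L,            // lowerBound
  1000000L,      // upperBound (an assumed max id)
  10,            // numPartitions
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("name")))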
@ankurdave's concise code at
https://gist.github.com/ankurdave/587eac4d08655d0eebf9, responding to an
earlier thread
(http://apache-spark-user-list.1001560.n3.nabble.com/How-to-construct-graph-in-graphx-tt16335.html#a16355)
shows how to build a graph with multiple edge-types (predicates in
I want to have a database connection per partition of the RDD, and then
reuse that connection whenever mapPartitions is called, which results in
compute being called on the partition.
On Thu, Dec 4, 2014 at 11:07 AM, Paolo Platter paolo.plat...@agilelab.it
wrote:
Could you provide some further
Hi,
On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com
wrote:
I'd like to do market basket analysis using spark, what're my options?
To do it or not to do it ;-)
Seriously, could you elaborate a bit on what you want to know?
Tobias
Hi,
On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya aara...@gmail.com wrote:
Is it possible to have some state across multiple calls to mapPartitions
on each partition, for instance, if I want to keep a database connection
open?
If you're using Scala, you can use a singleton object, this will
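A minimal sketch of that singleton-object idea (the JDBC URL is a placeholder, and the driver must be on the executor classpath): the object is initialized once per executor JVM, so the connection is reused across mapPartitions calls that land on the same executor.
import java.sql.{Connection, DriverManager}
object ConnectionHolder {
  lazy val conn: Connection = DriverManager.getConnection("jdbc:mysql://host:3306/db")   // created lazily, once per JVM
}
val rdd = sc.parallelize(1 to 10)
val enriched = rdd.mapPartitions { iter =>
  val conn = ConnectionHolder.conn       // reused on subsequent calls on this executor
  iter.map { record =>
    // ... look something up via conn ...
    record
  }
}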
At 2014-12-04 16:26:50 -0800, spr s...@yarcdata.com wrote:
I'm also looking at how to represent literals as vertex properties. It seems
one way to do this is via positional convention in an Array/Tuple/List that is
the VD; i.e., to represent height, weight, and eyeColor, the VD could be a
Sure, I'm looking to perform frequent item set analysis on POS data set.
Apriori is a classic algorithm used for such tasks. Since Apriori
implementation is not part of MLLib yet, (see
https://issues.apache.org/jira/browse/SPARK-4001) What are some other
options/algorithms I could use to
Hi all
when I execute:
/spark-1.1.1-bin-hadoop2.4/bin/spark-submit --verbose --master yarn-cluster
--class spark.SimpleApp --jars
/spark-1.1.1-bin-hadoop2.4/lib/spark-assembly-1.1.1-hadoop2.4.0.jar
--executor-memory 1G --num-executors 2
I got the following error during Spark startup (Yarn-client mode):
14/12/04 19:33:58 INFO Client: Uploading resource
file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar
-
Hi,
Here is the configuration of the cluster:
Workers: 2
For each worker,
Cores: 24 Total, 0 Used
Memory: 69.6 GB Total, 0.0 B Used
For the spark.executor.memory, I didn't set it, so it should be the default
value 512M.
How much space does one row only consisting of the 3 columns consume?
the
Is it a limitation that Spark does not support more than one case class at a
time?
Regards,
Rahul
Looks like somehow Spark failed to find the core-site.xml in /etc/hadoop/conf.
I've already set the following env variables:
export YARN_CONF_DIR=/etc/hadoop/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HBASE_CONF_DIR=/etc/hbase/conf
Should I put $HADOOP_CONF_DIR/* to HADOOP_CLASSPATH?
Hi Ted,
Here is the information about the Regions:
Region Server Region Count
http://regionserver1:60030/ 44
http://regionserver2:60030/ 39
http://regionserver3:60030/ 55
Actually my HADOOP_CLASSPATH has already been set to include
/etc/hadoop/conf/*
export
HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase
classpath)
Jianshi
On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
Is it a limitation that Spark does not support more than one case class at a
time?
What do you mean? I do not have the slightest idea what you *could*
possibly mean by "to support a case class".
Tobias
Hi, ALL
How can I group by one column and order by another one, then select the first
row for each group (which is just what a window function does) in Spark SQL?
Best Regards,
Kevin.
SparkContext.textFile() cannot load a file using a UNC path on Windows.
I run the following on Windows XP
val conf = new
SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local")
val sc = new SparkContext(conf)
Hi Tobias,
Thanks Tobias for your response.
I have created objectfiles [person_obj,office_obj] from
csv[person_csv,office_csv] files using case classes[person,office] with API
(saveAsObjectFile)
Now I restarted spark-shell and loaded the objectfiles using API (objectFile).
Once any of one
Window functions are not supported yet, but there is a PR for it:
https://github.com/apache/spark/pull/2953
On 12/5/14 12:22 PM, Dai, Kevin wrote:
Hi, ALL
How can I group by one column and order by another one, then select
the first row for each group (which is just like window function
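Until that lands, one workaround outside SQL is a reduceByKey that keeps the best row per group - a rough sketch (the Rec schema and data are made up; group is the grouping column, order the ordering column):
case class Rec(group: String, order: Long, payload: String)
val rows = sc.parallelize(Seq(Rec("a", 2, "x"), Rec("a", 1, "y"), Rec("b", 5, "z")))
val firstPerGroup = rows
  .map(r => (r.group, r))
  .reduceByKey((r1, r2) => if (r1.order <= r2.order) r1 else r2)   // keep the row that sorts first
  .values
firstPerGroup.collect().foreach(println)   // Rec(a,1,y), Rec(b,5,z)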
With some quick googling, I learnt that we can provide DISTRIBUTE BY
column_name in Hive QL to distribute data based on a column's values. My
question now is: if I use DISTRIBUTE BY id, will there be any performance
improvements? Will I be able to avoid data movement in the shuffle (Exchange
before
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in
Yarn-client mode.
Maybe this patch broke yarn-client.
https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53
Jianshi
On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Rahul,
On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
I have created objectfiles [person_obj,office_obj] from
csv[person_csv,office_csv] files using case classes[person,office] with API
(saveAsObjectFile)
Now I restarted spark-shell and load
Correction:
According to Liancheng, this hotfix might be the root cause:
https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce
Jianshi
On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like the datanucleus*.jar shouldn't appear in
Tobias,
Thanks for quick reply.
Definitely, after a restart the case classes need to be defined again.
I have done so; that's why Spark is able to load the objectfile [e.g. person_obj]
and Spark has maintained the serialVersionUID [person_obj].
Next time when I am trying to load another objectfile [e.g.
Rahul,
On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
I have done so; that's why Spark is able to load the objectfile [e.g. person_obj]
and Spark has maintained the serialVersionUID [person_obj].
Next time when I am trying to load another objectfile [e.g.
Thanks for flagging this. I reverted the relevant YARN fix in Spark
1.2 release. We can try to debug this in master.
On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:
I created a ticket for this:
https://issues.apache.org/jira/browse/SPARK-4757
Jianshi
On Fri,
Hi all,
I made some code changes in mllib project and as mentioned in the previous
mails I did
mvn install -pl mllib
Now when I run a program in examples using run-example, the new code is not
executing; instead the previous code itself is running.
But if I do an mvn install on the entire Spark
Hi Gerard/Akhil,
By "how do I specify a batch" I was trying to ask when the data in
the JavaDStream gets flushed into the Cassandra table.
I read somewhere that the streaming data gets written to Cassandra in batches.
This batch can be of some particular time, or one particular run.
Can someone please explain how RDD.aggregate works? I looked at the average
example done with aggregate() but I'm still confused about this function...
much appreciated.
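For reference, a minimal version of that average example (my own sketch): the zero value is a (sum, count) pair, the first function folds one element into a partition's pair, and the second merges the per-partition pairs.
val nums = sc.parallelize(1 to 100)
val (sum, count) = nums.aggregate((0L, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1L),   // seqOp: add one element to the running (sum, count)
  (a, b)   => (a._1 + b._1, a._2 + b._2))  // combOp: merge two partial (sum, count) pairs
val avg = sum.toDouble / count             // 50.5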
The batch is the batch duration that you specify while creating the
StreamingContext, so at the end of every batch's computation the data will
get flushed to Cassandra. And why are you stopping your program with Ctrl +
C? You can always specify a timeout with ssc.awaitTermination(timeoutMillis)
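To make that concrete, a small sketch of where the batch duration comes from (the 10 seconds and app name are just examples):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("kafka-to-cassandra")
val ssc = new StreamingContext(conf, Seconds(10))   // output such as saveToCassandra fires once per 10s batch
// ... create the Kafka DStream and wire up saveToCassandra here ...
ssc.start()
ssc.awaitTermination()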
Tobias,
Find the csv and scala files; below are the steps:
1. Copy csv files in current directory.
2. Open spark-shell from this directory.
3. Run one_scala file which will create object-files from csv-files in
current directory.
4. Restart spark-shell
5. a. Run two_scala file, while running it is
It says connection refused, just make sure the network is configured
properly (open the ports between master and the worker nodes). If the ports
are configured correctly, then I assume the process is getting killed for
some reason and hence connection refused.
Thanks
Best Regards
On Fri, Dec 5,
Yes, there is a way. Just add the following piece of code before creating
the SparkContext.
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
Thanks
Best Regards
On Fri, Dec 5, 2014 at 12:48 AM,
Its working http://ec2-54-148-248-162.us-west-2.compute.amazonaws.com:8080/
If it didn't install correctly, then you could try the spark-ec2 script with
*--resume*, I think.
Thanks
Best Regards
On Fri, Dec 5, 2014 at 3:11 AM, Xingwei Yang happy...@gmail.com wrote:
Hi Guys:
I have successfully
Hello,
I work for an eCommerce company. Currently we are looking at building a Data
warehouse platform as described below:
DW as a Service
|
REST API
|
SQL on NoSQL (Drill/Pig/Hive/Spark SQL)
|
NoSQL databases (one or more; may be RDBMS directly too)
| (Bulk load)
MySQL
Sorry for the late follow-up.
I used Hao's DESC EXTENDED command and found some clues:
new (broadcast broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
old (broadcast working
If I run ANALYZE without NOSCAN, then Hive can successfully get the size:
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
Is Hive's PARQUET support broken?
Jianshi
On Fri, Dec 5, 2014 at 3:30
Rahul,
On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
1. Copy csv files in current directory.
2. Open spark-shell from this directory.
3. Run one_scala file which will create object-files from csv-files in
current directory.
4. Restart spark-shell
With Liancheng's suggestion, I've tried setting
spark.sql.hive.convertMetastoreParquet false
but ANALYZE with NOSCAN still returns -1 in rawDataSize.
Jianshi
On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
If I run ANALYZE without NOSCAN, then Hive can successfully