Looks like it is this PR:
https://github.com/mesos/spark-ec2/pull/133
On Tue, Aug 25, 2015 at 9:52 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Yeah, that's a known issue and we have a PR out to fix it.
Shivaram
On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen g...@cornell.edu
However I do think it's easier than it seems to write the implicits;
it doesn't involve new classes or anything. Yes it's pretty much just
what you wrote. There is a class Vector in Spark. This declaration
can be in an object; you don't implement your own class. (Also you can
use toBreeze to
This worked for me locally:
spark-1.4.1-bin-hadoop2.4/bin/spark-submit --conf
spark.executor.extraClassPath=/.m2/repository/ch/qos/logback/logback-core/1.1.2/logback-core-1.1.2.jar:/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar
--conf
On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
Now I am going to try it out on our mesos cluster.
I assumed spark.executor.extraClassPath takes a comma-separated list of jars the way --jars
does, but it should be ':'-separated like a regular classpath.
Ah, yes, those options
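For reference, a rough sketch of the same idea expressed through SparkConf (paths here are placeholders): --jars takes a comma-separated list, while spark.executor.extraClassPath is an ordinary ':'-separated classpath string.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical jar locations; --jars would take "a.jar,b.jar" (comma-separated),
// while extraClassPath is a plain ':'-separated classpath.
val conf = new SparkConf()
  .setAppName("extraClassPathExample")
  .set("spark.executor.extraClassPath",
       "/opt/libs/logback-core-1.1.2.jar:/opt/libs/logback-classic-1.1.2.jar")
val sc = new SparkContext(conf)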
Hi,
I've just started playing about with SparkR (Spark 1.4.1), and noticed
that a number of the functions haven't been exported. For example,
the textFile function
https://github.com/apache/spark/blob/master/R/pkg/R/context.R
isn't exported, i.e. the function isn't in the NAMESPACE file. This
I wouldn't try to play with forwarding/tunnelling; it's always hard to work out
what ports get used everywhere, and the services like the hostname to match the URL in paths.
Can't you just set up an entry in the Windows /etc/hosts file? It's what I do
(on Unix) to talk to VMs
On 25 Aug 2015, at 04:49, Dino
What about declaring a few simple implicit conversions between the
MLlib and Breeze Vector classes? if you import them then you should be
able to write a lot of the source code just as you imagine it, as if
the Breeze methods were available on the Vector object in MLlib.
The problem is that
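As a rough illustration of the idea being discussed, here is a minimal sketch of such implicit conversions placed in an object; the names (VectorImplicits, asBreeze, fromBreeze) are invented for the example and are not part of Spark's public API.

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

object VectorImplicits {
  // Convert an MLlib vector to the corresponding Breeze vector.
  implicit def asBreeze(v: Vector): BV[Double] = v match {
    case dv: DenseVector  => new BDV[Double](dv.values)
    case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
  }

  // Convert a Breeze vector back to an MLlib vector.
  implicit def fromBreeze(bv: BV[Double]): Vector = bv match {
    case dv: BDV[Double] => Vectors.dense(dv.toArray)
    case sv: BSV[Double] => Vectors.sparse(sv.length, sv.index, sv.data)
    case other           => Vectors.dense(other.toArray)
  }
}

With import VectorImplicits._ in scope, MLlib vectors can then be passed where Breeze operations are expected, which is roughly what the thread is suggesting.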
Hi,
I wish to know if MLlib supports CHAID regression and classification trees.
If yes, how can I build them in spark?
Thanks,
Jatin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html
Sent from the Apache Spark User List
Well, this is very strange. My only change is to add -X to
make-distribution and it succeeds:
% git diff
(spark/spark)
diff --git a/make-distribution.sh b/make-distribution.sh
index a2b0c43..351fac2 100755
--- a/make-distribution.sh
+++
Final chance to fill out the survey!
http://goo.gl/forms/erct2s6KRR
I'm gonna close it to new responses tonight and send out a summary of the
results.
Nick
On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:
I'm planning to close the survey to further responses
Hmm. I have a lot of code on the local linear algebra operations using
Spark's Matrix and Vector representations
done for https://issues.apache.org/jira/browse/SPARK-6442.
I can make a Spark package with that code if people are interested.
Best,
Burak
On Tue, Aug 25, 2015 at 10:54 AM, Kristina
Hello,
I'm using direct spark streaming (from kafka) with checkpointing, and
everything works well until a restart. When I shut down (^C) the first
streaming job, wait 1 minute, then re-submit, there is somehow a series of 0
event batches that get queued (corresponding to the 1 minute when the
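For context, a rough sketch of the usual checkpoint-recovery pattern for a direct Kafka stream (the broker list, topic, batch interval and checkpoint directory below are placeholders, not the poster's actual code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///tmp/checkpoints"  // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("direct-kafka-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))
  stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
  ssc
}

// On restart this recreates the context from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()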
YES PLEASE!
:)))
On Tue, Aug 25, 2015 at 1:57 PM, Burak Yavuz brk...@gmail.com wrote:
Hmm. I have a lot of code on the local linear algebra operations using
Spark's Matrix and Vector representations
done for https://issues.apache.org/jira/browse/SPARK-6442.
I can make a Spark package
Does the first batch after restart contain all the messages received while
the job was down?
On Tue, Aug 25, 2015 at 12:53 PM, suchenzang suchenz...@gmail.com wrote:
Hello,
I'm using direct spark streaming (from kafka) with checkpointing, and
everything works well until a restart. When I
Hello,
I am using sbt and created a unit test where I create a `HiveContext` and
execute some query and then return. Each time I run the unit test the JVM
will increase its memory usage until I get the error:
Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
Exception
Hi all,
when I read parquet files with required fields (aka nullable=false) they
are read correctly. Then I save them (df.write.parquet) and read them again; all
my fields are saved and read as optional (aka nullable=true), which means I
suddenly have files with incompatible schemas. This happens on
Yes, you're right that it's quite on purpose to leave this API to
Breeze, in the main. As you can see the Spark objects have already
sprouted a few basic operations anyway; there's a slippery slope
problem here. Why not addition, why not dot products, why not
determinants, etc.
What about
Corrected a typo in the subject of your email.
What you cited seems to be from worker node startup.
Was there any other error you saw?
Please list the command you used.
Cheers
On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen g...@cornell.edu wrote:
Hi All,
I am trying to launch a
Hi All,
I have the following scenario:
There is a booking table in Cassandra, which holds fields like
bookingid, passengerName, contact, etc.
Now in my spark streaming application, there is one class Booking which
acts as a container and holds all the field details -
class
Hi All,
I am trying to launch a spark cluster on ec2 with spark 1.4.1.
The script finished but I am getting an error at the end, as follows. What
should I do to correct this issue? Thank you very much for your input.
Starting httpd: httpd: Syntax error on line 199 of
I'm not sure why the UI appears broken like that either and haven't
investigated it myself yet, but if you instead go to the YARN
ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x,
I believe), then you should be able to click on the ApplicationMaster link
(or the History
Here is the error
yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User
class threw exception: Log directory
hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302
already exists!)
I am using cloudera 5.3.2 with Spark 1.2.0
Any help is appreciated.
I'm having trouble with refreshTable, I suspect because I'm using it
incorrectly.
I am doing the following:
1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet
2. use registerTempTable to register my dataframe
3. A new file is dropped under /foo/bar/
4. Call
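For reference, a rough sketch of the sequence described above, assuming a HiveContext (where refreshTable is available in Spark 1.x); the path and table name are placeholders.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// 1-2. Create a DataFrame from a wildcard parquet path and register it.
val df = sqlContext.read.parquet("/foo/bar/*.parquet")
df.registerTempTable("bar")

// 3-4. After new files are dropped under /foo/bar/, ask Spark to
//      invalidate its cached metadata for the table and re-query.
sqlContext.refreshTable("bar")
sqlContext.sql("SELECT COUNT(*) FROM bar").show()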
I'm trying to build Spark using Intellij on Windows. But I'm repeatedly
getting this error
spark-master\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala
Error:(46, 66) not found: type SparkFlumeProtocol
val transactionTimeout: Int, val
Yes I get all that too and I think there's a legit question about
whether moving a little further down the slippery slope is worth it
and if so how far. The other catch here is: either you completely
mimic another API (in which case why not just use it directly, which
has its own problems) or you
Well, yes, the hack below works (that's all I have time for), but is not
satisfactory - it is not safe, and is verbose and very cumbersome to use,
does not separately deal with SparseVector case and is not complete either.
My question is, out of hundreds of users on this list, someone must have
Any resources on this?
On Aug 25, 2015, at 3:15 PM, shahid qadri shahidashr...@icloud.com wrote:
I would like to implement the sorted neighborhood approach in Spark; what is the
best way to write that in PySpark?
Hi Feynman,
Thanks for the information. Is there a way to depict a decision tree as a
visualization for large amounts of data using any other technique/library?
Thanks,
Jatin
On Tue, Aug 25, 2015 at 11:42 PM, Feynman Liang fli...@databricks.com
wrote:
Nothing is in JIRA
Hi,
Sure you can. StreamingContext has a property def sparkContext: SparkContext (see docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
). Think of DStream, the main abstraction in Spark Streaming, as a sequence
of RDDs. Each DStream can be
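A minimal sketch of what that looks like in practice (batch interval and RDD contents are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ssc-sparkcontext-example")
val ssc = new StreamingContext(conf, Seconds(5))

// The underlying SparkContext is available on the StreamingContext,
// so batch-style RDDs can be created alongside the streams.
val sc = ssc.sparkContext
val lookup = sc.parallelize(Seq(("a", 1), ("b", 2)))
println(lookup.count())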
Hi community members,
Apache Spark is Fantastic and very easy to learn.. Awesome work!!!
Question:
I have multiple files in a folder, and the first line in each file is the name
of the asset that the file belongs to. The second line is the CSV header row and data
starts from the third row.
Ex:
The error in #1 below was not informative.
Are you able to get a more detailed error message?
Thanks
On Aug 25, 2015, at 6:57 PM, Todd bit1...@163.com wrote:
Thanks Ted Yu.
Following are the error message:
1. The exception that is shown on the UI is :
Exception in thread Thread-113
I think the answer is no. I only see such messages on the console, and #2 is the
thread stack trace.
I am thinking that Spark SQL Perf forks many dsdgen processes to generate
data when the scale factor is increased, which eventually exhausts the JVM.
When the thread exception is thrown on the console
I figured it all out after this:
http://apache-spark-user-list.1001560.n3.nabble.com/WebUI-on-yarn-through-ssh-tunnel-affected-by-AmIpfilter-td21540.html
The short of it is that I needed to set
SPARK_PUBLIC_DNS (not DNS_HOME) = ec2_publicdns
then
the YARN proxy gets in the way, so I needed to go to:
For a single decision tree, the closest I can think of is printDebugString,
which gives you a text representation of the decision thresholds and paths
down the tree.
I don't think there's anything in MLlib for visualizing GBTs or random
forests
On Tue, Aug 25, 2015 at 9:20 PM, Jatinpreet Singh
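For a concrete illustration, a minimal sketch that prints a tree's text representation via MLlib's DecisionTreeModel.toDebugString (the representation described above); the data path and training parameters are placeholders.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Placeholder path; LibSVM-formatted training data.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val model = DecisionTree.trainClassifier(
  data, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini", maxDepth = 5, maxBins = 32)

// Text representation of the decision thresholds and paths down the tree.
println(model.toDebugString)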
Go to the module settings of the project and in the dependencies section
check the scope of scala jars. It would be either Test or Provided. Change
it to compile and it should work. Check the following link to understand
more about scope of modules:
As I remember, you also need to change the guava and jetty related dependencies
to compile if you want to run SparkPi in IntelliJ.
On Tue, Aug 25, 2015 at 3:15 PM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Go to the module settings of the project and in the dependencies section
check the scope of
I have pretty much the same symptoms - the computation itself is pretty
fast, but most of my computation is spent in JavaToPython steps (~15min).
I'm using the Spark 1.5.0-rc1 with DataFrame and ML Pipelines.
Any insights into what these steps are exactly ?
2015-06-02 9:18 GMT+02:00 Karlson
Ok, I see, thanks for the correction, but this should be optimized.
From: Shixiong Zhu [mailto:zsxw...@gmail.com]
Sent: Tuesday, August 25, 2015 2:08 PM
To: Cheng, Hao
Cc: Jeff Zhang; user@spark.apache.org
Subject: Re: DataFrame#show cost 2 Spark Jobs ?
That's two jobs. `SparkPlan.executeTake`
Thank you guys.
Yes, I have fixed the guava, spark core, scala and jetty dependencies, and I can run
Pi now.
At 2015-08-25 15:28:51, Jeff Zhang zjf...@gmail.com wrote:
As I remember, you also need to change the guava and jetty related dependencies to
compile if you want to run SparkPi in IntelliJ.
In the spark shell, "use database" is not working; it says "use" is not found in the
shell?
Did you run this with the Scala shell?
On 24 August 2015 at 18:26, Ishwardeep Singh ishwardeep.si...@impetus.co.in
wrote:
Hi Jeetendra,
I faced this issue. I did not specify the database where this table
exists.
Hi,
We have a spark standalone cluster running on linux.
We have a job that we submit to the spark cluster on windows. When
submitting this job using Windows, the execution failed with this error
in the notes: java.lang.IllegalArgumentException: Invalid environment
variable name: =::. When
Attribute is the Catalyst name for an input column from a child operator.
An AttributeReference has been resolved, meaning we know which input column
in particular it is referring to. An AttributeReference also has a known
DataType. In contrast, before analysis there might still exist
Port 8020 is not the only port you need tunnelled for HDFS to work. If you
only list the contents of a directory, port 8020 is enough... for instance,
using something
val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/")
val fs = p.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(p)
Ok, I went in the direction of system vars from the beginning, probably because
the question was about passing variables to a particular job.
Anyway, the decision to use either system vars or environment vars would
solely depend on whether you want to make them available to all the spark
processes on a
Hi Jeetendra,
Please try the following in the spark shell; it is like executing an SQL command.
sqlContext.sql("use <database name>")
Regards,
Ishwardeep
From: Jeetendra Gangele gangele...@gmail.com
Sent: Tuesday, August 25, 2015 12:57 PM
To: Ishwardeep Singh
spark-shell and spark-sql can not be deployed with yarn-cluster mode,
because you need to make spark-shell or spark-sql scripts run on your local
machine rather than in a container of the YARN cluster.
2015-08-25 16:19 GMT+08:00 Jeetendra Gangele gangele...@gmail.com:
Hi All i am trying to launch the
I have set spark.sql.shuffle.partitions=1000, but it's still failing.
On Tue, Aug 25, 2015 at 11:36 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
Did you try increasing sql partitions?
On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com
wrote:
I am running
Thanks Chenghao!
At 2015-08-25 13:06:40, Cheng, Hao hao.ch...@intel.com wrote:
Yes, check the source code
under:https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst
From: Todd [mailto:bit1...@163.com]
Sent: Tuesday, August 25, 2015 1:01
I cloned the code from https://github.com/apache/spark to my machine. It compiles
successfully, but when I run SparkPi, it throws the exception below complaining that
scala.collection.Seq is not found.
I have installed Scala 2.10.4 on my machine and use the default profiles:
Oh, sorry, I missed reading your reply!
I know the minimum tasks will be 2 for scanning, but Jeff is talking about 2
jobs, not 2 tasks.
From: Shixiong Zhu [mailto:zsxw...@gmail.com]
Sent: Tuesday, August 25, 2015 1:29 PM
To: Cheng, Hao
Cc: Jeff Zhang; user@spark.apache.org
Subject: Re:
Did you try increasing sql partitions?
On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com
wrote:
I am running this query on a data size of 4 billion rows and
getting org.apache.spark.shuffle.FetchFailedException error.
select adid,position,userid,price
from (
select
Sorry am I missing something? There is a method sortBy on both RDD and
PairRDD.
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int =
this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
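For example (in the spark-shell, where sc is available; the data is arbitrary):

val rdd = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))
// Sort by the second field, descending.
rdd.sortBy(_._2, ascending = false).collect().foreach(println)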
That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this
case.
Best Regards,
Shixiong Zhu
2015-08-25 14:01 GMT+08:00 Cheng, Hao hao.ch...@intel.com:
Oh, sorry, I missed reading your reply!
I know the minimum tasks will be 2 for scanning, but Jeff is talking about
2 jobs, not
We plan to upgrade our spark cluster to 1.4, and I just ran a test in
local mode following this reference:
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/
but an exception was thrown when running the example; the stack trace is below:
Exception in thread main
Are you actually losing messages then?
On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang suchenz...@gmail.com wrote:
No; first batch only contains messages received after the second job
starts (messages come in at a steady rate of about 400/second).
On Tue, Aug 25, 2015 at 11:07 AM, Cody
Yeah. All messages are lost while the streaming job was down.
On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger c...@koeninger.org wrote:
Are you actually losing messages then?
On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang suchenz...@gmail.com wrote:
No; first batch only contains messages
Sounds like something's not set up right... can you post a minimal code
example that reproduces the issue?
On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang suchenz...@gmail.com wrote:
Yeah. All messages are lost while the streaming job was down.
On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger
No; first batch only contains messages received after the second job starts
(messages come in at a steady rate of about 400/second).
On Tue, Aug 25, 2015 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote:
Does the first batch after restart contain all the messages received while
the job was
Nothing is in JIRA
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22
so AFAIK no; only random forests and GBTs using entropy or Gini for
information gain are supported.
On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I
Hello All,
PySpark currently has two ways of performing a join: specifying a join
condition or column names.
I would like to perform a join using a list of columns that appear in both
the left and right DataFrames. I have created an example in this question
on Stack Overflow
That's what I'd suggest too. Furthermore, if you use vagrant to spin up
VMs, there's a module that can do that automatically for you.
R.
2015-08-25 10:11 GMT-07:00 Steve Loughran ste...@hortonworks.com:
I wouldn't try to play with forwarding/tunnelling; it's always hard to work
out what ports get
While it's true locality might speed things up, I'd say it's a very bad idea to
mix your Spark and ES clusters - if your ES cluster is serving production
queries (and in particular using aggregations), you'll run into performance
issues on your production ES cluster.
ES-hadoop uses ES scan
I would like to implement the sorted neighborhood approach in Spark; what is the
best way to write that in PySpark?
Instead of foreach, try to use foreachPartition, which will initialize the
connector per partition rather than per record.
Thanks
Best Regards
On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz
wysakowicz.da...@gmail.com wrote:
No the connector does not need to be serializable cause it is
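A rough sketch of the pattern being suggested; the Connector type and its open/send/close calls are hypothetical stand-ins for whatever client is actually being used.

// "Connector" is a placeholder for the real client (DB, HTTP, queue, ...).
rdd.foreachPartition { records =>
  val connection = Connector.open()   // opened once per partition, not per record
  try {
    records.foreach(record => connection.send(record))
  } finally {
    connection.close()                // closed once per partition
  }
}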
You can change the names; whatever program is pushing the records must
follow the naming conventions. Try to replace ':' with '_' or something.
Thanks
Best Regards
On Tue, Aug 18, 2015 at 10:20 AM, Brian Stempin brian.stem...@gmail.com
wrote:
Hi,
I'm running Spark on Amazon EMR (Spark 1.4.1,
If the data is local to the machine then obviously it will be faster
compared to pulling it through the network and storing it locally (either
memory or disk etc). Have a look at the data locality
You hit "block not found" issues when your processing time exceeds the batch
duration (this happens with receiver-oriented streaming). If you are
consuming messages from Kafka then try to use the directStream or you can
also set StorageLevel to MEMORY_AND_DISK with receiver oriented consumer.
(This
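For reference, rough sketches of the two options mentioned; brokers, ZooKeeper quorum, group id and topics are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Option 1: receiver-less direct stream.
val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

// Option 2: receiver-based stream, spilling blocks to disk under memory pressure.
val receiver = KafkaUtils.createStream(
  ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1),
  StorageLevel.MEMORY_AND_DISK)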
Thank you Michael for the detailed explanation, it makes it clear to me. Thanks!
At 2015-08-25 15:37:54, Michael Armbrust mich...@databricks.com wrote:
Attribute is the Catalyst name for an input column from a child operator. An
AttributeReference has been resolved, meaning we know which input
Kristina,
Thanks for the discussion. I followed up on your problem and learned that Scala
doesn't support multiple implicit conversions in a single expression
http://stackoverflow.com/questions/8068346/can-scala-apply-multiple-implicit-conversions-in-one-expression
for
complexity reasons. I'm
Hi,
I am using Spark-1.4 and Kafka-0.8.2.1
As per Google suggestions, I rebuilt all the classes with protobuf-2.5
dependencies. My new protobuf is compiled using 2.5. However, now my spark
job does not start. It's throwing a different error. Does Spark or any of its
other dependencies use an old
Was your Spark built with Hive?
I met the same problem before because the hive-exec jar in Maven itself
includes the protobuf classes, which will be included in the Spark jar.
Yong
Date: Tue, 25 Aug 2015 12:39:46 -0700
Subject: Re: Protobuf error when streaming from Kafka
From: lcas...@gmail.com
I downloaded below binary version of spark.
spark-1.4.1-bin-cdh4
On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote:
Was your Spark built with Hive?
I met the same problem before because the hive-exec jar in Maven
itself includes the protobuf classes, which will be included in
The PermGen space error is controlled with the MaxPermSize parameter. I run
with this in my pom, I think copied pretty literally from Spark's own
tests... I don't know what the sbt equivalent is but you should be able to
pass it...possibly via SBT_OPTS?
plugin
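Since the question asks about the sbt equivalent: a minimal sketch of the corresponding build.sbt settings (the sizes are arbitrary); forking the test JVM is needed for the option to take effect.

// build.sbt — run tests in a forked JVM with a larger PermGen.
fork in Test := true
javaOptions in Test ++= Seq("-XX:MaxPermSize=256m", "-Xmx2g")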
Thanks. I just tried and still am having trouble. It seems to still be
using the private address even if I try going through the resource manager.
On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan jonat...@amazon.com
wrote:
I'm not sure why the UI appears broken like that either and haven't
Hi, On our production environment, we have a unique problems related to Spark
SQL, and I wonder if anyone can give me some idea what is the best way to
handle this.
Our production Hadoop cluster is IBM BigInsight Version 3, which comes with
Hadoop 2.2.0 and Hive 0.12.
Right now, we build spark
Sure thing!
The main looks like:
--
val kafkaBrokers = conf.getString(s"$varPrefix.metadata.broker.list")
val kafkaConf = Map(
  "zookeeper.connect" -> zookeeper,
  "group.id" -> options.group,
It would be good to support this; could you create a JIRA for it and target it for 1.6?
On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise
michal.monsel...@gmail.com wrote:
Hello All,
PySpark currently has two ways of performing a join: specifying a join
condition or column names.
I would like to
Based on what I've read it appears that when using spark streaming there is
no good way of optimizing the files on HDFS. Spark streaming writes many
small files, which is not scalable in Apache Hadoop. The only other way seems to
be to read the files after they have been written and merge them into a bigger
So do I need to manually copy these 2 jars on my spark executors?
On Tue, Aug 25, 2015 at 10:51 AM, Marcelo Vanzin van...@cloudera.com
wrote:
On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar utkarsh2...@gmail.com
wrote:
Now I am going to try it out on our mesos cluster.
I assumed
Hi,
I have stumbled upon an issue with iterative Graphx computation (using v
1.4.1). It goes thusly --
Setup
1. Construct a graph.
2. Validate that the graph satisfies certain conditions. Here I do some
assert(conditions) within graph.triplets.foreach(). [Notice that this
materializes the
Hi,
I just wonder if there's any way that I can get some sample data (10-20
rows) out of Spark's Hive using NodeJs?
Submitting a spark job to show 20 rows of data in a web page is not good for
me.
I've set up the Spark Thrift Server as shown in the Spark docs. The server works
because I can use beeline
Thanks for the quick response.
I have tried the direct word count python example and it also seems to be
slow. A lot of times it is not fetching the words that are sent by the
producer.
I am using SPARK version 1.4.1 and KAFKA 2.10-0.8.2.0.
On Tue, Aug 25, 2015 at 2:05 AM, Tathagata Das
Great advice.
Thanks a lot Nick.
In fact, we use the rdd.persist(DISK) command at the beginning of the
program to avoid hitting the network again and again. The speed is not
influenced a lot: in my case, it is just 1 min more compared to the
situation where we put the data in local HDFS.
Cheers
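For reference, what the rdd.persist(DISK) shorthand mentioned above corresponds to in the RDD API (a sketch; the source path is a placeholder and rdd stands for whatever dataset is reused):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("s3n://some-bucket/input")   // placeholder source
val cached = rdd.persist(StorageLevel.DISK_ONLY)
cached.count()  // first action materializes the on-disk copy; later stages reuse it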
When I launch with yarn-client it also gives me the below error
bin/spark-sql --master yarn-client
15/08/25 13:53:20 ERROR YarnClientSchedulerBackend: Yarn application has
already exited with state FINISHED!
Exception in thread Yarn application state monitor
org.apache.spark.SparkException:
Looks like you were attaching images to your email which didn't go through.
Consider using a third-party site for images, or paste the error as text.
Cheers
On Tue, Aug 25, 2015 at 4:22 AM, Todd bit1...@163.com wrote:
Hi,
The spark sql perf itself contains benchmark data generation. I am using
Yes, when I look at my YARN logs for that particular failed app_id, I get the
following error.
ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting
for 10 ms. Please check earlier log output for errors. Failing the
application
For this error, I need to change the
I am not quite sure about this but should the notation not be
s3n://redactedbucketname/*
instead of
s3a://redactedbucketname/*
The best way is to use s3://bucketname/path/*
Regards,
Gourav
On Tue, Aug 25, 2015 at 10:35 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can change the names,
Hi All,
I want to use the Convert() function in SQL in one of my Spark SQL queries.
Can anyone tell me whether it is supported or not?
Hi,
The spark sql perf itself contains benchmark data generation. I am using spark
shell to run the spark sql perf to generate the data with 10G memory for both
driver and executor.
When I increase the scale factor to 30 and run the job, I get the
following error:
When I jstack it to
Tried adding 50010, 50020 and 50090. Still no difference.
I can't imagine I'm the only person on the planet wanting to do this.
Anyway, thanks for trying to help.
Dino.
On 25 August 2015 at 08:22, Roberto Congiu roberto.con...@gmail.com wrote:
Port 8020 is not the only port you need tunnelled
Hello,
We had the same problem. I've written a blog post with the detailed
explanation and workaround:
http://labs.totango.com/spark-read-file-with-colon/
Greetings,
Romi K.
On Tue, Aug 25, 2015 at 2:47 PM Gourav Sengupta gourav.sengu...@gmail.com
wrote:
I am not quite sure about this but
Hi,
I have serious problems with saving a DataFrame as a parquet file.
I read the data from the parquet file like this:
val df = sparkSqlCtx.parquetFile(inputFile.toString)
and print the schema (you can see both fields are required)
root
|-- time: long (nullable = false)
|-- time_ymdhms: long
Do you think this binary would have the issue? Do I need to build Spark from
source?
On Tue, Aug 25, 2015 at 1:06 PM, Cassa L lcas...@gmail.com wrote:
I downloaded below binary version of spark.
spark-1.4.1-bin-cdh4
On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote:
Did
A quick question regarding this: how come the artifacts (spark-core in
particular) on Maven Central are built with JDK 1.6 (according to the
manifest), if Java 7 is required?
On Aug 21, 2015 5:32 PM, Sean Owen so...@cloudera.com wrote:
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM
I want to persist a large _sorted_ table to Parquet on S3 and then read
this in and join it using the Sorted Merge Join strategy against another
large sorted table.
The problem is: even though I sort these tables on the join key beforehand,
once I persist them to Parquet, they lose the
Hi,
I am trying to start a spark thrift server using the following command on
Spark 1.3.1 running on yarn:
* ./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032
--executor-memory 512m --hiveconf
hive.server2.thrift.bind.host=test-host.sn1 --hiveconf
Did you register the temp table via beeline or in a new Spark SQL CLI?
As far as I know, a temp table cannot be shared across HiveContexts.
Hao
From: Udit Mehta [mailto:ume...@groupon.com]
Sent: Wednesday, August 26, 2015 8:19 AM
To: user
Subject: Spark thrift server on yarn
Hi,
I am trying to start a
I registered it in a new Spark SQL CLI. Yeah I thought so too about how the
temp tables were accessible across different applications without using a
job-server. I see that running
HiveThriftServer2.startWithContext(hiveContext) within the spark app
starts up a thrift server.
On Tue, Aug 25,
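For reference, a minimal sketch of that approach, assuming a HiveContext with the temp tables registered in the same application (the path and table name are placeholders):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
hiveContext.read.parquet("/data/events").registerTempTable("events")  // placeholder

// Expose this application's temp tables over JDBC/ODBC from within the app.
HiveThriftServer2.startWithContext(hiveContext)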
-cdh-user
This suggests that Maven is still using Java 6. I think this is indeed
controlled by JAVA_HOME. Use 'mvn -X ...' to see a lot more about what
is being used and why. I still suspect JAVA_HOME is not visible to the
Maven process. Or maybe you have JRE 7 installed but not JDK 7 and
it's
Hi all,
I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE }
And at runtime, I create instances of this class with different AnyVal + String
types, but the return type of some_method varies.
I know I could do this with an implicit object, IF some_method received a type,
but