That's a Hive version issue, not a Hadoop version issue.
On Tue, Oct 7, 2014 at 7:21 AM, Li HM wrote:
> Thanks for the reply.
>
> Please refer to my other post entitled "How to make ./bin/spark-sql
> work with hive". It has all the errors/exceptions I am getting.
>
> If I understand you correctl
Thanks for the reply.
Please refer to my other post entitled "How to make ./bin/spark-sql
work with hive". It has all the errors/exceptions I am getting.
If I understand you correctly, I can build the package with
mvn -Phive,hadoop-2.4 -Dhadoop.version=2.5.0 clean package
This is what I actua
The hadoop-2.4 profile is really intended to be "Hadoop 2.4+". It
should compile and run fine with Hadoop 2.5 as far as I know. CDH 5.2
is Hadoop 2.5 + Spark 1.1, so there is evidence it works. You didn't
say what doesn't work.
On Tue, Oct 7, 2014 at 6:07 AM, hmxxyy wrote:
> Does Spark 1.1.0 work
Does Spark 1.1.0 work with Hadoop 2.5.0?
The Maven build instructions only list command options up to Hadoop 2.4.
Anybody ever made it work?
I am trying to run spark-sql with hive 0.12 on top of hadoop 2.5.0 but can't
make it work.
Hi,
The other option to consider is using G1 GC, which should behave better
with large heaps. But pointers are not compressed in heaps > 32 GB in
size, so you may be better off staying under 32 GB.
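If it helps, here is a minimal sketch of how those two pieces of advice could be expressed as Spark settings (the exact sizes and values below are assumptions on my side, not something you must use):

import org.apache.spark.SparkConf

// Hypothetical values: G1 on the executors, and an executor heap kept just
// under 32 GB so the JVM can still use compressed ordinary object pointers.
val conf = new SparkConf()
  .setAppName("g1-example")
  .set("spark.executor.memory", "30g")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")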
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearc
For two large key-value data sets, if they have the same set of keys, what is
the fastest way to join them into one? Suppose all keys are unique in each
data set, and we only care about those keys that appear in both data sets.
input data I have: (k, v1) and (k, v2)
data I want to get from the
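One possible sketch (the sample data, key type and partition count are assumptions): co-partition both RDDs with the same partitioner and then join, so the join itself avoids a full shuffle. Since the keys are unique on each side, each shared key yields exactly one (v1, v2) pair.

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("join-sketch").setMaster("local[*]"))
val part = new HashPartitioner(8)

// Hypothetical inputs; in practice (k, v1) and (k, v2) would be loaded from storage.
val kv1 = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(part)
val kv2 = sc.parallelize(Seq(("a", 10), ("b", 20))).partitionBy(part)

val joined = kv1.join(kv2)   // RDD[(String, (Int, Int))], one pair per shared key
joined.collect().foreach(println)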
OK, cool. These seem to be general JVM issues with very large heaps. I
agree that the best workaround would be to keep the heap size below 32GB.
Thanks guys!
Mingyu
From: Arun Ahuja
Date: Monday, October 6, 2014 at 7:50 AM
To: Andrew Ash
Cc: Mingyu Kim , "user@spark.apache.org"
, Dennis
Rohit,
Thank you very much for releasing the H2 version. My app now compiles
fine and there are no more runtime errors w.r.t. Hadoop 1.x classes or interfaces.
Tian
On Saturday, October 4, 2014 9:47 AM, Rohit Rai wrote:
Hi Tian,
We have published a build against Hadoop 2.0 with version 1.1.0-CT
In Spark 1.0, outer joins are resolved to BroadcastNestedLoopJoin. You can
use Spark 1.1, which resolves outer joins to hash joins.
Hope this helps!
Liqua
On Mon, Oct 6, 2014 at 4:20 PM, Benyi Wang wrote:
> I'm using CDH 5.1.0 with Spark-1.0.0. There is spark-sql-1.0.0 in
> clouder'a maven reposit
I'm using CDH 5.1.0 with Spark 1.0.0. There is spark-sql-1.0.0 in Cloudera's
Maven repository. After putting it on the classpath, I can use spark-sql in
my application.
One issue is that I couldn't get the join to run as a hash join. It gives a
CartesianProduct when I join two SchemaRDDs as follows:
scal
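For what it's worth, a sketch (with hypothetical tables) of an equi-join in Spark SQL 1.1; the explicit equality predicate is what lets the planner choose a hash join instead of a CartesianProduct:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

// Hypothetical schemas and data.
case class Order(orderId: Int, userId: Int)
case class User(userId: Int, name: String)

sc.parallelize(Seq(Order(1, 10), Order(2, 20))).registerTempTable("orders")
sc.parallelize(Seq(User(10, "alice"), User(20, "bob"))).registerTempTable("users")

// The ON clause is a plain equality, so Spark 1.1 can plan this as a hash join.
val joined = sqlContext.sql(
  "SELECT orders.orderId, users.name FROM orders JOIN users ON orders.userId = users.userId")
joined.collect().foreach(println)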
Hi,
I have started a Spark cluster on EC2 using the Spark Standalone cluster
manager, but Spark is trying to identify the workers using hostnames
that are not publicly accessible.
So when I try to submit jobs from Eclipse it fails. Is there some way
Spark can use IP addresses instead of
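A sketch of the kind of configuration that usually helps here; the addresses are placeholders, and it assumes SPARK_LOCAL_IP is also exported in spark-env.sh on every node so the workers advertise a reachable address:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical public IPs: make both the master URL and the driver address
// something that is reachable across the EC2 boundary.
val conf = new SparkConf()
  .setAppName("from-eclipse")
  .setMaster("spark://203.0.113.10:7077")       // master's public IP
  .set("spark.driver.host", "198.51.100.7")     // this machine, as seen by the cluster
val sc = new SparkContext(conf)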
Another rule of thumb: definitely cache the RDD over which you need
to do iterative analysis...
For the rest, only cache if you have a lot of free memory!
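A small sketch of that rule of thumb, with a hypothetical input path: the RDD that an iterative loop re-reads on every pass is the one worth caching.

// Without cache() the file would be re-read and re-parsed on every iteration.
val ratings = sc.textFile("hdfs:///data/ratings")   // hypothetical path
  .map(_.split("\t"))
  .cache()

for (i <- 1 to 10) {
  val valid = ratings.filter(_.length == 3).count()
  println(s"iteration $i: $valid valid rows")
}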
On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen wrote:
> I think you mean that data2 is a function of data1 in the first
> example. I ima
I think you mean that data2 is a function of data1 in the first
example. I imagine that the second version is a little bit more
efficient.
But it has nothing to do with memory or caching. You don't have to
cache anything here if you don't want to. You can cache what you like.
Once memory for the ca
@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the
bottleneck on the master node.
Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes
to store the whole data that needs to be parsed by extract_pages. I have my
data on S3 and I kind of
The problem was mine: I was using FunSuite the wrong way. The test
method of FunSuite registers a "test" to be triggered when the tests are
run. My localTest instead creates and stops the SparkContext during
test registration, and as a result my SparkContext is stopped
when the t
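For reference, a sketch (class and helper names assumed) of the deferred pattern: the SparkContext is created and stopped inside the registered test body, not at registration time.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class MySparkSuite extends FunSuite {

  // The body only runs when ScalaTest executes the test, so the
  // SparkContext lives exactly as long as the test itself.
  def localTest(name: String)(body: SparkContext => Unit): Unit = test(name) {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName(name))
    try body(sc) finally sc.stop()
  }

  localTest("count") { sc =>
    assert(sc.parallelize(Seq(1, 2, 3)).count() === 3)
  }
}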
The problem is that listing the metadata for all these files in S3 takes a long
time. Something you can try is the following: split your files into several
non-overlapping paths (e.g. s3n://bucket/purchase/2014/01,
s3n://bucket/purchase/2014/02, etc), then do sc.parallelize over a list of such
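Something along these lines, as a sketch (the prefixes are the ones from the example above; it assumes the S3 credentials are available in the Hadoop configuration on the executors):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val prefixes = Seq("s3n://bucket/purchase/2014/01", "s3n://bucket/purchase/2014/02")

// One task per prefix: each task does its own listing, so the metadata calls
// happen in parallel on the executors instead of serially on the driver.
val lines = sc.parallelize(prefixes, prefixes.size).flatMap { prefix =>
  val fs = FileSystem.get(new URI(prefix), new Configuration())
  fs.listStatus(new Path(prefix)).filter(_.isFile).iterator.flatMap { status =>
    Source.fromInputStream(fs.open(status.getPath)).getLines().toList
  }
}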
Hi,
I see that this type of question has been asked before; however, I am still a
little confused about it in practice. For example, there are two ways I could
handle a series of RDD transformations before I do an RDD action; which way
is faster:
Way 1:
val data = sc.textFile()
val data1 = data.map(x => f1
Sean,
Thanks a ton Sean... This is exactly what I was looking for.
As mentioned in the code -
// This horrible, separate declaration is necessary to appease the compiler
@SuppressWarnings("unchecked")
Class<? extends OutputFormat<?,?>> outputFormatClass =
    (Class<? extends OutputFormat<?,?>>) (Class<?>) SequenceFileOutputFormat.class;
wri
Try a Hadoop custom InputFormat - I can give you some samples.
While I have not tried this, an input split has only a length (which can be
ignored if the format is treated as non-splittable) and a String for a location.
If the location is a URL into Wikipedia, the whole thing should work.
Hadoop InputFormat
On Mon, Oct 6, 2014 at 1:08 PM, wrote:
> Hi,
>
>
> Thank you for your advice. It really might work, but to specify my problem a
> bit more, think of my data more like one generated item is one parsed
> wikipedia page. I am getting this generator from the parser and I don't want
> to save it to th
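A minimal sketch of the non-splittable-format idea suggested above (the class name is made up): each input file is handed to a single task instead of being split.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Reuse TextInputFormat's record reading, but refuse to split files,
// so one input file maps to exactly one partition/task.
class WholeFileTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}

// Usage (new Hadoop API):
// sc.newAPIHadoopFile(path, classOf[WholeFileTextInputFormat],
//                     classOf[org.apache.hadoop.io.LongWritable],
//                     classOf[org.apache.hadoop.io.Text])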
Any suggestions? I'm thinking of submitting a feature request for mutable
broadcast. Is it doable?
Hi,
Thank you for your advice. It really might work, but to specify my problem a
bit more: think of my data as one parsed Wikipedia page per generated item.
I am getting this generator from the parser, and I don't want to save it
to storage, but directly apply parallelize and crea
Hi,
The partition information held by the Spark master is not updated after a stage
failure. In the case of HDFS, Spark gets partition information from the InputFormat,
and if an HDFS data node is down when Spark is performing the computation for
a certain stage, that stage will fail and be resubmitted using the sam
Unfortunately not. Again, I wonder if adding support targeted at this
"small files problem" would make sense for Spark core, as it is a common
problem in our space.
Right now, I don't know of any other options.
Nick
On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn wrote:
> Nicholas, thanks for the
Is the RDD partition index you get when you call mapPartitionsWithIndex
consistent under fault-tolerance conditions?
I.e.
1. Say the index is 1 for one of the partitions when you call
data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1
2. The partition fails (maybe along with a bunch o
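For context, a small usage sketch of mapPartitionsWithIndex; it only shows the API, and does not by itself answer whether the index stays stable when a partition is recomputed, which is exactly the question above.

val data = sc.parallelize(1 to 100, 4)
val tagged = data.mapPartitionsWithIndex { (index, rows) =>
  rows.map(r => (index, r))   // tag each row with the index of its partition
}
tagged.take(8).foreach(println)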
Here's an example:
https://github.com/OryxProject/oryx/blob/master/oryx-lambda/src/main/java/com/cloudera/oryx/lambda/BatchLayer.java#L131
On Mon, Oct 6, 2014 at 7:39 PM, Abraham Jacob wrote:
> Hi All,
>
> Would really appreciate from the community if anyone has implemented the
> saveAsNewAPIHad
TD has addressed this. It should be available in 1.2.0.
https://issues.apache.org/jira/browse/SPARK-3495
On Thu, Oct 2, 2014 at 9:45 AM, maddenpj wrote:
> I am seeing this same issue. Bumping for visibility.
Hello,
This may seem like a strange question, but is it possible to combine
Python and Java code in a single application?
I.e., I would like to create the SparkContext in Java and then perform
RDD transformations in Python, or use parts of MLlib in Python and
others in Java.
One solution that
Hi All,
Would really appreciate from the community if anyone has implemented the
saveAsNewAPIHadoopFiles method in "Java" found in the
org.apache.spark.streaming.api.java.JavaPairDStream
Any code snippet or online link would be greatly appreciated.
Regards,
Jacob
Nicholas, thanks for the tip. Your suggestion certainly seemed like the
right approach, but after a few days of fiddling I've come to the
conclusion that s3distcp will not work for my use case. It is unable to
flatten directory hierarchies, which I need because my source directories
contain hour/mi
No, not yet. Only Hive UDAFs are supported.
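For reference, a sketch of what is available in 1.1 (table and function names are hypothetical): scalar UDFs via registerFunction, while aggregates are limited to the built-ins or to Hive UDAFs through HiveContext.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class Word(w: String)
sc.parallelize(Seq(Word("spark"), Word("sql"))).registerTempTable("words")

// A scalar UDF works; a Scala-defined aggregate cannot be registered this way yet.
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("SELECT w, len(w) FROM words").collect().foreach(println)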
On Mon, Oct 6, 2014 at 2:18 AM, Pei-Lun Lee wrote:
> Hi,
>
> Does spark sql currently support user-defined custom aggregation function
> in scala like the way UDF defined with sqlContext.registerFunction? (not
> hive UDAF)
>
> Thanks,
> --
> Pei-Lun
>
Weird, it seems like this is trying to use the SparkContext before it's
initialized, or something like that. Have you tried unrolling this into a
single method? I wonder if you just have multiple versions of these libraries
on your classpath or something.
Matei
On Oct 4, 2014, at 1:40 PM, Mari
After disabling the client-side authorization, and with nothing in
SPARK_CLASSPATH, I am still getting a class-not-found error.
<property>
  <name>hive.security.authorization.enabled</name>
  <value>false</value>
  <description>Perform authorization checks on the client</description>
</property>
Am I hitting a dead end? Please help.
spark-sql> use mydb;
OK
Time taken: 4.5
One difference I can find is that you may have different kernel functions for your
training. In Spark, you end up using a linear kernel, whereas for scikit-learn you
are using an RBF kernel. That can explain the difference in the coefficients
you are getting.
On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <
adamantio
Hi again,
Finally, I found the time to play around with your suggestions.
Unfortunately, I noticed some unusual behavior in the MLlib results, which
is more obvious when I compare them against their scikit-learn equivalents.
Note that I am currently using Spark 0.9.2. Long story short: I find it
di
sc.parallelize() distributes a list of data into a number of partitions, but a
generator cannot be cut up and serialized automatically.
If you can partition your generator, then you can try this:
sc.parallelize(range(N), N).flatMap(lambda x: generate_partition(x))
for example, if you want to generate xrange
We have used the strategy that you suggested, Andrew - using many workers
per machine and keeping the heaps small (< 20gb).
Using a large heap resulted in workers hanging or not responding (leading
to timeouts). The same dataset/job for us will fail (most often due to
akka disassociated or fetch
Hi,
I would like to ask if it is possible to use a generator that generates data
bigger than the total RAM across all the machines as the input for sc =
SparkContext(), sc.parallelize(generator). I would like to create an RDD this way.
When I am trying to create an RDD by sc.textFile(file), where file ha
I have created
https://issues.apache.org/jira/browse/SPARK-3814
https://issues.apache.org/jira/browse/SPARK-3815
Will probably try my hand at 3814, seems like a good place to get started...
On Fri, Oct 3, 2014 at 3:06 PM, Michael Armbrust
wrote:
> Thanks for digging in! These both look like t
Arko,
It would be useful to know more details on the use case you are trying to
solve. As Tobias wrote, Spark Streaming works on DStream, which is a
continuous series of RDDs.
Do check out performance tuning :
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tunin
Doesn't Spark keep track of the DAG lineage and start from where it stopped?
Does it always have to start from the beginning of the lineage when the job
fails?
From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jah
Hi,
Does spark sql currently support user-defined custom aggregation function
in scala like the way UDF defined with sqlContext.registerFunction? (not
hive UDAF)
Thanks,
--
Pei-Lun
additional logs here.
14/10/06 18:06:16 DEBUG HeartbeatReceiver: [actor] received message
Heartbeat(11,[Lscala.Tuple2;@630ff939,BlockManagerId(11, cluster12, 48053,
0)) from Actor[akka.tcp://sparkExecutor@cluster12:57686/temp/$e]
14/10/06 18:06:16 DEBUG BlockManagerMasterActor: [actor] received me
From the Spark Streaming Programming Guide (
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node
):
*...output operations (like foreachRDD) have at-least once semantics, that
is, the transformed data may get written to an external entity more than
once in
Given that I have multiple worker nodes, when Spark schedules the job again
on the worker nodes that are alive, does it then store the data in
Elasticsearch and then Flume again, or does it only run the functions that store to Flume?
Regards,
Madhu Jahagirdar
From
Hi
You can basically have a REST API to collect the data from the Android devices
(possibly store it in a NoSQL DB, or put it in Kafka, etc.) and then use
Spark Streaming to process the data.
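As a sketch of that pipeline (the topic, ZooKeeper quorum, and batch interval are assumptions), once the REST layer publishes the device events to Kafka:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("device-events")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical ZooKeeper quorum, consumer group and topic.
val events = KafkaUtils.createStream(
  ssc, "zk1:2181", "device-consumers", Map("device-events" -> 2))

events.map(_._2).count().print()   // e.g. number of events per 10-second batch
ssc.start()
ssc.awaitTermination()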
Thanks
Best Regards
On Sat, Oct 4, 2014 at 12:13 PM, ll wrote:
> any comment/feedba
AFAIK Spark doesn't restart worker nodes itself. You can have multiple
worker nodes, and in that case, if one worker node goes down, Spark will
try to recompute the lost RDDs with the workers that are still alive.
Thanks
Best Regards
On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <
madhu
Thanks MLnick,
I fixed the error.
First I compiled Spark with the original version; later I downloaded this pom file
into the examples folder:
https://github.com/tedyu/spark/commit/70fb7b4ea8fd7647e4a4ddca4df71521b749521c
Then I recompiled with Maven.
mvn -Dhbase.profile=hadoop-provided -Phadoop-2.4 -Dha