Hi,
I am using CDH5.3.1
I am getting the below error; even the Spark context is not getting created.
I am submitting my job like this:
spark-submit --jars
I have copied hive-site.xml to the Spark conf folder:
cp /etc/hive/conf/hive-site.xml /usr/lib/spark/conf
This one is CDH-specific and is already answered in the forums, so I'd
go there instead.
Ex:
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-sql-and-Hive-tables/td-p/22051
On Mon, Mar 9, 2015 at 12:33 PM, sachin Singh sachin.sha...@gmail.com wrote:
Hi,
I am using
Hi All,
I have a lot of Parquet files, and I would like to open them directly instead of
loading them into an RDD in the driver (so I can optimize performance through
special logic).
I did some research online and can't find any example of accessing Parquet
directly from Scala. Has anyone done this
Hi Jaonary,
The RowPartitionedMatrix is a special case of the BlockMatrix, where the
colsPerBlock = nCols. I hope that helps.
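A minimal sketch of that special case (toy data; the 1024 rowsPerBlock is an
arbitrary assumption, not from this thread):

  import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

  // With colsPerBlock equal to the full column count, every block spans all
  // columns, so each block is effectively a row partition.
  val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 3.0)))
  val coord = new CoordinateMatrix(entries)
  val nCols = coord.numCols().toInt
  val rowPartitioned: BlockMatrix =
    coord.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = nCols)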
Burak
On Mar 6, 2015 9:13 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi Shivaram,
Thank you for the link. I'm trying to figure out how I can port this to
mllib.
+user
On Mar 9, 2015 8:47 AM, Burak Yavuz brk...@gmail.com wrote:
Hi,
In the web UI, you don't see every single task. You see the name of the
last task before the stage boundary (which is a shuffle like a groupByKey),
which in your case is a flatMap. Therefore you only see flatMap in the UI.
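A toy illustration of that naming (the input path is assumed):

  // flatMap is the last transformation before the groupByKey shuffle,
  // so the web UI labels the whole first stage "flatMap".
  val grouped = sc.textFile("data.txt")
    .flatMap(line => line.split(" ").map(w => (w, 1)))
    .groupByKey()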
See http://search-hadoop.com/m/JW1q5AneoE1
Cheers
On Mon, Mar 9, 2015 at 7:29 AM, rok rokros...@gmail.com wrote:
I'm using log aggregation on YARN with Spark and I am not able to see the
logs through the YARN web UI after the application completes:
Failed redirect for
Hi All,
If I have a set of time series data files, they are in Parquet format and
the data for each day are stored using a naming convention, but I will not know
how many files there are for one day.
20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
.
201501010a.parq
.
I got past the issue with the cluster-not-started problem by adding Yarn
to mapreduce.framework.name.
But when I try to run distcp, if I use a URI with s3://path to my bucket I
get an invalid path even though the bucket exists.
If I use s3n:// it just hangs.
Did anyone else face anything like
I'm basically running a sorting using spark. The spark program will read from
HDFS, sort on composite keys, and then save the partitioned result back to
HDFS.
pseudo code is like this:
input = sc.textFile(...)
pairs = input.mapToPair(...)
sorted = pairs.sortByKey()
values = sorted.values()
Hi,
In my spark application I queried a hive table and tried to take only
one record, but got java.lang.RuntimeException: Couldn't find function Some
val rddCoOrd = sql("SELECT date, x, y FROM coordinate where order
by date limit 1")
val resultCoOrd = rddCoOrd.take(1)(0)
Any ideas? I
Hello,
I am working on a project where we want to split graphs of data into
snapshots across partitions and I was wondering what would happen if one of
the snapshots we had was too large to fit into a single partition. Would the
snapshot be split over the two partitions equally, for example, and
From: Saba Sehrish ssehr...@fnal.gov
Date: March 9, 2015 at 4:11:07 PM CDT
To: user-...@spark.apache.org
Subject: Using top, takeOrdered, sortByKey
I am using spark for a template matching problem. We have 77 million events in
the
This is a Java problem, not really Spark.
From this page:
http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u
You can see that using java.nio.* on JDK 7 will fix this issue. But the Path
class in Hadoop uses java.io.* instead of
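A hedged sketch of the java.nio approach on JDK 7 (the UNC host and file are
made up for illustration; this mapping applies on Windows, where a file URL
with a host component resolves to a UNC path):

  import java.net.URI
  import java.nio.file.Paths

  // java.nio.file can resolve a file URL that carries a host component,
  // which java.io.File cannot do.
  val uri = new URI("file://10.196.119.230/folder1/abc.txt")
  val path = Paths.get(uri)
  println(path)  // \\10.196.119.230\folder1\abc.txt on Windows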
Hi,
I was launching a spark cluster with 4 worker nodes, each worker node containing
8 cores and 56gb of ram, and I was testing my logistic regression problem.
The training set is around 1.2 million records. When I was using 2**10
(1024) features, the whole program works fine, but when I use 2**14
Yarn + Spark:
I am running my spark job (on yarn) on a 6 data node cluster of 512GB each. I
was having a tough time configuring it since the job would hang in one or more
tasks on any of the executors for an indefinite time. The stage can be as
simple as an rdd count. And the bottleneck point is not always
I am running Spark on Windows 2008 R2. I use sc.textFile() to load a text file
using a UNC path, but it does not work.
sc.textFile("file://10.196.119.230/folder1/abc.txt", 4).count()
Input path does not exist: file:/10.196.119.230/folder1/abc.txt
org.apache.hadoop.mapred.InvalidInputException:
You need to change `== 1` to `== i`. `println(t)` happens on the
workers, which may not be what you want. Try the following:
noSets.filter(t => model.predict(Utils.featurize(t)) == i)
  .collect().foreach(println)
-Xiangrui
On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb
richard.pierce.l...@gmail.com
Spark Streaming has StreamingContext.socketStream()
http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String, int, scala.Function1, org.apache.spark.storage.StorageLevel, scala.reflect.ClassTag)
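A minimal sketch using the simpler text variant, socketTextStream (host, port,
and batch interval are assumptions):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))
  // Reads newline-delimited UTF-8 text from a TCP socket.
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.print()
  ssc.start()
  ssc.awaitTermination()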
TD
On Mon, Mar 9, 2015 at 11:37 AM,
cache() is lazy. The data is stored into memory after the first time
it gets materialized. So the first time you call `predict` after you
load the model back from HDFS, it still takes time to load the actual
data. The second time will be much faster. Or you can call
`userJavaRDD.count()` and
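A minimal sketch of that warm-up (userJavaRDD as named above; everything else
about the job is omitted):

  // cache() only marks the RDD; count() forces the first materialization,
  // so subsequent predict() calls read from memory.
  userJavaRDD.cache()
  userJavaRDD.count()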
Hi, currently most of the data in our production is stored as Avro + Snappy. I want
to test the benefits if we store the data in Parquet format. I changed our
ETL to generate the Parquet format, instead of Avro, and want to test a simple
sql in Spark SQL, to verify the benefits from Parquet.
I
Hi,
I am trying to join data based on the latitude and longitude. I have
reference data which has city information with their latitude and longitude.
I have a data source with user information with their latitude and
longitude. I want to find the nearest city to the user's latitude and
Dear all,
Could you send me a list of input data sources that Spark Streaming
supports?
My list is HDFS, Kafka, textfile?…
I am wondering if Spark Streaming could directly read data from a certain port
(e.g. 443) that my devices send to directly?
Best regards,
Cui Lin
Does a code flow similar to the following work for you? It processes each
partition of an RDD sequentially:
var iterPartition = 0
while (iterPartition < RDD.partitions.length) {
  val res = sc.runJob(this, (it: Iterator[T]) => someFunc, Seq(iterPartition),
    allowLocal = true)
  // some other function after processing
  iterPartition += 1
}
Hi All,
I am processing some time series data. For one day, it might have 500GB; then
for each hour, it is around 20GB of data.
I need to sort the data before I start processing. Assume I can sort them
successfully
dayRDD.sortByKey
but after that, I might have thousands of partitions (to
Hi,
Sorry to ask this, but how do I compute the sum of 2 (or more) mllib
SparseVectors in Python?
Thanks,
Ron
I have some applications developed using PHP and currently we have a problem
in connecting these applications to the Spark SQL thrift server. (Here is the
problem I am talking about:
http://apache-spark-user-list.1001560.n3.nabble.com/Connection-PHP-application-to-Spark-Sql-thrift-server-td21925.html
Please find the query plan:
scala> sqlContext.sql("SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS
AVG_SDP_USAGE FROM (SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE)
AS SDP_USAGE FROM (SELECT * FROM date_d AS dd JOIN interval_f AS intf ON
intf.DATE_WID = dd.WID WHERE intf.DATE_WID =
Hi,
well, it really depends on what you want to do ;)
TF-IDF is a measure that originates in the information retrieval context
and that can be used to judge the relevancy of a document in context of a
given search term.
It's also often used for text-related machine learning tasks. E.g. have a
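For reference, a hedged sketch of the basic flow from the mllib
feature-extraction guide (input file and whitespace tokenization are assumed):

  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Each document becomes a sparse vector of hashed term frequencies;
  // IDF then down-weights terms that appear in many documents.
  val docs: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)
  val tf = new HashingTF().transform(docs)
  tf.cache()
  val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)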
I do have a schemaRDD where I want to group by a given field F1, but I want
the result to be not a single row per group but multiple rows per group,
keeping only the rows that have the top N F2 field values.
The issue is that the groupBy operation is an aggregation of multiple rows
to a
Hi All,
What are the default values for the following conf properties if we don't
set them in the conf file?
# spark.history.fs.updateInterval 10
# spark.history.retainedApplications 500
Regards,
Srini.
Hi,
Vertices are simply hash-partitioned by their 64-bit IDs, so
they are evenly spread over partitions.
As for edges, GraphLoader#edgeListFile builds edge partitions
through hadoopFile(), so the initial partitions depend
on InputFormat#getSplits implementations
(e.g., partitions are mostly equal to
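A small illustration (the edge-list path is assumed):

  import org.apache.spark.graphx.GraphLoader

  // The number of initial edge partitions follows the input splits of the
  // underlying hadoopFile() call described above.
  val graph = GraphLoader.edgeListFile(sc, "followers.txt")
  println(graph.edges.partitions.length)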
Link to custom receiver guide
https://spark.apache.org/docs/latest/streaming-custom-receivers.html
On Mon, Mar 9, 2015 at 5:55 PM, Shao, Saisai saisai.s...@intel.com wrote:
Hi Lin,
AFAIK, currently there’s no built-in receiver API for RDBMs, but you can
customize your own receiver to get
Hi Lin,
AFAIK, currently there's no built-in receiver API for RDBMs, but you can
customize your own receiver to get data from RDBMs, for the details you can
refer to the docs.
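For illustration, a minimal custom-receiver sketch along the lines of the
guide (the JDBC URL, query, and single-column string result are assumptions):

  import java.sql.DriverManager
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class JdbcReceiver(url: String, query: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

    def onStart(): Unit = {
      // Receive on a separate thread; onStart() must not block.
      new Thread("JDBC Receiver") {
        override def run(): Unit = {
          val conn = DriverManager.getConnection(url)
          try {
            val rs = conn.createStatement().executeQuery(query)
            while (!isStopped() && rs.next()) {
              store(rs.getString(1))  // hand each row to Spark Streaming
            }
          } finally conn.close()
        }
      }.start()
    }

    def onStop(): Unit = {}  // the receiving thread checks isStopped()
  }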
Thanks
Jerry
From: Cui Lin [mailto:cui@hds.com]
Sent: Tuesday, March 10, 2015 8:36 AM
To: Tathagata Das
Cc:
Hi Yong
Thanks for the reply. Yes, it works with a local drive letter. But I really need
to use a UNC path because the path is input at runtime. I cannot dynamically
assign a drive letter to an arbitrary UNC path at runtime.
Is there any workaround so I can use a UNC path for sc.textFile(...)?
No, I don't have two master instances.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: March 9, 2015 15:03
To: Dai, Kevin
Cc: user@spark.apache.org
Subject: Re: A strange problem in spark sql join
Make sure you don't have two master instances running on the same machine. It
could happen
Hi,
I read this page,
http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am
wondering, how do I use this TF-IDF RDD? What does this TF-IDF vector look
like?
Can someone provide me some guide?
Thanks,
Xi Shen
about.me/davidshen
Hi, guys
I encountered a strange problem as follows:
I joined two tables (which are both parquet files) and then did a groupby. The
groupby took 19 hours to finish.
However, when I killed this job twice in the groupby stage, the third try will succeed.
But after I killed this job and ran it again, it
Hi,
I used the method on this
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/train.html
page to save my k-means model.
But now, I have no idea how to load it back... I tried
sc.objectFile("/path/to/data/file/directory/")
But I got this error:
Make sure you don't have two master instances running on the same machine.
It could happen like this: you were running the job and in the middle you tried
to stop the cluster, which didn't completely stop it, and you did a
start-all again, which will eventually end up having 2 master instances
running,
You would have needed to configure it by
setting yarn.scheduler.capacity.resource-calculator to something ending in
DominantResourceCalculator. If you haven't configured it, there's a high
probability that the recently committed
https://issues.apache.org/jira/browse/SPARK-6050 will fix your
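For reference, a sketch of that setting in capacity-scheduler.xml (the class
name below is the stock Hadoop DominantResourceCalculator):

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>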
You see, the core of ALS 1.0.0 is the following code:
there should be flatMap and groupByKey when running ALS iterations, right?
but when I run an ALS iteration, there are ONLY flatMap tasks...
do you know why?
private def updateFeatures(
products: RDD[(Int,
Did you try something like:
myRDD.saveAsObjectFile("tachyon://localhost:19998/Y")
val newRDD = sc.objectFile[MyObject]("tachyon://localhost:19998/Y")
Thanks
Best Regards
On Sun, Mar 8, 2015 at 3:59 PM, Yijie Shen henry.yijies...@gmail.com
wrote:
Hi,
I would like to share an RDD in several Spark
Hi,
We wrote a Spark Streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that I'm not utilizing data locality,
because any worker might open any file without giving precedence to data
locality.
I can't open the files
Hi,
I am trying to run examples of spark (master branch from git) from
IntelliJ (14.0.2) but facing errors. These are the steps I followed:
1. git clone the master branch of apache spark.
2. Build it using mvn -DskipTests clean install
3. In IntelliJ select Import Project and choose the POM.xml