Hello Marcelo Vanzin,
Can you explain a bit more on this? I tried using client mode, but can you
explain how I can use this port to write the log or output to it?
Thanks in advance!
Hi Kevin,
I tried it on Spark 1.0.0 and it works fine.
It's a bug in Spark 1.0.1 ...
Thanks,
Victor
Aha, that makes sense. Thanks for the response! I guess one of the
areas Spark could use some love is in error messages (:
On Fri, Jul 18, 2014 at 9:41 PM, Michael Armbrust
mich...@databricks.com wrote:
Sorry for the non-obvious error message. It is not valid SQL to include
attributes in the
Hi Michael,
Thanks for the suggestion. In my query, both tables are too large to use a
broadcast join.
When SPARK-2211 is done, will Spark SQL automatically choose join
algorithms?
Is there some way to manually hint the optimizer?
2014-07-19 5:23 GMT+08:00 Michael Armbrust mich...@databricks.com:
It seems MLlib doesn't currently support weighted training; training samples
have equal importance. Weighted training can be very useful to reduce data
size and speed up training.
Do you have plans to support it in the future? The data format would be something
like:
label:weight index1:value1
Hi,
I'm running LogisticRegression from MLlib, but I can't see the RDD information
in the Storage panel.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/a.png
http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/b.png
I am getting the same issue.
Spark version: 1.0
On 21/07/14 4:16 PM, binbinbin915 binbinbin...@live.cn wrote:
Hi,
I'm running LogisticRegression from MLlib, but I can't see the RDD
information
from the Storage panel.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n10296/a.png
Thanks so much, Ankur :))
Excuse me, but I am wondering:
(for a chosen partition strategy for my application)
1.1) How can I check the size of each partition? Is there any API or log file?
1.2) How can I check the processing cost of each partition (time, memory, etc.)?
2.1) and the global
Hi,
I am invoking the spark-shell (Spark 1.0.0) with:
spark-shell --jars \
libs/aws-java-sdk-1.3.26.jar,\
libs/httpclient-4.1.1.jar,\
libs/httpcore-nio-4.1.jar,\
libs/gson-2.1.jar,\
libs/httpclient-cache-4.1.1.jar,\
libs/httpmime-4.1.1.jar,\
libs/hive-dynamodb-handler-0.11.0.jar,\
Hi, all
When I try hiveContext.hql("drop table if exists abc"), where abc is a non-existent
table,
I still receive an exception about the non-existent table even though "if exists" is there.
The same statement runs well in the Hive shell.
Some feedback from the Hive community is here:
For those curious I used the JavaSparkContext and got access to an
AvroSequenceFile (wrapper around Sequence File) using the following:
file = sc.newAPIHadoopFile("hdfs path to my file",
AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
new Configuration())
Hi Yifan
This works for me:
export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
export SPARK_MEM=40g
./spark-shell
Regards
On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI iamyifa...@gmail.com wrote:
Hi,
I am trying to load
Hello guys,
I'm just trying out Spark Streaming features.
I noticed that there is a join example for filtering spam, so I just wanted
to try it.
But nothing happens after the join; the output JavaPairDStream content is
the same as before.
So, are there any examples that I can refer to?
Thanks for any
Hi Experts,
I set up a YARN and Spark environment: all services run on a single node. I then
submitted a WordCount job using the spark-submit script with the
command: ./bin/spark-submit tests/wordcount-spark-scala.jar --class
scala.spark.WordCount --num-executors 1 --driver-memory 300M --executor-memory
300M
Thanks, Abel.
Best,
Yifan LI
On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas acoronadoirue...@gmail.com
wrote:
Hi Yifan
This works for me:
export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
export
Hello fellow Sparkians.
In https://groups.google.com/d/msg/spark-users/eXV7Bp3phsc/gVgm-MdeEkwJ,
Matei suggested that Spark might get deferred grouping and forced execution
of multiple jobs in an efficient way. His code sample:
rdd.reduceLater(reduceFunction1) // returns Future[ResultType1]
So I think I may end up using Hourglass
(https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop),
a Hadoop framework for incremental data processing. It would be very cool if
Spark (not Streaming) could support something like this.
Hi,
I am using PySpark and have persisted a list of RDDs within a function, but
I don't have a reference to them anymore. The RDDs are listed in the UI,
under the Storage tab, and they have names associated with them (e.g. 4). Is
it possible to access these RDDs to unpersist them?
Thanks!
Hello,
Currently I work on a project in which:
I spawn a standalone Apache Spark MLlib job, in standalone mode, from a
running Java process.
In the code of the Spark job I have the following code:
SparkConf sparkConf = new SparkConf().setAppName("SparkParallelLoad");
Hi Victor,
Instead of importing sqlContext.createSchemaRDD, can you explicitly call
sqlContext.createSchemaRDD(rdd) to create a SchemaRDD?
For example,
You have a case class Record.
case class Record(data_date: String, mobile: String, create_time: String)
Then, you create an RDD[Record] and
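A minimal sketch of that explicit call (assuming a running SparkContext sc, e.g. in spark-shell; the input file records.txt and the comma-separated layout are hypothetical):

import org.apache.spark.sql.SQLContext

case class Record(data_date: String, mobile: String, create_time: String)

val sqlContext = new SQLContext(sc)
val rdd = sc.textFile("records.txt")               // hypothetical input
  .map(_.split(","))
  .map(r => Record(r(0), r(1), r(2)))
// call createSchemaRDD explicitly instead of importing the implicit conversion
val schemaRDD = sqlContext.createSchemaRDD(rdd)
schemaRDD.registerAsTable("records")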
Hi TD,
Thanks for the help.
The only problem left here is that the dstreamTime contains some extra
information, which seems to be a date, i.e. 1405944367000 ms, whereas my application
timestamps are just in seconds, which I converted
to ms, e.g. 2300, 2400, 2500, etc. So the filter doesn't take effect.
I
That is just standard Unix time.
1405944367000 = Sun, 09 Aug 46522 05:56:40 GMT
On Mon, Jul 21, 2014 at 5:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Hi TD,
Thanks for the help.
The only problem left here is that the dstreamTime contains some extra
information, which seems to be a date, i.e.
Uh, right. I mean:
1405944367 = Mon, 21 Jul 2014 12:06:07 GMT
On Mon, Jul 21, 2014 at 5:47 PM, Sean Owen so...@cloudera.com wrote:
That is just standard Unix time.
1405944367000 = Sun, 09 Aug 46522 05:56:40 GMT
On Mon, Jul 21, 2014 at 5:43 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Hi
Thank you Abel,
It seems that your advice worked. Even though I receive a message that it
is a deprecated way of defining Spark Memory (the system prompts that I
should set spark.driver.memory), the memory is increased.
Again, thank you,
Nick
On Mon, Jul 21, 2014 at 9:42 AM, Abel Coronado
It is really nice that Spark RDDs provide functions that are often
equivalent to functions found in Scala collections. For example, I can
call:
myArray.map(myFx)
and equivalently
myRdd.map(myFx)
Awesome!
My question is this: Is it possible to write code that works on either
an RDD or
It's really more of a Scala question than a Spark question, but the standard OO
(not Scala-specific) way is to create your own custom supertype (e.g.
MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD
and MyArray), each of which manually forwards method calls to the
heya,
Without a bit of gymnastics at the type level, nope. Actually RDD doesn't
share any functions with the Scala collections library (the simple reason I could see is
that Spark's ones are lazy, while the default implementations in Scala
aren't).
However, it'd be possible by implementing an implicit converter
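For what it's worth, a minimal sketch of the wrapper-trait approach mentioned above (MyCollection, MyRDD, and MyArray are made-up names; only map is shown, and other methods would be forwarded the same way):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

trait MyCollection[T] {
  def map[U: ClassTag](f: T => U): MyCollection[U]
}

class MyRDD[T](rdd: RDD[T]) extends MyCollection[T] {
  // forwards to the lazy RDD implementation
  def map[U: ClassTag](f: T => U): MyCollection[U] = new MyRDD(rdd.map(f))
}

class MyArray[T](xs: Seq[T]) extends MyCollection[T] {
  // forwards to the eager Scala collection implementation
  def map[U: ClassTag](f: T => U): MyCollection[U] = new MyArray(xs.map(f))
}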
I have the same error! Did you manage to fix it?
Instead of using union, can you try sqlContext.parquetFile("/user/
hive/warehouse/xxx_parquet.db").registerAsTable("parquetTable")?
Then, var all = sql("select some_id, some_type, some_time from
parquetTable").map(line
=> (line(0), (line(1).toString, line(2).toString.substring(0, 19
Thanks,
Yin
Hi All,
Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't
work due to the following two libraries, which are not consistent with Hive 0.12
or Hadoop. (Hive libs align with Hadoop libs, and as a common
practice they should be consistent to be interoperable.)
I haven't seen anyone actively 'unwilling' -- I hope not. See
discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I
sketch what a downgrade means. I think it just hasn't gotten a proper look
yet.
Contrary to what I thought earlier, the conflict does in fact cause
problems in theory,
Hi Sam,
Did you specify the MASTER in your spark-env.sh? I ask because I didn't see
a --master in your launch command. Also, your app seems to take in a master
(yarn-standalone). This is not exactly correct because by the time the
SparkContext is launched locally, which is the default, it is too
Thanks Michael,
That is one solution that I had thought of. It seems like a bit of
overkill for the few methods I want to do this for - but I will think
about it. I guess I was hoping that I was missing something more
obvious/easier.
Philip
On 07/21/2014 11:20 AM, Michael Malak wrote:
Hi all,
This error happens because we receive a completed event for a particular
stage that we don't know about, i.e. a stage we haven't received a
submitted event for. The root cause of this, as Baoxu explained, is
usually because the event queue is full and the listener begins to drop
events.
Dear all,
Is there any example of mapPartitions that forks an external process, or of how to
make RDD.pipe work on every element of a partition?
Cheers,
Jaonary
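For what it's worth, a minimal sketch of the RDD.pipe route (assuming a running SparkContext sc; "wc -l" is just a stand-in for any external program that reads lines on stdin and writes lines on stdout):

// each partition's elements are written, one per line, to the stdin of one
// process instance; the process's stdout lines become the output partition
val data = sc.parallelize(1 to 100, 4).map(_.toString)
val piped = data.pipe("wc -l")
piped.collect().foreach(println)   // here: one line count per partition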
Hi Maria,
If you don't have a reference to a persisted RDD, it will be automatically
unpersisted on the next GC by the ContextCleaner. This is implemented for
Scala, but should still work in Python, because Python uses reference
counting to clean up objects that are no longer strongly referenced.
Hi Nick and Abel,
Looks like you are requesting 8g for your executors, but only allowing 2g
on the workers. You should set SPARK_WORKER_MEMORY to at least 8g if you
intend to use that much memory in your application. Also, you shouldn't
have to set SPARK_DAEMON_JAVA_OPTS; you can just set
Hi,
unfortunately it is not so straightforward.
xxx_parquet.db
is the folder of a managed database created by Hive/Impala, so every sub-
element in it is a table in Hive/Impala. They are folders in HDFS, each
table has a different schema, and in its folder there are one or more Parquet
files.
Hi Rindra,
Depending on what you're doing with your groupBy, you may end up inflating
your data quite a bit. Even if your machine has 16G, by default spark-shell
only uses 512M, and the amount used for storing blocks is only 60% of that
(spark.storage.memoryFraction), so this space becomes ~300M.
but at least if users want to access the persisted RDDs, they can use
sc.getPersistentRDDs in the same context.
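A minimal sketch of that, using the Scala API (I am not sure this method is exposed in the Python API of that release):

// sc.getPersistentRDDs returns a Map[Int, RDD[_]] of id -> persisted RDD
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println("unpersisting RDD " + id)
  rdd.unpersist()
}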
Hi Tathagata,
I am currently creating multiple DStreams to consume from different topics.
How can I let each consumer consume from different partitions? I found the
following parameters in the Spark API:
createStream[K, V, U <: Decoder[_], T <: Decoder[_]](jssc:
JavaStreamingContext
I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 1-65535. I'm not
sure if there's a more specific range I could have used, but so far,
everything is running!
Thanks for all the responses Marcelo and Andrew!!
Matt
I would like to programmatically start a Spark cluster in EC2 from another
app running in EC2, run my job, and then destroy the cluster. I can launch a
Spark EMR cluster easily enough using the SDK; however, I ran into two
problems:
1) I was only able to retrieve the address of the master node from
This is a useful feature but it may be hard to have it in v1.1 due to
limited time. Hopefully, we can support it in v1.2. -Xiangrui
On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote:
It seems MLlib right now doesn't support weighted training, training samples
have
What's the exception you're seeing? Is it an OOM?
On Mon, Jul 21, 2014 at 11:20 AM, chutium teng@gmail.com wrote:
Hi,
unfortunately it is not so straightforward
xxx_parquet.db
is a folder of managed database created by hive/impala, so, every sub
element in it is a table in
Also, I am facing one issue. If I run the program in yarn-cluster mode it
works absolutely fine, but if I change it to yarn-client mode I get the
error below.
Application application_1405471266091_0055 failed 2 times due to AM
Container for appattempt_1405471266091_0055_02 exited with
no, something like this
14/07/20 00:19:29 ERROR cluster.YarnClientClusterScheduler: Lost executor 2
on 02.xxx: remote Akka client disassociated
...
...
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Lost TID 832 (task 1.2:186)
14/07/20 00:21:13 WARN scheduler.TaskSetManager: Loss was
Hi,
Our application is required to do some aggregations on data that will be
coming as a stream for over two months. I would like to know if Spark
Streaming will be suitable for our requirement. After going through some
documentation and videos, I think we can do aggregations on data based on
Hi, all
When I run a Spark application (actually a unit test of the application in
Jenkins), I found that I always hit a FileNotFoundException when reading a
broadcast variable.
The program itself works well, except in the unit test.
Here is an example log:
14/07/21 19:49:13 INFO
When SPARK-2211 is done, will Spark SQL automatically choose join
algorithms?
Is there some way to manually hint the optimizer?
Ideally we will select the best algorithm for you. We are also considering
ways to allow the user to hint.
You will have to use some function that converts between the dstreamTime (ms since
the epoch, in the same format as returned by System.currentTimeMillis) and your
application-level time.
TD
On Mon, Jul 21, 2014 at 9:47 AM, Sean Owen so...@cloudera.com wrote:
Uh, right. I mean:
1405944367 = Mon, 21 Jul 2014
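A minimal sketch of such a conversion, assuming the application timestamps are seconds counted from some known stream start time (streamStartMs below is hypothetical):

// dstreamTimeMs is milliseconds since the Unix epoch (System.currentTimeMillis format);
// subtract the stream's start time and divide to get application-level seconds
def toAppSeconds(dstreamTimeMs: Long, streamStartMs: Long): Long =
  (dstreamTimeMs - streamStartMs) / 1000L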
Just to confirm, are you interested in submitting the Spark job from inside the
cluster in Spark standalone mode (that is, one of the workers will be
running the driver)? spark-submit does not fully support that yet.
You can probably use the instructions present in Spark 0.9.1 to do that.
Could you share your code snippet so that we can take a look?
TD
On Mon, Jul 21, 2014 at 7:23 AM, hawkwang wanghao.b...@gmail.com wrote:
Hello guys,
I'm just trying to use spark streaming features.
I noticed that there is join example for filtering spam, so I just want to
try.
But,
Hi, TD,
Thanks for the reply.
I tried to reproduce this in a simpler program, but no luck.
However, the program has been very simple: just load some files from HDFS and
write them to HBase…
---
It seems that the issue only appears when I run the unit test in Jenkins (it does not
fail every time,
Hi,
I am new to Spark, have used Hadoop for some time, and just joined the
mailing list.
I am considering using Spark in my application, reading data from Cassandra
in Python and writing mapped data back to Cassandra or to ES afterwards.
The first question I have is: Is it possible to use
Hi, TD,
I think I got more insight into the problem.
In the Jenkins test file, I mistakenly passed a wrong value to spark.cores.max,
which is much larger than the expected value
(I passed the master address as local[6], and spark.cores.max as 200).
If I set a more consistent value, everything goes
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching clusters.
On Jul 20, 2014, at 2:25 PM, Chris DuBois chris.dub...@gmail.com wrote:
I pulled the latest last night. I'm on commit 4da01e3.
On Sun, Jul 20, 2014 at 2:08 PM, Matei
Ah, I see,
thanks, Yin
--
Nan Zhu
On Monday, July 21, 2014 at 5:00 PM, Yin Huai wrote:
Hi Nan,
It is basically a log entry because your table does not exist. It is not a
real exception.
Thanks,
Yin
On Mon, Jul 21, 2014 at 7:10 AM, Nan Zhu zhunanmcg...@gmail.com
Hello,
I have only just started playing around with Spark to see if it fits my
needs. I was trying to read some data from Elasticsearch as an RDD, so that
I could perform some Python-based analytics on it. I am unable to create the
RDD object as of now, failing with a serialization error.
That is definitely weird. spark.cores.max should not affect things when they
are running in local mode.
And I am trying to think of scenarios that could cause a broadcast
variable used in the current job to fall out of scope, but they all seem
very far-fetched. So I am really curious to see the code.
So this is a bug that is still unsolved (for Java)?
From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 18 July 2014 4:52 PM
To: user@spark.apache.org
Subject: error from DecisonTree Training:
Hi All,
I got an error while using DecisionTreeModel (my program is written in Java,
Spark 1.0.1, Scala
You can also try a different region. I tested us-west-2 yesterday, and
it worked well. -Xiangrui
On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working
on it. -Xiangrui
On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote:
So this is a bug that is still unsolved (for Java)?
From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 18 July 2014 4:52
Ah, sorry, sorry, my brain is just fried… I sent some wrong information:
not “spark.cores.max” but the minPartitions in sc.textFile().
Best,
--
Nan Zhu
On Monday, July 21, 2014 at 7:17 PM, Tathagata Das wrote:
That is definitely weird. spark.cores.max should not affect things when they
I'm not sure if you guys ever picked a preferred method for doing this, but
I just encountered it and came up with this method that's working
reasonably well on a small dataset. It should be quite easily
generalizable to non-String RDDs.
def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
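One way to fill in a body for that signature (not necessarily the poster's approach) is RDD.zipWithIndex, available since Spark 1.0:

import org.apache.spark.rdd.RDD

def addRowNumber(r: RDD[String]): RDD[(Long, String)] =
  // zipWithIndex assigns consecutive indices in partition order
  r.zipWithIndex().map { case (line, idx) => (idx, line) }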
That is nice.
Thanks Xiangrui.
-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, 22 July 2014 9:31 AM
To: user@spark.apache.org
Subject: Re: error from DecisonTree Training:
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is
Hi all,
Here is my fix: https://github.com/apache/spark/pull/1356. It is not
elegant, but it works well. Any suggestions?
Hi
I have a peculiar problem.
I have two data sets (large ones).
Data set 1:
((timestamp),iterable[Any]) = {
(2014-07-10T00:02:45.045+,ArrayBuffer((2014-07-10T00:02:45.045+,98.4859,22)))
(2014-07-10T00:07:32.618+,ArrayBuffer((2014-07-10T00:07:32.618+,75.4737,22)))
}
DataSet2:
I have the same problem
On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote:
Hi,
Everyone, I have the following piece of code. When I run it,
an error occurred, just like below; it seems that the SparkContext is not
serializable, but I do not try to use the SparkContext
First of all, I do not know Scala, but I am learning.
I'm doing a proof of concept by streaming content from a socket, counting
the words, and writing them to a Tachyon disk. A different script will read the
file stream and print out the results.
val lines = ssc.socketTextStream(args(0), args(1).toInt,
This is a very interesting problem. Spark SQL supports non-equi joins, but they
are very inefficient with large tables.
One possible solution is to make both tables partition-based, with partition keys
of (cast(ds as bigint) / 240); then, with each partition in dataset1, you
probably can
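A minimal sketch of that bucketing idea with the RDD API (assuming a running SparkContext sc; the data below is made up and already keyed by a timestamp in seconds; the 240-second bucket width comes from the suggestion above, and records near a bucket boundary may need to be emitted into both adjacent buckets):

import org.apache.spark.SparkContext._   // pair-RDD functions such as join (needed pre-1.3)

// hypothetical stand-ins: (timestampSeconds, payload) pairs
val dataset1 = sc.parallelize(Seq((1404957765L, "a"), (1404958052L, "b")))
val dataset2 = sc.parallelize(Seq((1404957800L, "x"), (1404999999L, "y")))

// key each record by its 240-second bucket, then join records in the same bucket
val bucketed1 = dataset1.map { case (ts, v) => (ts / 240, (ts, v)) }
val bucketed2 = dataset2.map { case (ts, v) => (ts / 240, (ts, v)) }
val candidates = bucketed1.join(bucketed2)   // refine by exact timestamps afterwards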
Bill,
numPartitions means the number of Spark partitions that the data received
from Kafka will be split into. It has nothing to do with Kafka partitions, as
far as I know.
If you create multiple Kafka consumers, it doesn't seem like you can
specify which consumer will consume which Kafka
Hi,
I'm new to the big-data world and I'm trying to understand the use
cases of several projects. Of course one of them is Spark.
I want to know: what is the proper way of examining my data that resides
on my MySQL server?
Imagine that I'm saving every page view of a user with the
Hi,
I started to use Spark on YARN recently and found a problem while
tuning my program.
When the SparkContext is initialized as sc and is ready to read a text file from HDFS,
the textFile(path, defaultMinPartitions) method is called.
I traced the second parameter down in the Spark source code
Sandy,
I just tried the standalone cluster and haven't had a chance to try YARN yet.
So if I understand correctly, there is *no* special handling of task
assignment according to the HDFS block's location when Spark is running as a
*standalone* cluster.
Please correct me if I'm wrong. Thank
On Mon, Jul 21, 2014 at 8:05 PM, Bin WU bw...@connect.ust.hk wrote:
I am not sure how to specify different initial values for each node in the
graph. Moreover, I am wondering why the initial message is necessary. I think
we can instead initialize the graph and then pass it into the Pregel interface?
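For what it's worth, a minimal sketch of that idea (the tiny graph and the Double shortest-path-style vertex state below are made up; sc is an existing SparkContext): initialize the per-vertex values with mapVertices, then run Pregel. The initial message is still needed because the vertex program is invoked once with it on the first superstep.

import org.apache.spark.graphx._

// a tiny example graph (hypothetical)
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)
val sourceId = 1L

// give each vertex its own starting value before running Pregel
val initialized = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val result = Pregel(initialized, Double.PositiveInfinity)(
  (id, attr, msg) => math.min(attr, msg),      // vertex program
  triplet =>                                   // send messages
    if (triplet.srcAttr + 1.0 < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + 1.0))
    else Iterator.empty,
  (a, b) => math.min(a, b)                     // merge messages
)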
Does anyone know what this error means:
14/07/21 23:07:22 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
14/07/21 23:07:22 INFO TaskSetManager: Starting task 3.0:0 as TID 1620 on
executor 27: r104u05.oculus.local (PROCESS_LOCAL)
14/07/21 23:07:22 INFO TaskSetManager: Serialized task
well, I think you missed this line of code in SparkContext.scala
line 1242-1243(master):
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
so the defaultMinPartitions will be 2 unless the
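A small illustration of the consequence (the HDFS path is hypothetical): the second argument of textFile is only a lower-bound hint for the number of splits, and when you omit it the hint defaults to at most 2.

println(sc.defaultMinPartitions)                        // math.min(defaultParallelism, 2)
val byDefault  = sc.textFile("hdfs:///some/path")       // uses the default hint above
val morePieces = sc.textFile("hdfs:///some/path", 64)   // ask for at least ~64 partitions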
Yes, great. I thought it was math.max instead of math.min on that line. Thank
you!
From: Ye Xianjin [mailto:advance...@gmail.com]
Sent: Tuesday, July 22, 2014 11:37 AM
To: user@spark.apache.org
Subject: Re: defaultMinPartitions in textFile
well, I think you miss this line of code in
Hi Chen,
I am new to Spark as well as Spark SQL. Could you please explain how
I would create a table and run a query on top of it? That would be super
helpful.
Thanks,
D.
Actually it's just a pseudo-algorithm I described; you can do it with the Spark
API. Hope the algorithm is helpful.
-Original Message-
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 11:56 AM
To: u...@spark.incubator.apache.org
Subject: RE: Joining by timestamp.
Hi Chen,
Hi Chen,
Thank you very much for your reply. I think I do not understand how I can do
the join using the Spark API. If you have time, could you please write some
code?
Thanks again,
D.
Durga, you can start from the documentation:
http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/programming-guide.html
-Original Message-
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 12:45 PM
To: