Do you have specific reasons to use Ceph? I used Ceph before, and I'm not too
in love with it, especially when I was using the Ceph Object Gateway S3 API.
There are some incompatibilities with the AWS S3 API. You really need to
try it before making the commitment. Did you manage to install it?
On
Hi Amit,
Have you looked at Amazon EMR? Most people using EMR use S3 for persistence
(both as input and output of Spark jobs).
Best Regards,
Jerry
Sent from my iPhone
> On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote:
>
>
> A lot of places in the documentation mention
I just noticed you found that 1.4 has the same issue. I added that to the
ticket as well.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote:
> Hi Yin,
>
> You are right! I just tried the scala version with the above lines, it
> works as expected.
> I'm not sure if it happens als
actually a bit. I created a
ticket for this (SPARK-10731
<https://issues.apache.org/jira/browse/SPARK-10731>).
Best Regards,
Jerry
On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai wrote:
> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
>
>
aster. Not to mention that if I do:
df.rdd.take(1) // runs much faster.
Is this expected? Why are head/first/take so slow for a DataFrame? Is it a bug
in the optimizer, or did I do something wrong?
Best Regards,
Jerry
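A minimal sketch reproducing the comparison above in a 1.5-era spark-shell; the tiny inlined DataFrame is only a stand-in for whatever query actually produced the slow result in this thread:
import sqlContext.implicits._
val df = sc.parallelize(1 to 1000000).map(i => (i, s"name$i")).toDF("id", "name")
val viaDataFrame = df.head(1)   // goes through the DataFrame/limit path reported as slow
val viaRdd = df.rdd.take(1)     // converts to an RDD first; reported much faster in the thread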
- what language are the developers comfortable with?
- what are the components in the system that will constrain the choice of
the language?
Best Regards,
Jerry
On Tue, Sep 8, 2015 at 11:59 AM, Dean Wampler wrote:
> It's true that Java 8 lambdas help. If you've read Learning Spark
Thanks!
On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin wrote:
> On Wed, Aug 26, 2015 at 2:03 PM, Jerry wrote:
> > Assuming you're submitting the job from the terminal; when main() is called,
> if I
> > try to open a file locally, can I assume the machine is always the one I
>
a cluster. The
file I'm opening is purely for the driver program and not something the
worker nodes are going to read from.
Thanks,
Jerry
Hi.
I want to parse a file and return key-value pairs with pySpark, but the
result is strange to me.
The test.sql is a big file and each line is a username and password, with
# between them. I use the mapper2 below to map the data, and in my
understanding, i in words.take(10) should be a tuple, but the result
is
Hi Prabeesh,
That's even better!
Thanks for sharing
Jerry
On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. wrote:
> Refer this post
> http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/
>
> Spark + Jupyter + Docker
>
> On 18 August 2015 at 21:29, Je
So from what I understand, those usually pull dependencies for a given
project? I'm able to run the spark shell, so I'd assume I have everything.
What am I missing from the big picture, and what directory do I run Maven in?
Thanks,
Jerry
On Tue, Aug 18, 2015 at 11:15 AM, Ted
Hi Guru,
Thanks! Great to hear that someone tried it in production. How do you like
it so far?
Best Regards,
Jerry
On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani wrote:
> Hi Jerry,
>
> Yes. I’ve seen customers using this in production for data science work.
> I’m currently us
ot exist
import org.apache.spark.sql.hive.*;
Let me know what I'm doing wrong.
Thanks,
Jerry
cannot do this.
Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython
already offered years ago. It would be great if someone could educate me on
the reason behind this.
Best Regards,
Jerry
server:
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No
such file or directory
[FAILED]
Best Regards,
Jerry
On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
> Howdy folks!
>
> I’m interested in hearing about what people
12)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
On Fri, Aug 14, 2015 at 1:39 PM, Jerry wrote:
> Hi Salih,
> Normally I do sort before performing that operation, but since I've been
> trying to get this working for a week, I'm just loading something simple to
> test if la
ther. Hopefully those links
point me to something useful. Let me know if you can run the above code /
what you did differently to get that code to run.
Thanks,
Jerry
On Fri, Aug 14, 2015 at 1:23 PM, Salih Oztop wrote:
> Hi Jerry,
> This blog post is perfect for window function
So it seems like dataframes aren't going to give me a break and just work. Now
it evaluates but goes nuts if it runs into a null case OR doesn't know how
to get the correct data type when I specify the default value as a string
expression. Let me know if anyone has a workaround for this. PLEASE HELP
M
Great stuff, Tim. This definitely will make Mesos users' lives easier.
Sent from my iPad
On 2015-08-12, at 11:52, Haripriya Ayyalasomayajula
wrote:
> Thanks Tim, Jerry.
>
> On Wed, Aug 12, 2015 at 1:18 AM, Tim Chen wrote:
> Yes the options are not that configurable yet but I think i
Just out of curiosity, what is the advantage of using parquet without hadoop?
Sent from my iPhone
> On 11 Aug, 2015, at 11:12 am, wrote:
>
> I confirm that it works,
>
> I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450
>
> Saif
>
> From: Ellafi, Saif A.
> S
My experience with Mesos + Spark is not great. I saw one executor with 30 CPUs
and the other executor with 6. So I don't think you can easily configure it
without some tweaking of the source code.
Sent from my iPad
On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula
wrote:
> Hi Tim,
>
> Spark
By the way, if Hive is present in the Spark install, does it show up in the text
when you start the spark shell? Are there any commands I can run to check if it
exists? I didn't set up the Spark machine that I use, so I don't know what's
present or absent.
Thanks,
Jerry
On Mon, Aug 10,
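A quick check, offered as a suggestion rather than something stated in the thread: in a 1.x spark-shell the pre-built sqlContext is a HiveContext only when Spark was compiled with Hive support, so printing its class answers the question:
sqlContext.getClass.getName
// "org.apache.spark.sql.hive.HiveContext" => Hive support is present
// "org.apache.spark.sql.SQLContext"       => plain SQLContext, no Hive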
Thanks... looks like I now hit that bug about HiveMetaStoreClient as I
now get the message about being unable to instantiate it. On a side note,
does anyone know where hive-site.xml is typically located?
Thanks,
Jerry
On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust
wrote:
>
pointed to with -cp
when starting the spark shell, so all I do is "Test.run(sc)" in shell.
Let me know what to look for to debug this problem. I'm not sure where to
look to solve this problem.
Thanks,
Jerry
Hi Akshat,
Is there a particular reason you don't use s3a? From my experience, s3a performs
much better than the rest. I believe the inefficiency is from the
implementation of the s3 interface.
Best Regards,
Jerry
Sent from my iPhone
> On 9 Aug, 2015, at 5:48 am, Akhil Da
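If it helps, a minimal sketch of pointing a job at s3a; the bucket, the credentials, and the need for the hadoop-aws/aws-sdk jars on the classpath are assumptions, not something stated in this thread:
sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")  // placeholder credentials
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")
val logs = sc.textFile("s3a://my-bucket/logs/*.gz")              // s3a:// instead of s3:// or s3n://
println(logs.count())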
experience this problem before?
Best Regards,
Jerry
one is based on.
Thank you for your help!
Jerry
On Thu, Jul 30, 2015 at 11:10 AM, Ted Yu wrote:
> The files were dated 16-Jul-2015
> Looks like nightly build either was not published, or published at a
> different location.
>
> You can download spark-1.5.0-SNAPSHOT.tgz and binary
gards,
Jerry
pping. The speed is 4x faster with
the data-without-mapping, which means that the more columns a parquet file
has, the slower it is, even when only a specific column is needed.
Does anyone have an explanation for this? I was expecting both of them to
finish in approximately the same time.
Best Regards,
Jerry
lared
type (org.apache.parquet.io.api.Binary) does not match the schema found in
file metadata. Column item is of type: FullTypeDescriptor(PrimitiveType:
BINARY, OriginalType: ENUM)
Valid types for this column are: null
Is it because Spark does not recognize the ENUM type in Parquet?
Best Regards,
Jerry
On Wed, Jul 22, 201
Hi guys,
I noticed that too. Anders, can you confirm that it works on the Spark 1.5
snapshot? This is what I tried in the end. It seems it is a 1.4 issue.
Best Regards,
Jerry
On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg wrote:
> No, never really resolved the problem, except by increasing
sing that dispatcher ?
> ------
> *From:* Jerry Lam [chiling...@gmail.com]
> *Sent:* Monday, July 20, 2015 8:27 AM
> *To:* Jahagirdar, Madhu
> *Cc:* user; d...@spark.apache.org
> *Subject:* Re: Spark Mesos Dispatcher
>
> Yes.
>
> Sent from my iPhone
>
> On 19 Jul,
Yes.
Sent from my iPhone
> On 19 Jul, 2015, at 10:52 pm, "Jahagirdar, Madhu"
> wrote:
>
> All,
>
> Can we run different version of Spark using the same Mesos Dispatcher. For
> example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ?
>
> Regards,
> Madhu Jahagirdar
>
> Th
Hi Nikunj,
Sorry, I totally misread your question.
I think you need to first groupByKey (get all values of the same key together),
then follow with mapValues (probably put the values into a set and then take its
size, because you want a distinct count).
HTH,
Jerry
Sent from my iPhone
>
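A minimal sketch of the groupByKey + mapValues approach described above, with a tiny inlined pair RDD standing in for the windowed DStream's data:
val pairs = sc.parallelize(Seq(("k1", "a"), ("k1", "a"), ("k1", "b"), ("k2", "c")))
val distinctCounts = pairs
  .groupByKey()                            // gather every value of a key together
  .mapValues(values => values.toSet.size)  // distinct count per key
distinctCounts.collect().foreach(println)  // (k1,2), (k2,1)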
You mean this does not work?
SELECT key, count(value) from table group by key
On Sun, Jul 19, 2015 at 2:28 PM, N B wrote:
> Hello,
>
> How do I go about performing the equivalent of the following SQL clause in
> Spark Streaming? I will be using this on a Windowed DStream.
>
> SELECT key, coun
-Bits-and-Bytes.html
>
> Probably if they re-ran the benchmarks with the 1.5/Tungsten line it would close the
> gap a bit (or a lot), with Spark moving towards a similar style of off-heap memory
> mgmt and more planning optimizations
>
>
> *From:* Jerry Lam [mailto:chiling...@gmail.com ]
> *Sent:* Sun
in
comparisons to Flink is one of the immediate questions I have. It would be
great if they made the benchmark software available somewhere for other
people to experiment with.
just my 2 cents,
Jerry
On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu wrote:
> There was no mentioning of the versions of Flink
spark. However, I didn't use the
spark-csv package; I did that manually, so I cannot comment on
spark-csv.
HTH,
Jerry
On Thu, Feb 5, 2015 at 9:32 AM, Spico Florin wrote:
> Hello!
> I'm using spark-csv 2.10 with Java from the maven repository
> com.databricks
> s
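A rough sketch of the "manual" route mentioned above, splitting each line instead of using spark-csv; the file path, delimiter, and column layout are assumptions:
case class Record(name: String, age: Int)
val records = sc.textFile("data.csv")
  .map(_.split(",", -1))                         // -1 keeps trailing empty fields
  .filter(_.length >= 2)                         // drop malformed lines
  .map(a => Record(a(0).trim, a(1).trim.toInt))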
Hi Deep,
How do you know the cluster is not responsive because of "Union"?
Did you check the spark web console?
Best Regards,
Jerry
On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan
wrote:
> The cluster hangs.
>
> On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam wrote:
>
>>
Hi Deep,
what do you mean by stuck?
Jerry
On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan
wrote:
> Hi,
> Is there any better operation than Union. I am using union and the cluster
> is getting stuck with a large data set.
>
> Thank you
>
I'm not affiliated
with Cloudera, but it seems they are the only ones who are very active in the
Spark project and provide a Hadoop distribution.
HTH,
Jerry
btw, who is Paco Nathan?
On Thu, Jan 22, 2015 at 10:03 AM, Babu, Prashanth <
prashanth.b...@nttdata.com> wrote:
> Sudipta,
>
>
Hi guys,
Does this issue affect 1.2.0 only or all previous releases as well?
Best Regards,
Jerry
On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao wrote:
>
> Yes, the problem is, I've turned the flag on.
>
> One possible reason for this is, the parquet file supports "pr
mance wasn't that bad at all. If it is not indexed,
I would expect it to take a much longer time.
Can IndexedRDD be sorted by keys as well?
Best Regards,
Jerry
On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash wrote:
> Hi Jem,
>
> Linear time in scaling on the big table doesn't seem
avro objects.
I'm thinking of overriding the saveAsParquetFile method to allow me to
persist the Avro schema inside Parquet. Is this possible at all?
Best Regards,
Jerry
On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:
> I cam
to do that. Is there
another API that allows me to do this?
Best Regards,
Jerry
->E.
Is this something already possible with Spark/Tachyon? If not, do you think
it is possible? Does anyone mind sharing their experience capturing
data lineage in a data processing pipeline?
Best Regards,
Jerry
on.toRdd$lzycompute(HiveContext.scala:382)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
Is this supported?
Best Regards,
Jerry
Michael,
Thanks. Is this still turned off in the released 1.2? Is it possible to
turn it on just to get an idea of how much of a difference it makes?
-Jerry
On 05/12/14 12:40 am, Michael Armbrust wrote:
I'll add that some of our data formats will actually infer this sort of
useful inform
Hi Sean and Madhu,
Thank you for the explanation. I really appreciate it.
Best Regards,
Jerry
On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen wrote:
> coalesce actually changes the number of partitions. Unless the
> original RDD had just 1 partition, coalesce(1) will make an RDD
Hi Spark users,
I wonder if val resultRDD = RDDA.union(RDDB) will always have records in
RDDA before records in RDDB.
Also, will resultRDD.coalesce(1) change this ordering?
Best Regards,
Jerry
Thanks, that helped. And I needed SchemaRDD.as() to provide an alias for
the RDD.
-Jerry
On 17/12/14 12:12 pm, Tobias Pfeiffer wrote:
Jerry,
On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj <jerry@gmail.com> wrote:
Another problem with the DSL:
t1.where('term == "
files)
In some scenarios, Hadoop is faster because it is saving one stage. Did I
do something wrong?
Best Regards,
Jerry
On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust
wrote:
>
> You can create an RDD[String] using whatever method and pass that to
> jsonRDD.
>
> On Wed, Dec
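A sketch of the suggestion above (build an RDD[String] however you like and hand it to jsonRDD); the LZO input format class comes from the hadoop-lzo package rather than Spark, and the path is a placeholder:
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo, assumed on the classpath

val lines = sc.newAPIHadoopFile(
  "hdfs:///data/events.json.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map(_._2.toString)                   // keep only the line text

val events = sqlContext.jsonRDD(lines)  // infer the schema from the JSON strings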
Hi Ted,
Thanks for your help.
I'm able to read lzo files using sparkContext.newAPIHadoopFile but I
couldn't do the same for sqlContext because sqlContext.jsonFile does not
provide ways to configure the input file format. Do you know if there are
some APIs to do that?
Best Regards,
Jer
Hi spark users,
Do you know how to read json files using Spark SQL that are LZO compressed?
I'm looking into sqlContext.jsonFile but I don't know how to configure it
to read lzo files.
Best Regards,
Jerry
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero. But
sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700,
which I know is correct from the data. Is there something wrong with how
I'm using the DSL?
Thanks
On
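One likely explanation, offered as a guess rather than a confirmed diagnosis: 'term == "dmin" is plain Scala equality between a Symbol and a String, which is always false, so the filter drops every row; the DSL's equality operator is ===. A one-line sketch against the t1 table from the message above:
val n = t1.where('term === "dmin").count()  // === builds a column-equality expression; == does not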
, on = 't1.user_id == t2.user_id)
nor
t1.join(t2, on = Some('t1.user_id == t2.user_id))
work, or even compile. I could not find any examples of how to perform a
join using the DSL. Any pointers will be appreciated :)
Thanks
-Jerry
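For reference, a sketch of how the later DataFrame API (Spark 1.3+) expresses this join; it is not the SchemaRDD DSL asked about above, and t1, t2, and user_id are taken from the snippet:
val joined = t1.join(t2, t1("user_id") === t2("user_id"))  // equi-join on user_id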
Hi Mark,
Thank you for helping out.
The items I got back from Spark SQL has the type information as follows:
scala> items
res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])]
I tried to iterate the items as you suggested but no luck.
Best Regards,
Jerry
On Mon, Dec
not find a method that lets me do that. The farthest I can get
is converting items.toSeq. The type information I got back is:
scala> items.toSeq
res57: Seq[Any] = [WrappedArray([1,orange],[2,apple])]
Any suggestion?
Best Regards,
Jerry
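A small sketch of one way to pull the nested rows out; the positions (array at ordinal 0, id then name inside each element) are assumptions based on the printed value above:
import org.apache.spark.sql.Row
val inner = items(0).asInstanceOf[Seq[Row]]  // the WrappedArray of structs
val pairs = inner.map(r => (r(0), r(1)))     // e.g. List((1,orange), (2,apple))
pairs.foreach(println)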
ere(contain('item, "name",
"apple")).collect()
the contain function will loop through the items looking for "name" = "apple",
with early stopping.
Is this possible? If yes, how one implements the contain function?
Best Regards,
Jerry
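A minimal sketch of the early-stopping predicate itself as a plain Scala function; how the item column deserializes (assumed here to be a Seq of (id, name) tuples) and how it would be registered with Spark SQL are left open:
def contain(items: Seq[(Int, String)], value: String): Boolean =
  items.exists { case (_, name) => name == value }  // exists short-circuits at the first match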
Hi,
If I create a SchemaRDD from a file that I know is sorted on a certain
field, is it possible to somehow pass that information on to Spark SQL
so that SQL queries referencing that field are optimized?
Thanks
-Jerry
tion in thread "main" java.lang.RuntimeException: [1.57] failure:
``('' expected but identifier myudf found
I also tried returning a List of Ints, but that did not work either. Is
there a way to write a UDF that returns a list?
Thanks
-Jerry
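For what it's worth, a sketch of registering a UDF that returns a list with the later sqlContext.udf API (Spark 1.3+); the UDF name, its body, and the table t1 are placeholders, and this does not address the parser error quoted above:
sqlContext.udf.register("myudf", (s: String) => s.split(" ").map(_.length).toSeq)  // returns Seq[Int]
sqlContext.sql("SELECT myudf(name) FROM t1").show()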
sing Spark SQL. It is a good starting point.
Best Regards,
Jerry
On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak wrote:
> Hi Michael,
> Thanks. I am not creating HiveContext, I am creating SQLContext. I am
> using CDH 5.1. Can you please let me know which conf/ directory you
Hi Sameer,
Maybe this page will help you:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Best Regards,
Jerry
On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak wrote:
> Hi All,
> I am trying to load data from Hive tables using Spark SQL. I am using
> sp
spark.
Best Regards,
Jerry
Sent from my iPad
> On Jul 24, 2014, at 6:53 AM, Sameer Sayyed wrote:
>
> Hello All,
>
> I am new user of spark, I am using cloudera-quickstart-vm-5.0.0-0-vmware for
> execute sample examples of Spark.
> I am very sorry for silly and basic qu
it is executed in spark regardless of
dialect although the execution might be different for the same query."
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust
wrote:
> hql and sql are just two different dialects for interacting with data.
> After parsing is complete
i.e. --jars A.jar,B.jar,C.jar not --jars A.jar, B.jar, C.jar
I'm just guessing, because when I use --jars I never have spaces in it.
HTH,
Jerry
On Wed, Jul 16, 2014 at 5:30 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi Team,
>
> Now i've changed my
that
HiveContext needs a metastore and has more powerful SQL support
borrowed from Hive. Can you shed some light on this when you get a minute?
Thanks,
Jerry
On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust
wrote:
> No, that is why I included the link to SPARK-2096
<https://issues.apache.org/jira/browse/SPARK-2483> seems to
address only HiveQL.
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 3:38 AM, anyweil wrote:
> Thank you so much for the information, now i have merge the fix of #1411
> and
> seems the HiveSQL works with:
> SELECT name FROM
ated code. Uber-jar it and run it just like any other simple
Java program. If you still have connection issues, then at least you know
the problem is in the configuration.
HTH,
Jerry
On Tue, Jul 15, 2014 at 12:10 PM, Krishna Sankar
wrote:
> One vector to check is the HBase libraries in the
ApplicationMaster, the SparkSubmit will return "yarnAppState: KILLED" and
then terminate itself. This is what happens for me using CDH 5.0.2.
Which distribution of Hadoop are you using?
On Tue, Jul 15, 2014 at 10:42 AM, Jerry Lam wrote:
> when I use yarn application -kill, both Sp
lication -kill" If you do jps You'll have a list
> of SparkSubmit and ApplicationMaster
>
> After you use yarn applicaton -kill you only kill the SparkSubmit
>
>
>
> On Mon, Jul 14, 2014 at 4:29 PM, Jerry Lam wrote:
>
>> Then yarn application -kill appi
Hi Rajesh,
can you describe your spark cluster setup? I saw localhost:2181 for
zookeeper.
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi Team,
>
> Could you please help me to resolve the issue.
>
> *Iss
Then yarn application -kill appid should work. This is what I did 2 hours ago.
Sorry I cannot provide more help.
Sent from my iPhone
> On 14 Jul, 2014, at 6:05 pm, "hsy...@gmail.com" wrote:
>
> yarn-cluster
>
>
>> On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam
Hi Siyuan,
I wonder if you use --master yarn-cluster or yarn-client?
Best Regards,
Jerry
On Mon, Jul 14, 2014 at 5:08 PM, hsy...@gmail.com wrote:
> Hi all,
>
> A newbie question, I start a spark yarn application through spark-submit
>
> How do I kill this app. I can kill
rn of spark, but maybe not.
HTH,
Jerry
On Mon, Jul 14, 2014 at 3:09 PM, Matei Zaharia
wrote:
> You currently can't use SparkContext inside a Spark task, so in this case
> you'd have to call some kind of local K-means library. One example you can
> try to use is Weka (http:/
tesianProduct
HiveTableScan [], (MetastoreRelation test, m, None), None
HiveTableScan [id#106], (MetastoreRelation test, s, Some(s)), None
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust
wrote:
> Hi Jerry,
>
> Thanks for reporting this. It would be helpf
overhead, then there must be something additional
that SparkSQL adds to the overall overhead that Hive doesn't have.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust
wrote:
> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam wrote:
>
>> For the curious mind, the
Hi Spark users,
Also, to put the performance issue into perspective, we also ran the query
on Hive. It took about 5 minutes to run.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam wrote:
> By the way, I also try hql("select * from m").count. It is terrib
By the way, I also tried hql("select * from m").count. It is terribly slow
too.
On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam wrote:
> Hi Spark users and developers,
>
> I'm doing some simple benchmarks with my team and we found out a potential
> performance issue usi
sing pure spark.
I wonder if anyone knows what causes the performance issue?
For the curious mind, the dataset is about 200-300GB and we are using 10
machines for this benchmark. Given that the env is equal between the two
experiments, why is pure Spark faster than Spark SQL?
Best Regards,
Jerry
on (s.id=m_id)").collect().foreach(println)
It will work. Am I doing something wrong, or is it a bug in Spark SQL?
Best Regards,
Jerry
rk as a
>>> separate service (just like MySQL and JDBC, for example). With spark-submit
>>> I'm bound to Spark as a main framework that defines how my application
>>> should look like. In my humble opinion, using Spark as embeddable library
>>> rather than ma
+1 as well for being able to submit jobs programmatically without using a
shell script.
We also experienced issues submitting jobs programmatically without using
spark-submit. In fact, even in the Hadoop world, I rarely used "hadoop jar"
to submit jobs from a shell.
On Wed, Jul 9, 2014 at 9:47 AM,
and trigger the error you
saw. By reducing the number of cores, there are more CPU resources
available to a task, so the GC could finish before the error gets thrown.
HTH,
Jerry
On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson wrote:
> There is a difference from actual GC overhead, which
Hi guys,
I ended up reserving a room at the Phoenix (Hotel:
http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel)
recommended by my friend who has been in SF.
According to Google, it takes 11 minutes to walk to the conference, which is
not too bad.
Hope this helps!
Jerry
Hi Spark users,
Do you guys plan to go to the Spark Summit? Can you recommend any hotel near
the conference? I'm not familiar with the area.
Thanks!
Jerry
spark,
I don't think you can get a lot of performance from scanning HBase unless
you are talking about caching the results from HBase in Spark and reusing them
over and over.
HTH,
Jerry
On Wed, Apr 9, 2014 at 12:02 PM, David Quigley wrote:
> Hi all,
>
> We are currently using hbase
Hi Shark,
Should I assume that Shark users should not use the Shark APIs since there
is no documentation for them? If there is documentation, can you point it
out?
Best Regards,
Jerry
On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam wrote:
> Hello everyone,
>
> I have successfully
T * FROM users WHERE age < 20")
scala> println(youngUsers.count)
...
scala> val featureMatrix = youngUsers.map(extractFeatures(_))
scala> kmeans(featureMatrix)
Is there more complete sample code for starting a program using the Shark API
in Spark?
Thanks!
Jerry