the developers are comfortable with?
- what are the components in the system that will constrain the choice of
the language?
Best Regards,
Jerry
On Tue, Sep 8, 2015 at 11:59 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
> It's true that Java 8 lambdas help. If you've read Learning Spa
Thanks!
On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Wed, Aug 26, 2015 at 2:03 PM, Jerry jerry.c...@gmail.com wrote:
Assuming you're submitting the job from the terminal; when main() is called,
if I try to open a file locally, can I assume the machine is always
Hi.
I want to parse a file and return key-value pairs with pySpark, but the
result is strange to me.
The test.sql is a big file and each line is a username and password, with
# between them. I use the mapper2 below to map the data, and in my
understanding, i in words.take(10) should be a tuple, but the result
is
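Since the mail is truncated, the exact mapper is not shown; the following is a plain-Python sketch of what such a mapper typically looks like (the function name mapper2 and the '#' separator come from the mail, the splitting logic and sample lines are assumptions):

```python
# Plain-Python sketch of the mapper described above; no Spark needed to test
# the logic. Splitting on the first '#' is an assumption based on the mail.
def mapper2(line):
    """Turn 'username#password' into a (username, password) tuple."""
    username, _, password = line.strip().partition("#")
    return (username, password)

lines = ["alice#secret1", "bob#hunter2"]          # made-up sample records
pairs = [mapper2(line) for line in lines]         # in pySpark: rdd.map(mapper2).take(10)
```

One common surprise worth checking: `line.split("#")` returns a list, not a tuple, which may be the "strange" result the poster is seeing.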
Hi Guru,
Thanks! Great to hear that someone tried it in production. How do you like
it so far?
Best Regards,
Jerry
On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani gdm...@gmail.com wrote:
Hi Jerry,
Yes. I’ve seen customers using this in production for data science work.
I’m currently
org.apache.spark.sql.hive.*;
Let me know what I'm doing wrong.
Thanks,
Jerry
Hi Prabeesh,
That's even better!
Thanks for sharing
Jerry
On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. prabsma...@gmail.com wrote:
Refer this post
http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/
Spark + Jupyter + Docker
On 18 August 2015 at 21:29, Jerry Lam
cannot do this.
Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython
already offered years ago. It would be great if someone could educate me on the
reason behind this.
Best Regards,
Jerry
into server:
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No
such file or directory
[FAILED]
Best Regards,
Jerry
On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Howdy folks!
I’m interested in hearing about what people think of spark-ec2
So it seems like dataframes aren't going to give me a break and just work. Now
it evaluates, but it goes nuts if it runs into a null case OR doesn't know how
to get the correct data type when I specify the default value as a string
expression. Let me know if anyone has a workaround for this. PLEASE HELP
those links
point me to something useful. Let me know if you can run the above code, or
what you did differently to get that code to run.
Thanks,
Jerry
On Fri, Aug 14, 2015 at 1:23 PM, Salih Oztop soz...@yahoo.com wrote:
Hi Jerry,
This blog post is perfect for window functions in Spark.
https
)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
On Fri, Aug 14, 2015 at 1:39 PM, Jerry jerry.c...@gmail.com wrote:
Hi Salih,
Normally I do sort before
Great stuff Tim. This definitely will make Mesos users life easier
Sent from my iPad
On 2015-08-12, at 11:52, Haripriya Ayyalasomayajula aharipriy...@gmail.com
wrote:
Thanks Tim, Jerry.
On Wed, Aug 12, 2015 at 1:18 AM, Tim Chen t...@mesosphere.io wrote:
Yes the options
My experience with Mesos + Spark is not great. I saw one executor with 30 CPUs
and the other executor with 6. So I don't think you can easily configure it
without some tweaking of the source code.
Sent from my iPad
On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula aharipriy...@gmail.com
Just out of curiosity, what is the advantage of using parquet without hadoop?
Sent from my iPhone
On 11 Aug, 2015, at 11:12 am, saif.a.ell...@wellsfargo.com wrote:
I confirm that it works,
I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450
Saif
From:
the spark shell, so all I do is Test.run(sc) in shell.
Let me know what to look for to debug this problem. I'm not sure where to
look to solve this problem.
Thanks,
Jerry
By the way, if Hive is present in the Spark install, does it show up in the text
when you start the spark shell? Are there any commands I can run to check if it
exists? I didn't set up the Spark machine that I use, so I don't know what's
present or absent.
Thanks,
Jerry
On Mon, Aug 10, 2015 at 2:38 PM
Hi Akshat,
Is there a particular reason you don't use s3a? From my experience, s3a performs
much better than the rest. I believe the inefficiency is in the
implementation of the s3 interface.
Best Regards,
Jerry
Sent from my iPhone
On 9 Aug, 2015, at 5:48 am, Akhil Das ak
before?
Best Regards,
Jerry
,
Jerry
on.
Thank you for your help!
Jerry
On Thu, Jul 30, 2015 at 11:10 AM, Ted Yu yuzhih...@gmail.com wrote:
The files were dated 16-Jul-2015
Looks like nightly build either was not published, or published at a
different location.
You can download spark-1.5.0-SNAPSHOT.tgz and binary-search
. The speed is 4x faster in
the data-without-mapping, which means that the more columns a Parquet file
has, the slower it is, even when only a specific column is needed.
Does anyone have an explanation for this? I was expecting both of them to
finish in approximately the same time.
Best Regards,
Jerry
Hi guys,
I noticed that too. Anders, can you confirm that it works on the Spark 1.5
snapshot? This is what I tried in the end. It seems it is a 1.4 issue.
Best Regards,
Jerry
On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote:
No, never really resolved the problem, except
You mean this does not work?
SELECT key, count(value) from table group by key
On Sun, Jul 19, 2015 at 2:28 PM, N B nb.nos...@gmail.com wrote:
Hello,
How do I go about performing the equivalent of the following SQL clause in
Spark Streaming? I will be using this on a Windowed DStream.
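The GROUP BY query Jerry quotes can be mimicked without SQL; here is a plain-Python sketch of its semantics (the mapping onto a windowed DStream, e.g. via reduceByKeyAndWindow, is an assumption about the poster's setup):

```python
from collections import defaultdict

# Plain-Python sketch of SELECT key, count(value) FROM table GROUP BY key.
# In Spark Streaming this logic would typically be expressed with
# reduceByKeyAndWindow over (key, 1) pairs, but that part is assumed.
def count_by_key(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        if value is not None:   # SQL count(value) skips NULLs
            counts[key] += 1
    return dict(counts)

result = count_by_key([("a", 1), ("a", 2), ("b", 3), ("b", None)])
```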
Yes.
Sent from my iPhone
On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu
madhu.jahagir...@philips.com wrote:
All,
Can we run different versions of Spark using the same Mesos Dispatcher? For
example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?
Regards,
Madhu
Hi Nikunj,
Sorry, I totally misread your question.
I think you need to first groupByKey (get all values of the same key together),
then follow with mapValues (probably put the values into a set and then take the
size of it, because you want a distinct count).
HTH,
Jerry
Sent from my iPhone
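The groupByKey-then-mapValues recipe Jerry describes can be sketched in plain Python (the Spark one-liner in the comment is the usual shape of this pattern; the sample pairs are made up):

```python
from collections import defaultdict

# Plain-Python sketch of a distinct count per key, mirroring
# rdd.groupByKey().mapValues(lambda vs: len(set(vs))) in Spark.
def distinct_count_per_key(pairs):
    grouped = defaultdict(list)              # groupByKey
    for key, value in pairs:
        grouped[key].append(value)
    return {k: len(set(vs)) for k, vs in grouped.items()}   # mapValues

result = distinct_count_per_key([("a", 1), ("a", 1), ("a", 2), ("b", 5)])
```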
?
--
*From:* Jerry Lam [chiling...@gmail.com]
*Sent:* Monday, July 20, 2015 8:27 AM
*To:* Jahagirdar, Madhu
*Cc:* user; d...@spark.apache.org
*Subject:* Re: Spark Mesos Dispatcher
Yes.
Sent from my iPhone
On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu
madhu.jahagir...@philips.com wrote
similar style off-heap memory
mgmt, more planning optimizations
*From:* Jerry Lam [mailto:chiling...@gmail.com chiling...@gmail.com]
*Sent:* Sunday, July 5, 2015 6:28 PM
*To:* Ted Yu
*Cc:* Slim Baltagi; user
*Subject:* Re: Benchmark results between Flink and Spark
Hi guys,
I just read
is in
comparison to Flink is one of the immediate questions I have. It would be
great if they made the benchmark software available somewhere for other
people to experiment with.
just my 2 cents,
Jerry
On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu yuzhih...@gmail.com wrote:
There was no mentioning
. However, I didn't use the
spark-csv package; I did that manually, so I cannot comment on
spark-csv.
HTH,
Jerry
On Thu, Feb 5, 2015 at 9:32 AM, Spico Florin spicoflo...@gmail.com wrote:
Hello!
I'm using spark-csv 2.10 with Java from the maven repository
groupIdcom.databricks/groupId
Hi Deep,
what do you mean by stuck?
Jerry
On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Is there any better operation than union? I am using union and the cluster
is getting stuck with a large data set.
Thank you
Hi Deep,
How do you know the cluster is not responsive because of Union?
Did you check the spark web console?
Best Regards,
Jerry
On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
The cluster hangs.
On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling
not affiliated
with Cloudera, but it seems they are the only ones who are very active in the
Spark project and provide a Hadoop distribution.
HTH,
Jerry
btw, who is Paco Nathan?
On Thu, Jan 22, 2015 at 10:03 AM, Babu, Prashanth
prashanth.b...@nttdata.com wrote:
Sudipta,
Use the Docker image
Hi guys,
Does this issue affect 1.2.0 only or all previous releases as well?
Best Regards,
Jerry
On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao xuelincao2...@gmail.com wrote:
Yes, the problem is, I've turned the flag on.
One possible reason for this is, the parquet file supports predicate
wasn't that bad at all. If it were not indexed,
I would expect it to take a much longer time.
Can IndexedRDD be sorted by keys as well?
Best Regards,
Jerry
On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash and...@andrewash.com wrote:
Hi Jem,
Linear time in scaling on the big table doesn't seem
objects.
I'm thinking of overriding the saveAsParquetFile method to allow me to
persist the Avro schema inside Parquet. Is this possible at all?
Best Regards,
Jerry
On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
I came across this
http://zenfractal.com
that. Is there
another API that allows me to do this?
Best Regards,
Jerry
.
Is this something already possible with Spark/Tachyon? If not, do you think
it is possible? Does anyone mind sharing their experience in capturing the
data lineage in a data processing pipeline?
Best Regards,
Jerry
)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
Is this supported?
Best Regards,
Jerry
Hi Sean and Madhu,
Thank you for the explanation. I really appreciate it.
Best Regards,
Jerry
On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote:
coalesce actually changes the number of partitions. Unless the
original RDD had just 1 partition, coalesce(1) will make an RDD
Michael,
Thanks. Is this still turned off in the released 1.2? Is it possible to
turn it on just to get an idea of how much of a difference it makes?
-Jerry
On 05/12/14 12:40 am, Michael Armbrust wrote:
I'll add that some of our data formats will actually infer this sort of
useful information
Hi Spark users,
I wonder if val resultRDD = RDDA.union(RDDB) will always have records in
RDDA before records in RDDB.
Also, will resultRDD.coalesce(1) change this ordering?
Best Regards,
Jerry
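Conceptually, an RDD union concatenates the two partition sequences, so RDDA's partitions precede RDDB's; a plain-list analogy of that intuition (whether coalesce(1) preserves the ordering is version-dependent, so treat this as a sketch, not a guarantee):

```python
# List analogy for the partition-ordering question above: union appends
# RDDB's partitions after RDDA's, so within-RDD order is preserved and
# RDDA's records come first when the partitions are read in order.
rdd_a_partitions = [[1, 2], [3]]      # hypothetical partitions of RDDA
rdd_b_partitions = [[4], [5, 6]]      # hypothetical partitions of RDDB
union_partitions = rdd_a_partitions + rdd_b_partitions
flattened = [x for part in union_partitions for x in part]
```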
Hi spark users,
Do you know how to read json files using Spark SQL that are LZO compressed?
I'm looking into sqlContext.jsonFile but I don't know how to configure it
to read lzo files.
Best Regards,
Jerry
Hi Ted,
Thanks for your help.
I'm able to read lzo files using sparkContext.newAPIHadoopFile, but I
couldn't do the same for sqlContext because sqlContext.jsonFile does not
provide a way to configure the input file format. Do you know if there are
APIs to do that?
Best Regards,
Jerry
On Wed
)
In some scenarios, Hadoop is faster because it saves one stage. Did I
do something wrong?
Best Regards,
Jerry
On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com
wrote:
You can create an RDD[String] using whatever method and pass that to
jsonRDD.
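Michael's suggestion, sketched in plain Python: build the strings however you like (e.g. from an LZO-aware input format) and parse them as JSON afterwards. In pySpark this would be sqlContext.jsonRDD applied to an RDD of strings; the sample records below are made up:

```python
import json

# Sketch of "create an RDD[String] using whatever method and pass that to
# jsonRDD": the decompression/reading step produces plain JSON strings,
# and the JSON parsing happens afterwards, independent of the file format.
raw_lines = ['{"user": "alice", "age": 30}', '{"user": "bob", "age": 25}']
records = [json.loads(line) for line in raw_lines]
users = [r["user"] for r in records]
```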
On Wed, Dec 17, 2014
.user_id == t2.user_id)
nor
t1.join(t2, on = Some('t1.user_id == t2.user_id))
work, or even compile. I could not find any examples of how to perform a
join using the DSL. Any pointers will be appreciated :)
Thanks
-Jerry
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero. But
sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700,
which I know is correct from the data. Is there something wrong with how
I'm using the DSL?
Thanks
On 17/12/14 11:13 am, Jerry Raj wrote
in which I can do that. The farthest I can get
is converting with items.toSeq. The type information I got back is:
scala> items.toSeq
res57: Seq[Any] = [WrappedArray([1,orange],[2,apple])]
Any suggestion?
Best Regards,
Jerry
Hi Mark,
Thank you for helping out.
The items I got back from Spark SQL have the following type information:
scala> items
res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])]
I tried to iterate the items as you suggested but no luck.
Best Regards,
Jerry
On Mon, Dec 15
with name = apple with
early stopping.
Is this possible? If yes, how does one implement the contains function?
Best Regards,
Jerry
java.lang.RuntimeException: [1.57] failure:
``('' expected but identifier myudf found
I also tried returning a List of Ints, but that did not work either. Is
there a way to write a UDF that returns a list?
Thanks
-Jerry
Hi,
If I create a SchemaRDD from a file that I know is sorted on a certain
field, is it possible to somehow pass that information on to Spark SQL
so that SQL queries referencing that field are optimized?
Thanks
-Jerry
.
Best Regards,
Jerry
Sent from my iPad
On Jul 24, 2014, at 6:53 AM, Sameer Sayyed sam.sayyed...@gmail.com wrote:
Hello All,
I am a new user of Spark; I am using cloudera-quickstart-vm-5.0.0-0-vmware to
execute the sample examples of Spark.
I am very sorry for the silly and basic question.
I am
. --jars A.jar,B.jar,C.jar not --jars A.jar, B.jar, C.jar
I'm just guessing, because when I used --jars I never had spaces in it.
HTH,
Jerry
On Wed, Jul 16, 2014 at 5:30 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi Team,
Now I've changed my code and am reading configuration from
Hi Rajesh,
can you describe your spark cluster setup? I saw localhost:2181 for
zookeeper.
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi Team,
Could you please help me to resolve the issue.
*Issue *: I'm not able to connect
. Uber-jar it and run it just like any other simple
Java program. If you still have connection issues, then at least you know
the problem is in the configuration.
HTH,
Jerry
On Tue, Jul 15, 2014 at 12:10 PM, Krishna Sankar ksanka...@gmail.com
wrote:
One vector to check is the HBase libraries
://issues.apache.org/jira/browse/SPARK-2483 seems to
address only HiveQL.
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 3:38 AM, anyweil wei...@gmail.com wrote:
Thank you so much for the information. Now I have merged the fix of #1411,
and
it seems the HiveQL works with:
SELECT name FROM people WHERE
of spark, but maybe not.
HTH,
Jerry
On Mon, Jul 14, 2014 at 3:09 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
You currently can't use SparkContext inside a Spark task, so in this case
you'd have to call some kind of local K-means library. One example you can
try to use is Weka (http
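Matei's advice is that the clustering must run as ordinary local code inside the task, with no SparkContext. As an illustration, here is a tiny self-contained 1-D k-means that a worker could call on its own partition; Weka (as suggested) or scikit-learn would be real choices, and this toy version is only a sketch:

```python
# Minimal local 1-D k-means: plain Python, no SparkContext, so it is safe
# to call from inside a Spark task (e.g. via mapPartitions).
def local_kmeans_1d(points, k=2, iters=10):
    # Seed centers by picking spread-out sorted points (naive init).
    centers = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Recompute each center as its cluster mean; keep old center if empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

centers = local_kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
```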
Then yarn application -kill appid should work. This is what I did 2 hours ago.
Sorry I cannot provide more help.
Sent from my iPhone
On 14 Jul, 2014, at 6:05 pm, hsy...@gmail.com hsy...@gmail.com wrote:
yarn-cluster
On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam chiling...@gmail.com wrote
or it is a bug in spark sql?
Best Regards,
Jerry
issue?
For the curious mind, the dataset is about 200-300GB and we are using 10
machines for this benchmark. Given that the environment is equal between the
two experiments, why is pure Spark faster than Spark SQL?
Best Regards,
Jerry
By the way, I also tried hql("select * from m").count. It is terribly slow
too.
On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote:
Hi Spark users and developers,
I'm doing some simple benchmarks with my team and we found out a potential
performance issue using Hive via
Hi Spark users,
Also, to put the performance issue into perspective, we also ran the query
on Hive. It took about 5 minutes to run.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote:
By the way, I also tried hql("select * from m").count. It is terribly
overhead, then there must be something additional
that Spark SQL adds to the overall overhead that Hive doesn't have.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com
wrote:
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote
[], (MetastoreRelation test, m, None), None
HiveTableScan [id#106], (MetastoreRelation test, s, Some(s)), None
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust mich...@databricks.com
wrote:
Hi Jerry,
Thanks for reporting this. It would be helpful if you could
+1 as well for being able to submit jobs programmatically without using a
shell script.
We also experienced issues submitting jobs programmatically without using
spark-submit. In fact, even in the Hadoop world, I rarely used hadoop jar
to submit jobs in the shell.
On Wed, Jul 9, 2014 at 9:47 AM,
that defines how my application
should look. In my humble opinion, using Spark as an embeddable library
rather than as the main framework and runtime is much easier.
On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:
+1 as well for being able to submit jobs programmatically without
the error you
saw. By reducing the number of cores, there are more CPU resources
available to a task, so the GC can finish before the error gets thrown.
HTH,
Jerry
On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike...@gmail.com wrote:
There is a difference from actual GC overhead, which can
Hi guys,
I ended up reserving a room at the Phoenix (Hotel:
http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel)
recommended by my friend who has been in SF.
According to Google, it takes 11min to walk to the conference which is not
too bad.
Hope this helps!
Jerry
Hi Spark users,
Do you guys plan to go the spark summit? Can you recommend any hotel near
the conference? I'm not familiar with the area.
Thanks!
Jerry
it with Spark,
I don't think you can get a lot of performance from scanning HBase, unless
you are talking about caching the results from HBase in Spark and reusing
them over and over.
HTH,
Jerry
On Wed, Apr 9, 2014 at 12:02 PM, David Quigley dquigle...@gmail.com wrote:
Hi all,
We are currently using hbase
Hi Shark,
Should I assume that Shark users should not use the Shark APIs, since there
is no documentation for them? If there is documentation, can you point it
out?
Best Regards,
Jerry
On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:
Hello everyone,
I have