Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Do you have specific reasons to use Ceph? I used Ceph before; I'm not too in love with it, especially when I was using the Ceph Object Gateway S3 API. There are some incompatibilities with the AWS S3 API. You really need to try it before making the commitment. Did you manage to install it? On

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Jerry Lam
Hi Amit, Have you looked at Amazon EMR? Most people using EMR use S3 for persistence (both as input and output of Spark jobs). Best Regards, Jerry Sent from my iPhone > On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote: > > > A lot of places in the documentation mention
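
For context, a minimal sketch of what checkpointing to S3 looks like in a streaming job (bucket name, s3n scheme, and batch interval are assumptions; credentials must already be in the Hadoop configuration):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    // hypothetical bucket; s3a also works on newer Hadoop builds
    ssc.checkpoint("s3n://my-bucket/spark/checkpoints")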

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens als

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
actually a bit. I created a ticket for this (SPARK-10731 <https://issues.apache.org/jira/browse/SPARK-10731>). Best Regards, Jerry On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > &

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
faster. Not to mention that if I do: df.rdd.take(1) // runs much faster. Is this expected? Why are head/first/take so slow for a DataFrame? Is it a bug in the optimizer, or did I do something wrong? Best Regards, Jerry
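
A hedged sketch of the comparison described in this thread (the input path is hypothetical):

    val df = sqlContext.read.parquet("/data/large-table") // hypothetical input
    df.head()       // reportedly very slow on 1.4/1.5 (see SPARK-10731)
    df.rdd.take(1)  // the much faster path mentioned above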

Re: Java vs. Scala for Spark

2015-09-08 Thread Jerry Lam
what language are the developers comfortable with? - what are the components in the system that will constrain the choice of language? Best Regards, Jerry On Tue, Sep 8, 2015 at 11:59 AM, Dean Wampler wrote: > It's true that Java 8 lambdas help. If you've read Learning Spark

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
Thanks! On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin wrote: > On Wed, Aug 26, 2015 at 2:03 PM, Jerry wrote: > > Assuming you're submitting the job from a terminal; when main() is called, > if I > > try to open a file locally, can I assume the machine is always the one I >

Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
a cluster. The file I'm opening is purely for the driver program and not something the worker nodes are going to read from. Thanks, Jerry

Spark return key value pair

2015-08-19 Thread Jerry OELoo
Hi. I want to parse a file and return key-value pairs with PySpark, but the result is strange to me. The test.sql is a big file and each line is a username and password with # between them. I use the mapper2 below to map the data, and in my understanding, i in words.take(10) should be a tuple, but the result is
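
The intended transformation, sketched in the Scala shell for illustration (the thread itself is about PySpark; the file name and # separator are taken from the description):

    val pairs = sc.textFile("test.sql").map { line =>
      val fields = line.split("#", 2) // username#password per line
      (fields(0), fields(1))          // a (username, password) tuple
    }
    pairs.take(10).foreach(println)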

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. wrote: > Refer this post > http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ > > Spark + Jupyter + Docker > > On 18 August 2015 at 21:29, Je

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
So from what I understand, those usually pull dependencies for a given project? I'm able to run the spark shell so I'd assume I have everything. What am I missing from the big picture and what directory do I run maven on? Thanks, Jerry On Tue, Aug 18, 2015 at 11:15 AM, Ted

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Guru, Thanks! Great to hear that someone tried it in production. How do you like it so far? Best Regards, Jerry On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani wrote: > Hi Jerry, > > Yes. I’ve seen customers using this in production for data science work. > I’m currently us

What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
does not exist import org.apache.spark.sql.hive.*; Let me know what I'm doing wrong. Thanks, Jerry

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
cannot do this. Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython already built years ago. It would be great if someone could educate me on the reason behind this. Best Regards, Jerry

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED] Best Regards, Jerry On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Howdy folks! > > I’m interested in hearing about what people

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
12) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Fri, Aug 14, 2015 at 1:39 PM, Jerry wrote: > Hi Salih, > Normally I do sort before performing that operation, but since I've been > trying to get this working for a week, I'm just loading something simple to > test if la

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
ther. Hopefully those links point me to something useful. Let me know if you can run the above code/ what you did different to get that code to run. Thanks, Jerry On Fri, Aug 14, 2015 at 1:23 PM, Salih Oztop wrote: > Hi Jerry, > This blog post is perfect for window function

Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
So it seems like DataFrames aren't going to give me a break and just work. Now it evaluates, but goes nuts if it runs into a null case OR doesn't know how to get the correct data type when I specify the default value as a string expression. Let me know if anyone has a workaround for this. PLEASE HELP ME
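
For reference, a minimal window-function sketch against the 1.4/1.5 API (DataFrame and column names are assumptions; on these versions lag/lead resolve only through a HiveContext):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag}

    val w = Window.orderBy("ts") // hypothetical ordering column
    df.select(
      col("value"),
      // the third argument is the default returned instead of null at the frame edge
      lag("value", 1, 0).over(w).as("prev")
    )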

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Jerry Lam
Great stuff Tim. This definitely will make Mesos users' lives easier. Sent from my iPad On 2015-08-12, at 11:52, Haripriya Ayyalasomayajula wrote: > Thanks Tim, Jerry. > > On Wed, Aug 12, 2015 at 1:18 AM, Tim Chen wrote: > Yes the options are not that configurable yet but I think i

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone > On 11 Aug, 2015, at 11:12 am, wrote: > > I confirm that it works, > > I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 > > Saif > > From: Ellafi, Saif A. > S

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPUs and the other executor with 6. So I don't think you can easily configure it without some tweaking of the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula wrote: > Hi Tim, > > Spark

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent. Thanks, Jerry On Mon, Aug 10,

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Thanks... looks like I've now hit that bug about HiveMetaStoreClient, as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located? Thanks, Jerry On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust wrote: >

Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
pointed to with -cp when starting the spark shell, so all I do is "Test.run(sc)" in shell. Let me know what to look for to debug this problem. I'm not sure where to look to solve this problem. Thanks, Jerry
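
Since lag/lead on these versions resolve only through the Hive-backed context, a quick smoke test from the shell (a sketch, assuming a Hive-enabled Spark build):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc) // fails if Spark was built without Hive support
    sqlContext.sql("SELECT 1").collect() // confirms the context actually works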

Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat, Is there a particular reason you don't use s3a? From my experience, s3a performs much better than the rest. I believe the inefficiency is from the implementation of the s3 interface. Best Regards, Jerry Sent from my iPhone > On 9 Aug, 2015, at 5:48 am, Akhil Da
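
A hedged sketch of reading through s3a (fs.s3a.* are the standard Hadoop configuration keys; bucket, path, and credentials are placeholders):

    // placeholder credentials; hadoop-aws must be on the classpath
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val logs = sc.textFile("s3a://my-bucket/path/to/files") // hypothetical bucket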

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
experience this problem before? Best Regards, Jerry

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
one is based on. Thank you for your help! Jerry On Thu, Jul 30, 2015 at 11:10 AM, Ted Yu wrote: > The files were dated 16-Jul-2015 > Looks like nightly build either was not published, or published at a > different location. > > You can download spark-1.5.0-SNAPSHOT.tgz and binary

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Regards, Jerry

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
mapping. The speed is 4x faster for the data without mapping, which means that the more columns a Parquet file has, the slower it is, even when only a specific column is needed. Does anyone have an explanation for this? I was expecting both of them to finish in approximately the same time. Best Regards, Jerry

Re: Partition parquet data by ENUM column

2015-07-23 Thread Jerry Lam
declared type (org.apache.parquet.io.api.Binary) does not match the schema found in file metadata. Column item is of type: FullTypeDescriptor(PrimitiveType: BINARY, OriginalType: ENUM) Valid types for this column are: null Is it because Spark does not recognize the ENUM type in Parquet? Best Regards, Jerry On Wed, Jul 22, 201

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on the Spark 1.5 snapshot? This is what I tried at the end. It seems it is a 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg wrote: > No, never really resolved the problem, except by increasing

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
using that dispatcher? > ------ > *From:* Jerry Lam [chiling...@gmail.com] > *Sent:* Monday, July 20, 2015 8:27 AM > *To:* Jahagirdar, Madhu > *Cc:* user; d...@spark.apache.org > *Subject:* Re: Spark Mesos Dispatcher > > Yes. > > Sent from my iPhone > > On 19 Jul,

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone > On 19 Jul, 2015, at 10:52 pm, "Jahagirdar, Madhu" > wrote: > > All, > > Can we run different version of Spark using the same Mesos Dispatcher. For > example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? > > Regards, > Madhu Jahagirdar > > Th

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
Hi Nikunj, Sorry, I totally misread your question. I think you need to first groupByKey (get all values of the same key together), then follow with mapValues (probably put the values into a set and then take its size, because you want a distinct count). HTH, Jerry Sent from my iPhone >
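
The suggested approach as a small sketch (sample data assumed):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))
    // group values per key, dedupe into a set, take its size
    val distinctPerKey = pairs.groupByKey().mapValues(_.toSet.size)
    distinctPerKey.collect() // Array((a,2), (b,1))
    // SQL equivalent: SELECT key, COUNT(DISTINCT value) FROM table GROUP BY key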

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B wrote: > Hello, > > How do I go about performing the equivalent of the following SQL clause in > Spark Streaming? I will be using this on a Windowed DStream. > > SELECT key, coun

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
-Bits-and-Bytes.html > > Probably if they re-ran the benchmarks with the 1.5/Tungsten line it would close the > gap a bit (or a lot), with Spark moving towards similar-style off-heap memory > mgmt and more planning optimizations > > > *From:* Jerry Lam [mailto:chiling...@gmail.com ] > *Sent:* Sun

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
in comparison to Flink is one of the immediate questions I have. It would be great if they made the benchmark software available somewhere for other people to experiment with. Just my 2 cents, Jerry On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu wrote: > There was no mentioning of the versions of Flink

Re: Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Jerry Lam
spark. However, I didn't use the spark-csv package; I did that manually, so I cannot comment on spark-csv. HTH, Jerry On Thu, Feb 5, 2015 at 9:32 AM, Spico Florin wrote: > Hello! > I'm using spark-csv 2.10 with Java from the maven repository > com.databricks > s

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, How do you know the cluster is not responsive because of "Union"? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan wrote: > The cluster hangs. > > On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam wrote: > >>

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan wrote: > Hi, > Is there any better operation than Union. I am using union and the cluster > is getting stuck with a large data set. > > Thank you >

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Jerry Lam
I'm not affiliated with Cloudera, but it seems they are the only ones who are very active in the Spark project and provide a Hadoop distribution. HTH, Jerry btw, who is Paco Nathan? On Thu, Jan 22, 2015 at 10:03 AM, Babu, Prashanth < prashanth.b...@nttdata.com> wrote: > Sudipta, > >

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-19 Thread Jerry Lam
Hi guys, Does this issue affect 1.2.0 only or all previous releases as well? Best Regards, Jerry On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao wrote: > > Yes, the problem is, I've turned the flag on. > > One possible reason for this is, the parquet file supports "pr

Re: IndexedRDD

2015-01-13 Thread Jerry Lam
formance wasn't that bad at all. If it were not indexed, I would expect it to take much longer. Can IndexedRDD be sorted by keys as well? Best Regards, Jerry On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash wrote: > Hi Jem, > > Linear time in scaling on the big table doesn't seem

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-09 Thread Jerry Lam
avro objects. I'm thinking of overriding the saveAsParquetFile method to allow me to persist the Avro schema inside Parquet. Is this possible at all? Best Regards, Jerry On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > I cam

Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Jerry Lam
to do that. Is there another API that allows me to do this? Best Regards, Jerry

Spark or Tachyon: capture data lineage

2015-01-02 Thread Jerry Lam
->E. Is this something already possible with Spark/Tachyon? If not, do you think it is possible? Does anyone mind sharing their experience capturing data lineage in a data processing pipeline? Best Regards, Jerry

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
on.toRdd$lzycompute(HiveContext.scala:382) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) Is this supported? Best Regards, Jerry

Re: Spark SQL with a sorted file

2014-12-22 Thread Jerry Raj
Michael, Thanks. Is this still turned off in the released 1.2? Is it possible to turn it on just to get an idea of how much of a difference it makes? -Jerry On 05/12/14 12:40 am, Michael Armbrust wrote: I'll add that some of our data formats will actual infer this sort of useful inform

Re: UNION two RDDs

2014-12-22 Thread Jerry Lam
Hi Sean and Madhu, Thank you for the explanation. I really appreciate it. Best Regards, Jerry On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen wrote: > coalesce actually changes the number of partitions. Unless the > original RDD had just 1 partition, coalesce(1) will make an RDD

UNION two RDDs

2014-12-18 Thread Jerry Lam
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry
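
Per the replies in this thread, union concatenates partitions (A's before B's) and coalesce(1) without a shuffle preserves that order; a small sketch:

    val rddA = sc.parallelize(Seq(1, 2, 3), 2)
    val rddB = sc.parallelize(Seq(4, 5, 6), 2)
    val resultRDD = rddA.union(rddB)  // A's partitions precede B's
    resultRDD.coalesce(1).collect()   // Array(1, 2, 3, 4, 5, 6): order kept, no shuffle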

Re: Spark SQL DSL for joins?

2014-12-18 Thread Jerry Raj
Thanks, that helped. And I needed SchemaRDD.as() to provide an alias for the RDD. -Jerry On 17/12/14 12:12 pm, Tobias Pfeiffer wrote: Jerry, On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj mailto:jerry@gmail.com>> wrote: Another problem with the DSL: t1.where('term == "
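
From memory of the 1.2-era SchemaRDD DSL, the alias-based join looks roughly like the sketch below; the method shapes and the "qualified.name".attr trick are recollections of that API rather than verified signatures:

    import org.apache.spark.sql.catalyst.plans.Inner

    // alias both sides with as(), then qualify columns through the aliases
    val joined = t1.as('a).join(t2.as('b), Inner,
      Some("a.user_id".attr === "b.user_id".attr))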

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
files) In some scenarios, Hadoop is faster because it is saving one stage. Did I do something wrong? Best Regards, Jerry On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust wrote: > > You can create an RDD[String] using whatever method and pass that to > jsonRDD. > > On Wed, Dec
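
The suggestion quoted above, sketched; LzoTextInputFormat is assumed to come from the hadoop-lzo package on the classpath, and the path is hypothetical:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    val lines = sc.newAPIHadoopFile("hdfs:///data/events.json.lzo",
        classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
      .map(_._2.toString)                    // decompressed lines as strings
    val events = sqlContext.jsonRDD(lines)   // jsonRDD accepts any RDD[String]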

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Ted, Thanks for your help. I'm able to read lzo files using sparkContext.newAPIHadoopFile but I couldn't do the same for sqlContext because sqlContext.jsonFile does not provide a way to configure the input file format. Do you know if there are some APIs to do that? Best Regards, Jerry

Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi spark users, Do you know how to read json files using Spark SQL that are LZO compressed? I'm looking into sqlContext.jsonFile but I don't know how to configure it to read lzo files. Best Regards, Jerry

Re: Spark SQL DSL for joins?

2014-12-16 Thread Jerry Raj
Another problem with the DSL: t1.where('term == "dmin").count() returns zero. But sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700, which I know is correct from the data. Is there something wrong with how I'm using the DSL? Thanks On
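
The likely culprit: Scala's == here is plain object equality, and a Symbol is never equal to a String, so the predicate collapses to the constant false before Catalyst ever sees it. The DSL's equality operator is ===; a sketch (assuming import sqlContext._ is in scope):

    t1.where('term === "dmin").count() // builds a Catalyst EqualTo expression
    // 'term == "dmin" evaluates to false in Scala itself, hence the zero count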

Spark SQL DSL for joins?

2014-12-16 Thread Jerry Raj
, on = 't1.user_id == t2.user_id) nor t1.join(t2, on = Some('t1.user_id == t2.user_id)) work, or even compile. I could not find any examples of how to perform a join using the DSL. Any pointers will be appreciated :) Thanks -Jerry

Re: Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi Mark, Thank you for helping out. The items I got back from Spark SQL has the type information as follows: scala> items res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])] I tried to iterate the items as you suggested but no luck. Best Regards, Jerry On Mon, Dec

Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
not find a method with which I can do that. The farthest I can get is to convert with items.toSeq. The type information I got back is: scala> items.toSeq res57: Seq[Any] = [WrappedArray([1,orange],[2,apple])] Any suggestion? Best Regards, Jerry
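
A hedged sketch of one way in: the element is a WrappedArray of Rows, so casting to Seq[Row] allows ordinary Scala iteration (field positions are assumed from the printed output):

    import org.apache.spark.sql.Row

    val inner = items(0).asInstanceOf[Seq[Row]]   // the WrappedArray([1,orange],[2,apple])
    inner.map(r => (r.getInt(0), r.getString(1))) // Seq((1,orange), (2,apple))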

Filtering nested data using Spark SQL

2014-12-10 Thread Jerry Lam
where(contain('item, "name", "apple")).collect(), the contain function will loop through the items matching "name" = "apple" with early stopping. Is this possible? If yes, how does one implement the contain function? Best Regards, Jerry

Spark SQL with a sorted file

2014-12-03 Thread Jerry Raj
Hi, If I create a SchemaRDD from a file that I know is sorted on a certain field, is it possible to somehow pass that information on to Spark SQL so that SQL queries referencing that field are optimized? Thanks -Jerry - To

Spark SQL UDF returning a list?

2014-12-03 Thread Jerry Raj
Exception in thread "main" java.lang.RuntimeException: [1.57] failure: ``('' expected but identifier myudf found I also tried returning a List of Ints; that did not work either. Is there a way to write a UDF that returns a list? Thanks -Jerry

Re: Spark SQL and Hive tables

2014-07-25 Thread Jerry Lam
using Spark SQL. It is a good starting point. Best Regards, Jerry On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak wrote: > Hi Michael, > Thanks. I am not creating HiveContext, I am creating SQLContext. I am > using CDH 5.1. Can you please let me know which conf/ directory you

Re: Spark SQL and Hive tables

2014-07-25 Thread Jerry Lam
Hi Sameer, Maybe this page will help you: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables Best Regards, Jerry On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak wrote: > Hi All, > I am trying to load data from Hive tables using Spark SQL. I am using > sp

Re: Starting with spark

2014-07-24 Thread Jerry
spark. Best Regards, Jerry Sent from my iPad > On Jul 24, 2014, at 6:53 AM, Sameer Sayyed wrote: > > Hello All, > > I am new user of spark, I am using cloudera-quickstart-vm-5.0.0-0-vmware for > execute sample examples of Spark. > I am very sorry for silly and basic qu

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Jerry Lam
it is executed in spark regardless of dialect although the execution might be different for the same query." Best Regards, Jerry On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust wrote: > hql and sql are just two different dialects for interacting with data. > After parsing is complete

Re: Need help on spark Hbase

2014-07-16 Thread Jerry Lam
i.e. --jars A.jar,B.jar,C.jar, not --jars A.jar, B.jar, C.jar. I'm just guessing, because when I used --jars I never had spaces in it. HTH, Jerry On Wed, Jul 16, 2014 at 5:30 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Team, > > Now i've changed my

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
that HiveContext needs a metastore and has more powerful SQL support borrowed from Hive. Can you shed some light on this when you get a minute? Thanks, Jerry On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust wrote: > No, that is why I included the link to SPARK-2096

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
<https://issues.apache.org/jira/browse/SPARK-2483> seems to address only HiveQL. Best Regards, Jerry On Tue, Jul 15, 2014 at 3:38 AM, anyweil wrote: > Thank you so much for the information, now i have merge the fix of #1411 > and > seems the HiveSQL works with: > SELECT name FROM

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
ated code. Uber-jar it and run it just like any other simple Java program. If you still have connection issues, then at least you know the problem is in the configuration. HTH, Jerry On Tue, Jul 15, 2014 at 12:10 PM, Krishna Sankar wrote: > One vector to check is the HBase libraries in the

Re: How to kill running spark yarn application

2014-07-15 Thread Jerry Lam
ApplicationMaster, the SparkSubmit will return "yarnAppState: KILLED" and then terminate itself. This is what happens to me using CDH 5.0.2. Which distribution of Hadoop are you using? On Tue, Jul 15, 2014 at 10:42 AM, Jerry Lam wrote: > when I use yarn application -kill, both Sp

Re: How to kill running spark yarn application

2014-07-15 Thread Jerry Lam
lication -kill" If you do jps You'll have a list > of SparkSubmit and ApplicationMaster > > After you use yarn applicaton -kill you only kill the SparkSubmit > > > > On Mon, Jul 14, 2014 at 4:29 PM, Jerry Lam wrote: > >> Then yarn application -kill appi

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, can you describe your spark cluster setup? I saw localhost:2181 for zookeeper. Best Regards, Jerry On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Team, > > Could you please help me to resolve the issue. > > *Iss

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Then yarn application -kill appid should work. This is what I did 2 hours ago. Sorry I cannot provide more help. Sent from my iPhone > On 14 Jul, 2014, at 6:05 pm, "hsy...@gmail.com" wrote: > > yarn-cluster > > >> On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Hi Siyuan, I wonder if you use --master yarn-cluster or yarn-client? Best Regards, Jerry On Mon, Jul 14, 2014 at 5:08 PM, hsy...@gmail.com wrote: > Hi all, > > A newbie question, I start a spark yarn application through spark-submit > > How do I kill this app. I can kill

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Jerry Lam
rn of spark, but maybe not. HTH, Jerry On Mon, Jul 14, 2014 at 3:09 PM, Matei Zaharia wrote: > You currently can't use SparkContext inside a Spark task, so in this case > you'd have to call some kind of local K-means library. One example you can > try to use is Weka (http:/

Re: Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
CartesianProduct HiveTableScan [], (MetastoreRelation test, m, None), None HiveTableScan [id#106], (MetastoreRelation test, s, Some(s)), None Best Regards, Jerry On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust wrote: > Hi Jerry, > > Thanks for reporting this. It would be helpf

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
overhead, then there must be something additional that SparkSQL adds to the overall overhead that Hive doesn't have. Best Regards, Jerry On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust wrote: > On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam wrote: > >> For the curious mind, the

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam wrote: > By the way, I also try hql("select * from m").count. It is terrib

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql("select * from m").count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam wrote: > Hi Spark users and developers, > > I'm doing some simple benchmarks with my team and we found out a potential > performance issue usi

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
using pure Spark. I wonder if anyone knows what causes the performance issue? For the curious mind, the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given the environment is equal between the two experiments, why is pure Spark faster than Spark SQL? Best Regards, Jerry

Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
on (s.id=m_id)").collect().foreach(println) It will work. Am I doing something wrong, or is it a bug in Spark SQL? Best Regards, Jerry

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
rk as a >>> separate service (just like MySQL and JDBC, for example). With spark-submit >>> I'm bound to Spark as a main framework that defines how my application >>> should look like. In my humble opinion, using Spark as embeddable library >>> rather than ma

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using a shell script. We also experienced issues submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop world, I rarely used "hadoop jar" to submit jobs from the shell. On Wed, Jul 9, 2014 at 9:47 AM,

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
and trigger the error you saw. By reducing the number of cores, there are more CPU resources available to a task, so the GC can finish before the error gets thrown. HTH, Jerry On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson wrote: > There is a difference from actual GC overhead, which

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Jerry Lam
Hi guys, I ended up reserving a room at the Phoenix (Hotel: http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel) recommended by my friend who has been in SF. According to Google, it takes 11min to walk to the conference which is not too bad. Hope this helps! Jerry

Spark Summit 2014 (Hotel suggestions)

2014-05-06 Thread Jerry Lam
Hi Spark users, Do you guys plan to go to the Spark Summit? Can you recommend any hotel near the conference? I'm not familiar with the area. Thanks! Jerry

Re: hbase scan performance

2014-04-09 Thread Jerry Lam
spark, I don't think you can get a lot of performance from scanning HBase unless you are talking about caching the results from HBase in Spark and reusing them over and over. HTH, Jerry On Wed, Apr 9, 2014 at 12:02 PM, David Quigley wrote: > Hi all, > > We are currently using hbase

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark, Should I assume that Shark users should not use the Shark APIs since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam wrote: > Hello everyone, > > I have successfully

Sample Project for using Shark API in Spark programs

2014-04-03 Thread Jerry Lam
SELECT * FROM users WHERE age < 20") scala> println(youngUsers.count) ... scala> val featureMatrix = youngUsers.map(extractFeatures(_)) scala> kmeans(featureMatrix) Is there a more complete sample code for starting a program using the Shark API in Spark? Thanks! Jerry
