Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-02-01 Thread Jerry Lam
> |[2A 01 03 02 02 0...| (truncated show() output of the kryo-encoded binary column) > scala> x.schema > res19: org.apache.spark.sql.types.StructType = StructType(StructField(value,BinaryType,true)) > On Wed, Feb 1, 2017 at 12:03 PM, Jerry Lam <chiling...@gmail.com> wrote:

using withWatermark on Dataset

2017-02-01 Thread Jerry Lam
Hi everyone, Anyone knows how to use withWatermark on Dataset? I have tried the following but hit this exception: dataset org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to "MyType" The code looks like the following: dataset .withWatermark("timestamp", "5
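A minimal sketch of how withWatermark is typically wired up on a typed Dataset: the watermark is applied while the data is still untyped, then the result is converted with .as[...]. This is an illustration only (MyType, the schema, and the source path are assumptions), not a confirmed fix for the cast exception above.

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    case class MyType(id: String, timestamp: Timestamp)

    val spark = SparkSession.builder().appName("watermark-sketch").getOrCreate()
    import spark.implicits._

    // File streaming sources need an explicit schema.
    val eventSchema = StructType(Seq(
      StructField("id", StringType),
      StructField("timestamp", TimestampType)))

    val events = spark.readStream
      .schema(eventSchema)
      .json("/path/to/events")                  // hypothetical source directory
      .withWatermark("timestamp", "5 minutes")  // event-time column plus allowed lateness
      .as[MyType]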

Re: Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-02-01 Thread Jerry Lam
o encoder. there is no > other work around that i know of. > > import org.apache.spark.sql.{ Encoder, Encoders } > implicit def setEncoder[X]: Encoder[Set[X]] = Encoders.kryo[Set[X]] > > On Tue, Jan 31, 2017 at 7:33 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi guys, &
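Spelling the quoted workaround out as a small self-contained sketch: with a kryo-based Encoder in implicit scope, the whole Set is stored as a single opaque binary column, which is why the schema in the other reply shows BinaryType. The session setup is illustrative.

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    implicit def setEncoder[X]: Encoder[Set[X]] = Encoders.kryo[Set[X]]

    val spark = SparkSession.builder().appName("kryo-set-encoder").getOrCreate()

    // The Set[(Long, Long)] is serialized by kryo into one binary value per row.
    val ds = spark.createDataset(Seq(Set((1L, 2L), (3L, 4L))))
    ds.printSchema()   // root |-- value: binary (nullable = true)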

Dataset Question: No Encoder found for Set[(scala.Long, scala.Long)]

2017-01-31 Thread Jerry Lam
Hi guys, I got an exception like the following, when I tried to implement a user defined aggregation function. Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Set[(scala.Long, scala.Long)] The Set[(Long, Long)] is a field in the case class which is the

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
> That doesn't mean this 0 value is literally included in the input. There's > no need for that. > > On Tue, Dec 6, 2016 at 4:24 AM Jerry Lam <chiling...@gmail.com> wrote: > >> Hi Sean, >> >> I'm referring to the paper (http://yifanhu.net/PUB/cf.pdf) Section 2:

Re: Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
air. (x^Ty)^2 + regularization. Do I misunderstand the paper? Best Regards, Jerry On Mon, Dec 5, 2016 at 2:43 PM, Sean Owen <so...@cloudera.com> wrote: > What are you referring to in what paper? implicit input would never > materialize 0s for missing values. > > On Tue, Dec

Collaborative Filtering Implicit Feedback Impl.

2016-12-05 Thread Jerry Lam
Hello spark users and developers, I read the paper from Yahoo about CF with implicit feedback and other papers using implicit feedbacks. Their implementation require to set the missing rating with 0. That is for unobserved ratings, the confidence for those is set to 1 (c=1). Therefore, the matrix
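For reference, a hedged sketch of the Spark ML side of this discussion: ALS with implicit preferences enabled takes only the observed (user, item, rating) rows, and the unobserved cells are never materialized as zeros in the input. Column names and parameter values are illustrative.

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setImplicitPrefs(true)
      .setAlpha(40.0)          // confidence scaling, c = 1 + alpha * r in the Hu et al. paper
      .setRank(10)
      .setRegParam(0.01)
      .setUserCol("user")
      .setItemCol("item")
      .setRatingCol("rating")

    val model = als.fit(interactions)   // `interactions` is a hypothetical DataFrame of observed events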

[Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Jerry Lam
Hi spark users and developers, Anyone tried to pass in an Array[Double] as an input to the UDF? I tried it for many hours reading spark sql code but I still couldn't figure out a way to do this. Best Regards, Jerry
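One common answer to this question: Spark hands an array column to a Scala UDF as a Seq (a WrappedArray), not an Array, so declaring the parameter as Seq[Double] usually does the trick. A minimal sketch with hypothetical names:

    import org.apache.spark.sql.functions.udf

    // The array column arrives as Seq[Double]; guard against nulls defensively.
    val arraySum = udf { (xs: Seq[Double]) => if (xs == null) 0.0 else xs.sum }

    val withTotal = df.withColumn("features_sum", arraySum(df("features")))   // `df` and "features" are hypothetical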

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
answer though. As I said, this is just the tip of the iceberg. I have experienced worse than this. For example, you might think renaming fields will work but in some cases, it still returns wrong results. Best Regards, Jerry On Tue, Mar 29, 2016 at 7:38 AM, Jerry Lam <chiling...@gmail.

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
uot;label")).explain >> == Physical Plan == >> TungstenProject [label#15] >> SortMergeJoin [id#14], [id#30] >> TungstenSort [id#14 ASC], false, 0 >>TungstenExchange hashpartitioning(id#14) >> TungstenProject [_1#12 AS id#14,_2#13 AS label#15] >&

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
uot;d2.label").select($"d1.label") > > > Hope this helps some. > > Best regards, > Sunitha. > > On Mar 28, 2016, at 2:34 PM, Jerry Lam <chiling...@gmail.com> wrote: > > Hi spark users and developers, > > I'm using spark 1.5.1 (I have no
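Expanding the aliasing tip quoted above into a fuller, hedged sketch: give each side of the self-join its own alias and qualify every column that is referenced afterwards. Column names are illustrative.

    import org.apache.spark.sql.functions.col

    val d1 = df.as("d1")   // `df` is the hypothetical source DataFrame
    val d2 = df.as("d2")

    d1.join(d2, col("d1.id") === col("d2.id"))
      .select(col("d1.label"), col("d2.label").as("other_label"))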

[Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Hi spark users and developers, I'm using spark 1.5.1 (I have no choice because this is what we used). I ran into some very unexpected behaviour when I did some join operations lately. I cannot post my actual code here and the following code is not for practical reasons but it should demonstrate

Pattern Matching over a Sequence of rows using Spark

2016-02-28 Thread Jerry Lam
Hi spark users and developers, Anyone has experience developing pattern matching over a sequence of rows using Spark? I'm talking about functionality similar to matchpath in Hive or match_recognize in Oracle DB. It is used for path analysis on clickstream data. If you know of any libraries that

Re: Streaming with broadcast joins

2016-02-19 Thread Jerry Lam
Hi guys, I also encountered the broadcast dataframe issue, not for streaming jobs but for a regular dataframe join. In my case, the executors died, probably due to OOM, though I don't think it should use that much memory. Anyway, I'm going to craft an example and send it here to see if it is a bug or something
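For context, a minimal sketch of the explicit broadcast-join hint being discussed, which avoids relying on the spark.sql.autoBroadcastJoinThreshold size estimate. DataFrame and column names are hypothetical.

    import org.apache.spark.sql.functions.broadcast

    // Mark the small side explicitly; the large side is not shuffled for the join.
    val joined = largeDF.join(broadcast(smallDF), Seq("key"))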

Re: Convert Iterable to RDD

2016-02-12 Thread Jerry Lam
Not sure if I understand your problem well but why don't you create the file locally and then upload to hdfs? Sent from my iPhone > On 12 Feb, 2016, at 9:09 am, "seb.arzt" wrote: > > I have an Iterator of several million elements, which unfortunately won't fit > into the

Re: Spark 1.5.2 memory error

2016-02-03 Thread Jerry Lam
Hi guys, I was processing 300GB of data with a lot of joins today. I have a combination of RDD->Dataframe->RDD due to legacy code. I had memory issues at the beginning. After fine-tuning those configurations that many already suggested above, it works with 0 tasks failed. I think it is fair to say any

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
I think spark dataframe supports more than just SQL. It is more like a pandas dataframe. (I rarely use the SQL feature.) There are a lot of novelties in dataframe so I think it is quite optimized for many tasks. The in-memory data structure is very memory efficient. I just change a very slow RDD

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Michael, Is there a section in the spark documentation demonstrate how to serialize arbitrary objects in Dataframe? The last time I did was using some User Defined Type (copy from VectorUDT). Best Regards, Jerry On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
Hi Spark users and developers, anyone knows how to union two RDDs without the overhead of it? say rdd1.union(rdd2).saveTextFile(..) This requires a stage to union the 2 rdds before saveAsTextFile (2 stages). Is there a way to skip the union step but have the contents of the two rdds save to the
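One hedged alternative that sidesteps the union entirely: write each RDD to its own sub-directory under a common parent and read the parent (or a glob) downstream, so no extra step runs before the save. Paths are hypothetical.

    rdd1.saveAsTextFile("hdfs:///data/output/part-a")
    rdd2.saveAsTextFile("hdfs:///data/output/part-b")

    // Downstream consumers read both in one pass.
    val combined = sc.textFile("hdfs:///data/output/part-*")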

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
on't mind it. But looks like I have to > convert all my existing transformation to things like > df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly > and error prone in my opinion. > > On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com> wrote:

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Jerry Lam
Hi Mao, Can you try --jars to include those jars? Best Regards, Jerry Sent from my iPhone > On 26 Jan, 2016, at 7:02 pm, Mao Geng wrote: > > Hi there, > > I am trying to run Spark on Mesos using a Docker image as executor, as > mentioned >

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-19 Thread Jerry Lam
Is cacheTable similar to asTempTable before? Sent from my iPhone > On 19 Jan, 2016, at 4:18 am, George Sigletos wrote: > > Thanks Kevin for your reply. > > I was suspecting the same thing as well, although it still does not make much > sense to me why would you need
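For reference, a minimal sketch of the two calls being compared in this thread, assuming a Spark 1.x sqlContext and a DataFrame df; the table name is illustrative.

    // Cache by table name:
    df.registerTempTable("events")
    sqlContext.cacheTable("events")

    // Versus caching the DataFrame reference directly:
    df.cache()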

[Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Hi spark users and developers, what do you do if you want the from_unixtime function in spark sql to return the timezone you want instead of the system timezone? Best Regards, Jerry

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
if this is the only way out of the box. Thanks! Jerry On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov <apivova...@gmail.com> wrote: > Look at > to_utc_timestamp > > from_utc_timestamp > On Jan 18, 2016 9:39 AM, "Jerry Lam" <chiling...@gmail.com> wrot
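A hedged sketch of the out-of-the-box route pointed out above, assuming the cluster's default timezone is UTC; the column name and target zone are illustrative.

    import org.apache.spark.sql.functions.{from_unixtime, from_utc_timestamp}

    // from_unixtime renders the epoch seconds in the JVM default timezone (assumed UTC here);
    // from_utc_timestamp then shifts the result into the desired zone.
    val withLocalTs = df.withColumn(
      "ts_toronto",
      from_utc_timestamp(from_unixtime(df("epoch_seconds")), "America/Toronto"))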

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Jerry Lam
ome workarounds: > https://issues.apache.org/jira/browse/SPARK-12546 > >> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <chiling...@gmail.com> wrote: >> Hi Arkadiusz, >> >> the partitionBy is not designed to have many distinct value the last time I >> u

Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Jerry Lam
Can you save it to parquet with the vector in one field? Sent from my iPhone > On 15 Jan, 2016, at 7:33 pm, Andy Davidson > wrote: > > Are you using 1.6.0 or an older version? > > I think I remember something in 1.5.1 saying save was not implemented in >

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Jerry Lam
Hi Arkadiusz, partitionBy is not designed to handle many distinct values, the last time I used it. If you search in the mailing list, I think there are a couple of people who also faced similar issues. For example, in my case, it won't work over a million distinct user ids. It will require a lot of
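One mitigation sometimes suggested around this limitation (it does not remove it): cluster the data by the partition column before writing, so each task keeps only a few writers open at a time. This needs Spark 1.6+ for column-based repartition; the column and path are hypothetical.

    df.repartition(df("user_id"))
      .write
      .partitionBy("user_id")
      .parquet("hdfs:///warehouse/events_by_user")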

[Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi spark users and developers, I wonder if the following observed behaviour is expected. I'm writing dataframe to parquet into s3. I'm using append mode when I'm writing to it. Since I'm using org.apache.spark.sql.parquet.DirectParquetOutputCommitter as the
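For context, a hedged sketch of how this committer was typically switched on in Spark 1.5/1.6 (the follow-up reply notes Spark ignores it when it would be unsafe, for example with append mode or speculation):

    sqlContext.setConf(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")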

Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
ich...@databricks.com> wrote: > There can be dataloss when you are using the DirectOutputCommitter and > speculation is turned on, so we disable it automatically. > > On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi spark users and developers

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
not secured at all... > >> On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev >> <kudryavtsev.konstan...@gmail.com> wrote: >> >> thanks Jerry, it works! >> really appreciate your help >> >> Thank you, >> Konstantin Kudryavtse

Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
om> wrote: > > Hi Jerry, > > what you suggested looks to be working (I put hdfs-site.xml into > $SPARK_HOME/conf folder), but could you shed some light on how it can be > federated per user? > Thanks in advance! > > Thank you, > Konstantin Kudryavtsev > >>

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, Can you define those properties in hdfs-site.xml and make sure it is visible in the class path when you spark-submit? It looks like a conf sourcing issue to me. Cheers, Sent from my iPhone > On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev >

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
? > > Thank you, > Konstantin Kudryavtsev > >> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam <chiling...@gmail.com> wrote: >> Hi Kostiantyn, >> >> Can you define those properties in hdfs-site.xml and make sure it is visible >> in the class path when y

Re: ideal number of executors per machine

2015-12-15 Thread Jerry Lam
Hi Veljko, I usually ask the following questions: “how much memory per task?” then “how many cpu per task?” then I calculate based on the memory and cpu requirements per task. You might be surprised (maybe not you, but at least I am :) ) that many OOM issues are actually because of this. Best

Re: spark-ec2 vs. EMR

2015-12-02 Thread Jerry Lam
er. Fixed only last week. Not sure if fixed in all branches >>> >>> 10. I think Amazon will include spark-jobserver to EMR soon. >>> >>> 11. You do not need to be aws expert to start EMR cluster. Users can use >>> EMR web ui to start cluster to run some

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Jerry Lam
an 1.5.0, you miss some fixes such as SPARK-9952 > > Cheers > >> On Sat, Nov 14, 2015 at 6:35 PM, Jerry Lam <chiling...@gmail.com> wrote: >> Hi spark users and developers, >> >> Have anyone experience the slow startup of a job when it contains a stage >&

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
park zzhang$ more conf/hive-site.xml > <property> > <name>hive.metastore.uris</name> > <value>thrift://zzhang-yarn11:9083</value> > </property> > HW11188:spark zzhang$ > > By the way, I don’t know whether there is any caveat for this workaround. &

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
t that cannot be done by HiveContext? > > Thanks. > > Zhan Zhang > > On Nov 6, 2015, at 10:43 AM, Jerry Lam <chiling...@gmail.com > <mailto:chiling...@gmail.com>> wrote: > >> What is interesting is that pyspark shell works fine with multiple session

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
; I don't see config of skipping the above call. > > FYI > > On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <chiling...@gmail.com > <mailto:chiling...@gmail.com>> wrote: > Hi spark users and developers, > > Is it possible to disable HiveContext from being insta

[Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi spark users and developers, Is it possible to disable HiveContext from being instantiated when using spark-shell? I got the following errors when I have more than one session starts. Since I don't use HiveContext, it would be great if I can have more than 1 spark-shell start at the same time.

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
onfig of skipping the above call. > > FYI > > On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <chiling...@gmail.com > <mailto:chiling...@gmail.com>> wrote: > Hi spark users and developers, > > Is it possible to disable HiveContext from being instantiated when usin

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jerry Lam
Does Qubole use Yarn or Mesos for resource management? Sent from my iPhone > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan > wrote: > > Qubole

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build interactive data analysis platform because the interactive session could be long and might not use Spark for the entire session. It is very wasteful of resources if we used the coarse-grained mode because it keeps resource for the entire session. Therefore,

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
r. the max-date is likely > to be faster though. > > On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi Koert, >> >> You should be able to see if it requires scanning the whole data by >> "explain" the query. The physica

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
xposed. > > On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> it seems to work but i am not sure if its not scanning the whole dataset. >> let me dig into tasks a a bit >> >> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chili

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
of the physical plan, you can navigate the actual execution in the web UI to see how much data is actually read to satisfy this request. I hope it only requires a few bytes for few dates. Best Regards, Jerry On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam <chiling...@gmail.com> wrote: > I agreed the

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest dates without scanning the whole dataset. I haven't tried it myself. It would be great if you can report back if this actually
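A hedged way to check the pruning question raised here: print the physical plan for either formulation and compare the bytes actually read in the web UI, as mentioned in the surrounding replies. The table name is hypothetical.

    sqlContext.sql("SELECT max(date) AS dt FROM partitioned_table").explain(true)

    sqlContext.sql(
      "SELECT DISTINCT date AS dt FROM partitioned_table ORDER BY dt DESC LIMIT 1"
    ).explain(true)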

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, Did you read the email I sent few days ago. There are more issues with partitionBy down the road: https://www.mail-archive.com/user@spark.apache.org/msg39512.html Best Regards, Jerry > On Oct 28, 2015, at 4:52 PM,

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
> Jerry, > > Thank you for the note. It sounds like you were able to get further than I > have been - any insight? Just a Spark 1.4.1 vs Spark 1.5? > > Regards, > > Bryan Jeffrey > From: Jerry Lam > Sent: 10/28/2015 6:29 PM > To: Bryan Jeffrey > Cc: S

[Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Hi Spark users and developers, Anyone experiences issues in setting hadoop configurations after SparkContext is initialized? I'm using Spark 1.5.1. I'm trying to use s3a which requires access and secret key set into hadoop configuration. I tried to set the properties in the hadoop configuration
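A hedged sketch of the usual way around this: pass the S3A keys through SparkConf with the spark.hadoop.* prefix before the SparkContext is created, so they reach the Hadoop configuration on the executors as well. The environment-variable names are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3a-conf-sketch")
      .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val sc = new SparkContext(conf)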

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Marcelo Vanzin <van...@cloudera.com> wrote: > On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam <chiling...@gmail.com> wrote: > > Anyone experiences issues in setting hadoop configurations after > > SparkContext is initialized? I'm using Spark 1.5.1. > > > > I'm

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
r with that > code. > > On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam <chiling...@gmail.com> wrote: > > Hi Marcelo, > > > > Thanks for the advice. I understand that we could set the configurations > > before creating SparkContext. My question is > > SparkCon

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
nterfaces.scala:561) >> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) >> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:31) >> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) >> org.apache.spark.sql.DataFrameRea

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
rtitions might be using tons of driver memory via the > OutputCommitCoordinator's bookkeeping data structures. > > On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi spark guys, >> >> I think I hit the same issue SPARK-8890 >> https:/

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
) org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:31) org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267) On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Josh, > >

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
parameters to make it more memory efficient? Best Regards, Jerry On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi guys, > > After waiting for a day, it actually causes OOM on the spark driver. I > configure the driver to have 6GB. Note that I didn't c

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
million files. Not sure why it OOM the driver after the job is marked _SUCCESS in the output folder. Best Regards, Jerry On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Spark users and developers, > > Does anyone encounter any issue when a spark SQL job

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
Hi Spark users and developers, I read the ticket [SPARK-8578] (Should ignore user defined output committer when appending data) which ignore DirectParquetOutputCommitter if append mode is selected. The logic was that it is unsafe to use because it is not possible to revert a failed job in append

Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
Hi Spark users and developers, I have a dataframe with the following schema (Spark 1.5.1): StructType(StructField(type,StringType,true), StructField(timestamp,LongType,false)) After I save the dataframe in parquet and read it back, I get the following schema:

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Jerry Lam
Can you try setting SPARK_USER at the driver? It is used to impersonate users at the executor. So if you have a user set up for launching spark jobs on the executor machines, simply set SPARK_USER to that user name. There is another configuration that prevents jobs from being launched with

Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it but I doubt there is r-tree indexing support in the near future as spark is not a database. You might have better luck looking at databases with spatial indexing support out of the box. Cheers Sent from my iPad On 2015-10-18, at 17:16, Mustafa Elbehery

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot but I don’t agree. It is like saying a pandas dataframe is the sole data structure for analyzing data in python. Can a Pandas dataframe replace a Numpy array? The answer is simply no from an efficiency perspective for some computations. Unless there is a computer

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
This is the ticket SPARK-10951 <https://issues.apache.org/jira/browse/SPARK-10951> Cheers~ On Tue, Oct 6, 2015 at 11:33 AM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Burak, > > Thank you for the tip. > Unfortunately it does not work. It throws: > > java.net.

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
. Could you please try using > the --repositories flag and provide the address: > `$ spark-submit --packages my:awesome:package --repositories > s3n://$aws_ak:$aws_sak@bucket/path/to/repo` > > If that doesn't work, could you please file a JIRA? > > Best, > Burak > > >

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Jerry Lam
Philip, the guy is trying to help you. Calling him silly is a bit too far. He might assume your problem is IO bound which might not be the case. If you need only 4 cores per job no matter what there is little advantage to use spark in my opinion because you can easily do this with just a worker

spark-submit --packages using different resolver

2015-10-01 Thread Jerry Lam
Hi spark users and developers, I'm trying to use spark-submit --packages against private s3 repository. With sbt, I'm using fm-sbt-s3-resolver with proper aws s3 credentials. I wonder how can I add this resolver into spark-submit such that --packages can resolve dependencies from private repo?

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Jerry Lam
h...@gmail.com> wrote: > >> See this thread: >> >> http://search-hadoop.com/m/q3RTttmiYDqGc202 >> >> And: >> >> >> http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources >> >> On Sep 28, 2015, at 8:22 PM, Jerry Lam <c

Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Jerry Lam
Hi spark users and developers, I'm trying to learn how to implement a custom data source for Spark SQL. Is there documentation that I can use as a reference? I'm not sure exactly what needs to be extended/implemented. A general workflow will be greatly helpful! Best Regards, Jerry

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-26 Thread Jerry Lam
mes: > > import org.apache.spark.sql.functions._ > table("purchases").select(explode(df("purchase_items")).as("item")) > > > > On Fri, Sep 25, 2015 at 4:21 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi sparkers, >> >> Anyone know
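Expanding the quoted snippet into a slightly fuller sketch: the explode function covers the LATERAL VIEW EXPLODE use case without a HiveContext. Table and column names are hypothetical.

    import org.apache.spark.sql.functions.explode

    val purchases = sqlContext.table("purchases")
    val items = purchases.select(
      purchases("customer_id"),
      explode(purchases("purchase_items")).as("item"))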

Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Jerry Lam
Hi sparkers, Anyone knows how to do LATERAL VIEW EXPLODE without HiveContext? I don't want to start up a metastore and derby just because I need LATERAL VIEW EXPLODE. I have been trying but I always get the exception like this: Name: java.lang.RuntimeException Message: [1.68] failure: ``union''

Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Do you have specific reasons to use Ceph? I used Ceph before; I'm not too in love with it, especially when I was using the Ceph Object Gateway S3 API. There are some incompatibilities with the aws s3 api. You really need to try it before making the commitment. Did you manage to install it? On

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
> Best, > Sun. > > -- > fightf...@163.com > > > *From:* Jerry Lam <chiling...@gmail.com> > *Date:* 2015-09-23 09:37 > *To:* fightf...@163.com > *CC:* user <user@spark.apache.org> > *Subject:* Re: Spark standalone/Mesos on t

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Jerry Lam
Hi Amit, Have you looked at Amazon EMR? Most people using EMR use s3 for persistency (both as input and output of spark jobs). Best Regards, Jerry Sent from my iPhone > On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote: > > > A lot of places in the documentation mention using

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15 minutes. I

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
0:01 AM, Yin Huai <yh...@databricks.com> wrote: > >> Hi Jerry, >> >> Looks like it is a Python-specific issue. Can you create a JIRA? >> >> Thanks, >> >> Yin >> >> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm n

Re: Java vs. Scala for Spark

2015-09-08 Thread Jerry Lam
Hi Bryan, I would choose a language based on the requirements. It does not make sense if you have a lot of dependencies that are java-based components and interoperability between java and scala is not always obvious. I agree with the above comments that Java is much more verbose than Scala in

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
using this for one of my projects on a cluster as well. Also, here is a blog that describes how to configure this. http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ Guru Medasani gdm...@gmail.com On Aug 18, 2015, at 8:35 AM, Jerry Lam chiling...@gmail.com

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. prabsma...@gmail.com wrote: Refer this post http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ Spark + Jupyter + Docker On 18 August 2015 at 21:29, Jerry Lam

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi spark users and developers, Did anyone have IPython Notebook (Jupyter) deployed in production that uses Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks so we

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
Hi Nick, I forgot to mention in the survey that ganglia is never installed properly for some reason. I have this exception every time I launch the cluster: Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Jerry Lam
as an example framework for Mesos - that's how I know it. It is surprising to see that the options provided by mesos in this case are fewer. Tweaking the source code, haven't done it yet but I would love to see what options could be there! On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam chiling

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPU and the other executor with 6. So I don't think you can easily configure it without some tweaking at the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula aharipriy...@gmail.com

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone On 11 Aug, 2015, at 11:12 am, saif.a.ell...@wellsfargo.com wrote: I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From:

Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat, Is there a particular reason you don't use s3a? From my experience, s3a performs much better than the rest. I believe the inefficiency is from the implementation of the s3 interface. Best Regards, Jerry Sent from my iPhone On 9 Aug, 2015, at 5:48 am, Akhil Das

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the spark cluster (1.4.1) with ephemeral hdfs (using hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral hdfs. It does not matter what I tried to do, there is no data locality at

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/? I downloaded the build but I couldn't find the information related to it. Thank you! Best Regards,

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
for the commits made on Jul 16th. There may be other ways of determining the latest commit. Cheers On Thu, Jul 30, 2015 at 7:39 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
Hi spark users and developers, I have been trying to understand how Spark SQL works with Parquet for a couple of days. There is an unexpected performance problem with column pruning. Here is a dummy example: The parquet file has the 3 fields: |-- customer_id: string (nullable =

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on Spark 1.5 snapshot? This is what I tried at the end. It seems it is 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote: No, never really resolved the problem, except by

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B nb.nos...@gmail.com wrote: Hello, How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream.
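If the goal is literally the number of distinct values per key (as the subject suggests), a minimal sketch of the COUNT(DISTINCT ...) variant; names are illustrative.

    import org.apache.spark.sql.functions.countDistinct

    // DataFrame form:
    df.groupBy("key").agg(countDistinct("value").as("distinct_values"))

    // SQL form:
    // SELECT key, COUNT(DISTINCT value) FROM table GROUP BY key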

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: All, Can we run different version of Spark using the same Mesos Dispatcher. For example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? Regards, Madhu

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
= rdd.reduceByKey((c1, c2) -> c1+c2 ); List<Tuple2<String, Integer>> output = rdd2.collect(); for (Tuple2<?,?> tuple : output) { System.out.println( tuple._1() + ":" + tuple._2() ); } } On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam chiling...@gmail.com wrote: You mean

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
? -- *From:* Jerry Lam [chiling...@gmail.com] *Sent:* Monday, July 20, 2015 8:27 AM *To:* Jahagirdar, Madhu *Cc:* user; d...@spark.apache.org *Subject:* Re: Spark Mesos Dispatcher Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
similar style off-heap memory mgmt, more planning optimizations *From:* Jerry Lam [mailto:chiling...@gmail.com chiling...@gmail.com] *Sent:* Sunday, July 5, 2015 6:28 PM *To:* Ted Yu *Cc:* Slim Baltagi; user *Subject:* Re: Benchmark results between Flink and Spark Hi guys, I just read

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
Hi guys, I just read the paper too. There is not much information regarding why Flink is faster than Spark for data science type workloads in the benchmark. It is very difficult to generalize the conclusion of a benchmark from my point of view. How much experience the author has with Spark is

Re: Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Jerry Lam
Hi Florin, I might be wrong but timestamp looks like a keyword in SQL that the engine gets confused with. If it is a column name of your table, you might want to change it. ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) I'm constantly working with CSV files with spark.

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union. I am using union and the cluster is getting stuck with a large data set. Thank you

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, How do you know the cluster is not responsive because of Union? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Jerry Lam
Hi Sudipta, I would also like to suggest to ask this question in Cloudera mailing list since you have HDFS, MAPREDUCE and Yarn requirements. Spark can work with HDFS and YARN but it is more like a client to those clusters. Cloudera can provide services to answer your question more clearly. I'm
