Re: SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Shivaram Venkataraman
I created https://issues.apache.org/jira/browse/SPARK-8085 for this. On Wed, Jun 3, 2015 at 12:12 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hmm - the schema=myschema doesn't seem to work in SparkR from my simple local test. I'm filing a JIRA for this now On Wed, Jun 3,

Re: SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Shivaram Venkataraman
Hmm - the schema=myschema doesn't seem to work in SparkR from my simple local test. I'm filing a JIRA for this now On Wed, Jun 3, 2015 at 11:04 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: Neat, thanks for the info Hossein. My use case was just to reset the schema for a CSV

Cleaning up workers' directories automatically

2015-06-03 Thread atalay
Hi everyone, every time our data arrives and new updates occur in our cluster, an undesirable file is created in the workers' directories. In order to clean up automatically, I changed the variable value Spark (Standalone) Client Advanced Configuration Snippet (Safety Valve) for

Ivy support in Spark vs. sbt

2015-06-03 Thread Marcelo Vanzin
Hey all, I've been bitten by something really weird lately and I'm starting to think it's related to the Ivy support we have in Spark, and running unit tests that use that code. The first thing that happens is that after running unit tests, sometimes my sbt builds start failing with an error saying

Re: MLlib: Anybody working on hierarchical topic models like HLDA?

2015-06-03 Thread Joseph Bradley
Hi Lorenz, I'm not aware of people working on hierarchical topic models for MLlib, but that would be cool to see. Hopefully other devs know more! Glad that the current LDA is helpful! Joseph On Wed, Jun 3, 2015 at 6:43 AM, Lorenz Fischer lorenz.fisc...@gmail.com wrote: Hi All I'm working

Re: GraphX: New graph operator

2015-06-03 Thread Reynold Xin
Hi Tarek, I took a quick look at the materials you shared. It actually seems to me it'd be super easy to express a graph as two DataFrames: one for edges (srcid, dstid, and other edge attributes) and one for vertices (vid, and other vertex attributes). Then intersection is just

Stop Master and Slaves without SSH

2015-06-03 Thread Devl Devel
Hey All, start-slaves.sh and stop-slaves.sh make use of SSH to connect to remote clusters. Are there alternative methods to do this without SSH? For example using: ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT is fine but there is no way to kill the Worker without

Re: GraphX: New graph operator

2015-06-03 Thread Tarek Auel
Hi, The graph is already there (GraphX) and has the two RDDs you described. My question is meant to gauge whether the community thinks it would be a benefit or not. If yes, I would like to contribute it to GraphX (either as part of GraphOpts or as an external library). An

SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Eskilson,Aleksander
It appears that casting columns remains a bit of a trick in Spark’s DataFrames. This is an issue because tools like spark-csv will set column types to String by default and will not attempt to infer types. Although spark-csv supports specifying types for columns in its options, it’s not clear
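To make the problem in this message concrete, here is a minimal sketch in plain Python (standard library only, not Spark or spark-csv). It shows how a default CSV read yields every field as a string, and how an explicit per-column schema can cast the values afterwards. The sample data, the `SCHEMA` mapping, and the `read_typed` helper are hypothetical names introduced purely for illustration.

```python
import csv
import io

# Hypothetical sample data; a plain CSV read returns every field as a string.
RAW = "id,price,name\n1,9.50,apple\n2,12.00,pear\n"

# Hypothetical schema: column name -> function used to cast that column.
SCHEMA = {"id": int, "price": float, "name": str}


def read_typed(text, schema):
    """Parse CSV text and cast each column with the function listed in schema.

    Columns that are missing from the schema are left as strings.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: schema.get(col, str)(val) for col, val in row.items()}
        for row in reader
    ]


# Without a schema, all values stay strings.
untyped = list(csv.DictReader(io.StringIO(RAW)))

# With the schema applied, numeric columns get numeric types.
rows = read_typed(RAW, SCHEMA)
```

The same idea, declaring the column types once and applying them at load time rather than casting columns one by one later, is what an explicit schema option in a CSV reader would offer.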

MLlib: Anybody working on hierarchical topic models like HLDA?

2015-06-03 Thread Lorenz Fischer
Hi All, I'm working on a project in which I use the current LDA implementation that was contributed by Databricks' Joseph Bradley et al. for the recent 1.3.0 release (thanks guys!). While this is great, my project requires several levels of topics, as I would like to let users drill down

RE: MLlib: Anybody working on hierarchical topic models like HLDA?

2015-06-03 Thread Yang, Yuhao
Hi Lorenz, I’m trying to build a prototype of HDP for a customer based on the current LDA implementations. An initial version will probably be ready within the next one or two weeks. I’ll share it and hopefully we can join forces. One concern is that I’m not sure how widely it will be used

Re: MLlib: Anybody working on hierarchical topic models like HLDA?

2015-06-03 Thread DB Tsai
Is your HDP implementation based on distributed Gibbs sampling? Thanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Jun 3, 2015 at 8:13 PM, Yang, Yuhao yuhao.y...@intel.com wrote: Hi Lorenz, I’m trying to build a

Re: SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Eskilson,Aleksander
Hi Shivaram, As far as databricks’ spark-csv API shows, it seems there’s currently only support for explicit definition of column types. In JSON we have nice typed fields, but in CSVs, all bets are off. In the SQL version of the API, it appears you specify the column types when you create the

Re: SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Eskilson,Aleksander
Neat, thanks for the info Hossein. My use case was just to reset the schema for a CSV dataset, but if either a. I can specify it at load, or b. it will be inferred in the future, I’ll likely not need to cast columns, much less reset the whole schema. I’ll still file a JIRA for the capability,

Re: SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Shivaram Venkataraman
cc Hossein who knows more about the spark-csv options You are right that the default CSV reader options end up creating all columns as string. I know that the JSON reader infers the schema [1] but I don't know if the CSV reader has any options to do that. Regarding the SparkR syntax to cast

Re: SparkR DataFrame Column Casts esp. from CSV Files

2015-06-03 Thread Reynold Xin
I think Hossein does want to implement schema inference for CSV -- then it'd be easy. Another way you can do this is to use an R data frame or data.table to read the CSV files in, and then convert the result into a Spark DataFrame. It is not going to be scalable, but it could work. On Wed, Jun 3, 2015 at 10:49 AM,