Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Piotr Smoliński
Hi Peter, The blank file _SUCCESS indicates a properly finished output operation. What is the topology of your application? I presume you write to a local filesystem and have more than one worker machine. In such a case Spark will write the result files for each partition (in the worker which holds
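[Editor's note: a minimal sketch of what a healthy write looks like, assuming Spark 2.0, a SparkSession named spark, and a path every node can reach. Spark produces one part-* file per partition plus the empty _SUCCESS marker, so the data lives in the part files, not in _SUCCESS.]

    // Minimal sketch (assumes a shared path visible to all nodes).
    val df = spark.range(10).toDF("n")
    df.coalesce(1)                    // optional: collapse to a single part file
      .write
      .mode("overwrite")
      .csv("/shared/path/to/foo")     // on a cluster, prefer HDFS/S3 over a local dir
    // /shared/path/to/foo now holds part-*.csv file(s) alongside _SUCCESS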

MLlib Documentation Update Needed

2016-09-25 Thread Tobi Bosede
The loss function here for logistic regression is confusing. It seems to imply that Spark uses only -1 and 1 as class labels. However, it actually uses 0 and 1, per the very inconspicuous note quoted below (under Classification)
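[Editor's note: a hedged sketch of the label convention in question — MLlib's LabeledPoint-based classifiers expect 0.0/1.0 labels, even though the documented loss is written in terms of y in {-1, +1}. The data below is made up.]

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),   // class label 0.0, not -1
      LabeledPoint(1.0, Vectors.dense(2.0, 1.0))))
    val model = new LogisticRegressionWithLBFGS().run(training)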

Re: spark stream based deduplication

2016-09-25 Thread backtrack5
Thank you @markcitizen. What I want to achieve is, say for example: my historic rdd has (Hash1, recordid1) (Hash2, recordid2), and in the new stream I have the following: (Hash3, recordid3) (Hash1, recordid5). In the above scenario, 1) for recordid5 I should get that recordid5 is a duplicate of
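[Editor's note: one possible shape for this, sketched under the assumption that both the historic RDD and the stream carry (hash, recordId) pairs; `stream` and the sample data are placeholders. Joining each micro-batch against the historic RDD surfaces any previously seen hash with both record ids.]

    val historic = sc.parallelize(Seq(("Hash1", "recordid1"), ("Hash2", "recordid2")))

    stream.foreachRDD { batch =>            // batch: RDD[(String, String)]
      batch.join(historic).foreach { case (hash, (newId, oldId)) =>
        println(s"$newId is a duplicate of $oldId")   // printed in executor logs
      }
    }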

Re: udf forces usage of Row for complex types?

2016-09-25 Thread Bedrytski Aliaksandr
Hi Koert, these case classes you are talking about should be serializable to be efficient (with Kryo or just plain Java serialization). A DataFrame is not simply a collection of Rows (which are serializable by default); it also contains a schema with a different type for each column. This way any
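[Editor's note: a small illustration of the point above, assuming spark.implicits._ is in scope — a DataFrame carries a typed schema alongside its rows.]

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    df.printSchema()
    // root
    //  |-- id: integer (nullable = false)
    //  |-- name: string (nullable = true)
    df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))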

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-25 Thread Mario Ds Briggs
Hi Daniel, can you give it a try in IBM's Analytics for Spark? The fix has been in for a week now. Thanks, Mario

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both df.write.csv("/path/to/foo") and df.write.format("com.databricks.spark.csv").save("/path/to/foo") result in a *blank* file called "_SUCCESS" under /path/to/foo. My df has stuff in it; I tried this with both my real df and a quick df constructed from literals. Why isn't it writing

Re: Spark MLlib ALS algorithm

2016-09-25 Thread Roshani Nagmote
Hello, I ran the ALS algorithm on 30 c4.8xlarge machines (60 GB RAM each) with the Netflix dataset (1.4 GB; Users: 480,189, Items: 17,770, Ratings: 99M). *Command* I run: /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
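[Editor's note: for readers following along, a rough sketch of the MLlib ALS call behind such a run; the input path, rank, iterations, and lambda below are placeholders, not the poster's actual settings.]

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("hdfs:///data/netflix/ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }
    val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)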

Extract timestamp from Kafka message

2016-09-25 Thread Kevin Tran
Hi Everyone, Does anyone know how we could extract the timestamp from a Kafka message in Spark Streaming? JavaPairInputDStream<String, String> messagesDStream = KafkaUtils.createDirectStream( ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
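[Editor's note: a hedged sketch — the 0.8-style direct stream above carries no message timestamps at all; per-record timestamps exist only from Kafka 0.10 on, where the spark-streaming-kafka-0-10 integration (experimental in Spark 2.0) exposes them on ConsumerRecord. Broker address, topic, and group id below are placeholders.]

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "ts-example")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

    stream.map(record => (record.timestamp, record.value)).print()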

Re: ArrayType support in Spark SQL

2016-09-25 Thread Koert Kuipers
not pretty, but this works: import org.apache.spark.sql.functions.udf df.withColumn("array", udf({ () => Seq(1, 2, 3) }).apply()) On Sun, Sep 25, 2016 at 6:13 PM, Jason White wrote: > It seems that `functions.lit` doesn't support ArrayTypes. To reproduce: > >
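[Editor's note: a possible alternative sketch, building the column from individual literals with functions.array instead of a udf.]

    import org.apache.spark.sql.functions.{array, lit}
    df.withColumn("array", array(lit(1), lit(2), lit(3)))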

Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-25 Thread Jianshi
Dear all: I am trying out the newly released Structured Streaming feature in Spark 2.0. I use Structured Streaming to perform windowing by event time. I can print the result to the console. I would like to write the result to a Cassandra database through the foreach sink option. I am
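[Editor's note: a hedged sketch of the foreach sink shape — a ForeachWriter runs on the executors, so it must not touch the SparkContext (calling sc.parallelize inside it is what triggers "Task not serializable"); write each row through a connection opened per partition instead. The streaming DataFrame name and the connection comments are placeholders.]

    import org.apache.spark.sql.{ForeachWriter, Row}

    val query = windowedCounts.writeStream
      .outputMode("complete")
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, version: Long): Boolean = {
          // open a Cassandra session here, once per partition
          true
        }
        def process(row: Row): Unit = {
          // insert `row` via the session; do NOT call sc.parallelize here
        }
        def close(errorOrNull: Throwable): Unit = {
          // close the session
        }
      })
      .start()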

udf forces usage of Row for complex types?

2016-09-25 Thread Koert Kuipers
after having gotten used to having case classes represent complex structures in Datasets, I am surprised to find that when I work with udfs in DataFrames no such magic exists, and I have to fall back to manipulating Row objects, which is error-prone and somewhat ugly. For example: case class
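[Editor's note: a minimal sketch of the complaint, assuming spark.implicits._ is in scope — a struct-typed column arrives inside a udf as a generic Row rather than as the case class it was built from.]

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf

    case class Point(x: Int, y: Int)
    val df = Seq((1, Point(2, 3))).toDF("id", "p")

    val sumXY = udf { (p: Row) => p.getAs[Int]("x") + p.getAs[Int]("y") }  // Row, not Point
    df.withColumn("sum", sumXY($"p")).show()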

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi, How did you install Spark 1.6? Removing it is usually as simple as rm -rf $SPARK_1.6_HOME, but it really depends on how you installed it in the first place. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread vr spark
Yes, I have both Spark 1.6 and Spark 2.0. I unset the SPARK_HOME environment variable and pointed spark-submit to 2.0. It's working now. How do I uninstall/remove Spark 1.6 from my Mac? Thanks On Sun, Sep 25, 2016 at 4:28 AM, Jacek Laskowski wrote: > Hi, > > Can you execute

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Marco Mistroni
Hi, in fact I have just found some written notes in my code; see if these docs help you (it will work with any Spark version, not only 1.3.0): https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes hth On Sun, Sep 25, 2016 at 1:25 PM, Marco Mistroni

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Marco Mistroni
Hi, I must admit I had issues as well in finding a sample that does that (hopefully the Spark folks can add more examples, or someone on the list can post sample code?). Hopefully you can reuse the sample below. So, you start from an RDD of doubles (myRdd) ## make a row val toRddOfRows =
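[Editor's note: a completed sketch of that approach; the column name and data are illustrative. Map each double to a Row, then pair the RDD of Rows with an explicit schema.]

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val myRdd = sc.parallelize(Seq(1.0, 2.0, 3.0))
    val toRddOfRows = myRdd.map(d => Row(d))          // ## make a row
    val schema = StructType(StructField("value", DoubleType, nullable = false) :: Nil)
    val df = sqlContext.createDataFrame(toRddOfRows, schema)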

In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Dan Bikle
Hello World, I am familiar with Python and I am learning Spark-Scala. I want to build a DataFrame which has the structure described by this syntax: *// Prepare training data from a list of (label, features) tuples. val training = spark.createDataFrame(Seq( (1.1, Vectors.dense(1.1, 0.1)),
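[Editor's note: a minimal sketch of that syntax in full, assuming Spark 2.0 with a SparkSession named spark and the new ml Vectors.]

    import org.apache.spark.ml.linalg.Vectors

    val training = spark.createDataFrame(Seq(
      (1.1, Vectors.dense(1.1, 0.1)),
      (0.0, Vectors.dense(1.0, -1.0))
    )).toDF("label", "features")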

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi, Can you execute run-example SparkPi with your Spark installation? Also, see the logs: 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI' on port 4041. You've got two Spark

Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Jörn Franke
Use a tool like Flume and/or Oozie to reliably download files over HTTP and do error handling (e.g. request timeouts). Afterwards, process the data with Spark. > On 25 Sep 2016, at 10:27, Dan Bikle wrote: > > hello spark-world, > > How to use Spark-Scala to download a CSV

Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Marco Mistroni
Hi, not sure if spark-csv supports the http:// scheme you use to load data from the web. I just tried this and got an exception: scala> val df = sqlContext.read. | format("com.databricks.spark.csv"). | option("inferSchema", "true"). |
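[Editor's note: one hedged workaround sketch, since spark-csv reads from Hadoop-supported filesystems rather than http:// — fetch the file on the driver first (the URL and paths below are placeholders), write it somewhere Spark can read, then load it with spark-csv.]

    import java.io.PrintWriter
    import scala.io.Source

    val csv = Source.fromURL("http://example.com/data.csv").mkString
    new PrintWriter("/tmp/data.csv") { write(csv); close() }

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/tmp/data.csv")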

How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Dan Bikle
hello spark-world, How can I use Spark-Scala to download a CSV file from the web and load the file into a spark-csv DataFrame? Currently I depend on curl in a shell command to get my CSV file. Here is the syntax I want to enhance: */* fb_csv.scala This script should load FB

Re: Open source Spark based projects

2016-09-25 Thread Simon Chan
PredictionIO is an open-source machine learning server project based on Spark - http://predictionio.incubator.apache.org/ Simon On Fri, Sep 23, 2016 at 12:46 PM, manasdebashiskar wrote: > check out spark packages https://spark-packages.org/ and you will find few >

Left Join Yields Results And Not Results

2016-09-25 Thread Aaron Jackson
Hi, I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very odd situation. I have two dataframes, base and updated. The "updated" dataframe contains a constrained subset of data from "base" that I wish to exclude. Something like this: updated = base.where(base.X =
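[Editor's note: for the exclusion itself, a hedged sketch — in Scala, though the thread uses pyspark, and "key" is a placeholder for the real join column. On 1.6 the usual pattern is a left outer join followed by a null filter, keeping only the rows of base with no match in updated.]

    val remaining = base
      .join(updated, base("key") === updated("key"), "left_outer")
      .where(updated("key").isNull)
      .select(base("*"))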

spark-submit failing but job running from scala ide

2016-09-25 Thread vr spark
Hi, I have this simple Scala app which works fine when I run it as a Scala application from the Scala IDE for Eclipse. But when I export it as a jar and run it from spark-submit I get the error below. Please suggest. *bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar* 16/09/24