Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
Looks like you can do it with the dense_rank window function. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html I set up some basic records and it seems like it did the right thing. Now time to throw 50 TB and 100 Spark nodes at this problem and see what happens :) On Sat,
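
A minimal Scala sketch of the window-function approach from that blog post (the `posts` DataFrame and its `user_id`/`score` columns below are placeholders, not Kevin's actual data):

    import spark.implicits._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, dense_rank}

    // Toy stand-in for the real table; the column names are assumptions
    val posts = Seq((1L, 10.0), (1L, 7.5), (2L, 3.2)).toDF("user_id", "score")

    // Rank rows within each user_id by score, descending
    val w = Window.partitionBy("user_id").orderBy(col("score").desc)

    // dense_rank keeps ties; use row_number instead if you want exactly 100 rows per group
    val top100PerUser = posts
      .withColumn("rank", dense_rank().over(w))
      .filter(col("rank") <= 100)
      .drop("rank")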

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
Ah... it might, actually. I'll have to mess around with that. On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley wrote: > Would `topByKey` help? > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 > > Best, >

Re: SparkR error: reference is ambiguous.

2016-09-10 Thread Felix Cheung
Could you provide more information on how df in your example is created? Also, please include the output from printSchema(df). This example works: > c <- createDataFrame(cars) > c SparkDataFrame[speed:double, dist:double] > c$speed <- c$dist*0 > c SparkDataFrame[speed:double, dist:double] >

RE: Graphhopper/routing in Spark

2016-09-10 Thread Kane O'Donnell
It’s not obvious to me either = ) I was thinking more along the lines of retrieving the graph from HDFS/Spark, merging it together (which should be taken care of by sc.textFile) and then giving it to GraphHopper. Alternatively I guess I could just put the graph locally on every worker node. Or

Re: questions about using dapply

2016-09-10 Thread Felix Cheung
You might need MARGIN capitalized; this example works, though: c <- as.DataFrame(cars) # rename the columns to c1, c2 c <- selectExpr(c, "speed as c1", "dist as c2") cols_in <- dapplyCollect(c, function(x) {apply(x[, paste("c", 1:2, sep = "")], MARGIN=2, FUN = function(y){ y %in% c(61, 99)})}) #

Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
How are you calling dirs()? What would be x? Is dat a SparkDataFrame? With SparkR, i in dat[i, 4] should be a logical expression for the row, e.g. df[df$age %in% c(19, 30), 1:2] On Sat, Sep 10, 2016 at 11:02 AM -0700, "Bene" >

Re: Assign values to existing column in SparkR

2016-09-10 Thread Felix Cheung
If you are to set a column to 0 (essentially remove and replace the existing one) you would need to put a column on the right hand side: > df <- as.DataFrame(iris) > head(df) Sepal_Length Sepal_Width Petal_Length Petal_Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Karl Higley
Would `topByKey` help? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 Best, Karl On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote: > I'm trying to figure out a way to group by and return the top
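
For reference, a hedged sketch of what a `topByKey` call looks like; the (user_id, score) pair RDD below is made up for illustration, not taken from the thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    // Hypothetical (user_id, score) pairs
    val pairs: RDD[(Long, Double)] = sc.parallelize(Seq((1L, 0.9), (1L, 0.5), (2L, 0.7)))

    // Keeps the 100 largest values per key, using the implicit Ordering[Double]
    val top100: RDD[(Long, Array[Double])] = pairs.topByKey(100)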

Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
I'm trying to figure out a way to group by and return the top 100 records in that group. Something like: SELECT TOP(100, user_id) FROM posts GROUP BY user_id; But I can't really figure out the best way to do this... There are FIRST and LAST aggregate functions, but they only return one column.

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
Good points. Unfortunately Data Pump, exp, and imp use a binary format for export and import that cannot be used to import data into HDFS in a suitable way. One can use what is known as a flat-file shell script to get data out tab- or comma-separated, etc. ROWNUM is a pseudocolumn (not a real column) that is

Re: Why is Spark getting Kafka data out from port 2181 ?

2016-09-10 Thread Cody Koeninger
Are you using the receiver-based stream? On Sep 10, 2016 15:45, "Eric Ho" wrote: > I notice that some Spark programs would contact something like 'zoo1:2181' > when trying to suck data out of Kafka. > > Does the Kafka data actually get transported over this port? > >

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-10 Thread Holden Karau
I don't think a 2.0 uber jar will play nicely on a 1.5 standalone cluster. On Saturday, September 10, 2016, Felix Cheung wrote: > You should be able to get it to work with 2.0 as an uber jar. > > What type of cluster are you running on? YARN? And what distribution? >

Re: Spark_JDBC_Partitions

2016-09-10 Thread ayan guha
In Oracle something called ROWNUM is present in every row. You can create an even distribution using that pseudocolumn. If it is one-time work, try using Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump format. On 11 Sep 2016 01:59, "Mich Talebzadeh"

Re: Problems with Reading CSV Files - Java - Eclipse

2016-09-10 Thread ayan guha
It is failing with a class-not-found error for the Parquet output committer. Maybe a build issue? On 11 Sep 2016 01:50, "Irfan Kabli" wrote: > Dear Spark community members, > > I am trying to read a CSV file in Spark using the Java API. > > My setup is as follows: > > Windows

Why is Spark getting Kafka data out from port 2181 ?

2016-09-10 Thread Eric Ho
I notice that some Spark programs would contact something like 'zoo1:2181' when trying to suck data out of Kafka. Does the Kafka data actually get transported over this port? Typically ZooKeeper uses 2218 for SSL. If my Spark program were to use 2218, how would I specify ZooKeeper-specific

Not sure why Filter on DStream doesn't get invoked?

2016-09-10 Thread kant kodali
Hi All, I am trying to simplify how to frame my question, so below is my code. I see that BAR gets printed but not FOO and I am not sure why. My batch interval is 1 second (something I pass in when I create the Spark context). Any idea? I have a bunch of events and I want to store the number of events

Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Bene
Here are a few code snippets: The data frame looks like this: kfz zeit datum latitude longitude 1 # 2015-02-09 07:18:33 2015-02-09 52.35234 9.881965 2 # 2015-02-09 07:18:34 2015-02-09 52.35233 9.881970 3 #

Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
Could you include the code snippets you are running? On Sat, Sep 10, 2016 at 1:44 AM -0700, "Bene" > wrote: Hi, I am having a problem with the SparkR API. I need to subset a distributed data frame so I can extract single values from

Re: iterating over DataFrame Partitions sequentially

2016-09-10 Thread sujeet jog
Thank you Jakob, it works for me. On Sat, Sep 10, 2016 at 12:54 AM, Jakob Odersky wrote: > > Hi Jakob, I have a DataFrame with around 10 partitions; based on the exact > content of each partition I want to batch-load some other data from the DB; I > cannot operate in parallel due to
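
One way to walk a DataFrame's partitions strictly one at a time from the driver is to submit a job per partition index; this is only a sketch (not necessarily what Jakob proposed), with a toy DataFrame standing in for the real one:

    import org.apache.spark.sql.Row

    val df = spark.range(0, 100).toDF("id")   // placeholder DataFrame
    val rdd = df.rdd

    (0 until rdd.getNumPartitions).foreach { i =>
      // Run a job on partition i only and bring its rows back to the driver
      val rows: Array[Row] = sc.runJob(rdd, (it: Iterator[Row]) => it.toArray, Seq(i)).head
      // ... inspect `rows` and batch-load the matching data from the DB here ...
    }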

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-10 Thread Felix Cheung
You should be able to get it to work with 2.0 as an uber jar. What type of cluster are you running on? YARN? And what distribution? On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" > wrote: You really shouldn't mix different versions of Spark

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
As you are reading each file as a single record via wholeTextFiles and flattening them into records, I think you can just drop the few lines you want. Can you just drop or skip a few lines from reader.readAll().map(...)? Also, are you sure this is an issue in Spark rather than in the external CSV library? Do

Re: Spark CSV skip lines

2016-09-10 Thread Selvam Raman
Hi, I saw these two options already; anyway, thanks for the idea. I am using wholeTextFiles to read my data (because there are \n characters in the middle of it) and OpenCSV to parse the data. In my data the first two lines are just some report. How can I eliminate them? *How to eliminate the first two lines after reading

Re: Reading a TSV file

2016-09-10 Thread Hyukjin Kwon
Yep. Also, sep is preferred and has higher precedence than delimiter. 2016-09-11 0:44 GMT+09:00 Jacek Laskowski : > Hi Muhammad, > > sep or delimiter should both work fine. > > Regards, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache

Re: Spark CSV output

2016-09-10 Thread Hyukjin Kwon
Have you tried the quote-related options (e.g. `quote` or `quoteMode`; see https://github.com/databricks/spark-csv/blob/master/README.md#features)? On 11 Sep 2016 12:22 a.m., "ayan guha" wrote: > CSV
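
A hedged write-side sketch using those spark-csv options (the DataFrame and output path are placeholders); `quoteMode` is listed in the README linked above:

    import spark.implicits._

    val df = Seq(("An account was successfully created", 1)).toDF("msg", "id")

    // "NONE" disables quoting entirely -- only safe if fields never contain the delimiter
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("quoteMode", "NONE")
      .save("hdfs:///tmp/output_without_quotes")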

Spark using my last job resources and jar files

2016-09-10 Thread nagaraj
I am new to Spark. I am trying to run a Spark job in client mode and it works well if I use the same path for the jar and other resource files. After killing the running application with the YARN command, if the Spark job is resubmitted with updated jar and file locations, the job still uses my old path. After

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
Creating an Oracle sequence for a table of 200 million rows is not going to be that easy without changing the schema. It is possible to export that table from prod, import it into DEV/TEST, and create the sequence there. If it is a FACT table then the foreign keys from the dimension tables will be

Problems with Reading CSV Files - Java - Eclipse

2016-09-10 Thread Irfan Kabli
Dear Spark community members, I am trying to read a CSV file in Spark using the Java API. My setup is as follows: > Windows machine > Local deployment > Spark 2.0.0 > Eclipse Scala IDE 4.0.0 I am trying to read from the local file system with the following code: (Using the Java Perspective)

Re: Reading a TSV file

2016-09-10 Thread Jacek Laskowski
Hi Muhammad, sep or delimiter should both work fine. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Sep 10, 2016 at 10:42 AM, Muhammad Asif Abbasi

Re: Spark_JDBC_Partitions

2016-09-10 Thread Takeshi Yamamuro
Hi, yes, Spark does not have the same functionality as Sqoop. I think one simple solution is to assign unique IDs to the rows of the Oracle table yourself. Thoughts? // maropu On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh wrote: > Strange that an Oracle table of

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
Strange that an Oracle table of 200 million plus rows has not been partitioned. What matters here is to have parallel JDBC connections to Oracle, each reading a subset of the table. Any parallel fetch is going to be better than reading with one connection from Oracle. Surely among 404 columns

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2016-09-10 Thread Takeshi Yamamuro
Hi, this seems to be a known issue; see https://issues.apache.org/jira/browse/SPARK-4105 // maropu On Sat, Sep 10, 2016 at 11:08 PM, 齐忠 wrote: > Hi all > > when using the default compression (snappy), I get an error when Spark is doing a shuffle > > 16/09/09 08:33:15 ERROR executor.Executor:
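
One hedged way to check whether snappy itself is the trigger (not an official fix for SPARK-4105) is to switch the shuffle I/O compression codec for a test run:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-codec-test")
      .set("spark.io.compression.codec", "lz4")  // switch away from snappy for this run
    val sc = new SparkContext(conf)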

Re: Spark CSV output

2016-09-10 Thread ayan guha
The CSV standard uses quotes to identify multiline output. On 11 Sep 2016 01:04, "KhajaAsmath Mohammed" wrote: > Hi, > > I am using the package com.databricks.spark.csv to save the dataframe > output to an HDFS path. I am able to write the output but there are quotation marks > before

Spark_JDBC_Partitions

2016-09-10 Thread Ajay Chander
Hello everyone, my goal is to use Spark SQL to load a huge amount of data from Oracle to HDFS. *Table in Oracle:* 1) No primary key. 2) Has 404 columns. 3) Has 200,800,000 rows. *Spark SQL:* In my Spark SQL I want to read the data into n partitions in parallel, for which I need to
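
For context, a sketch of the two stock DataFrameReader.jdbc variants for parallel reads; the URL, credentials, table and column names below are placeholders, not from this thread, and ORA_HASH over ROWID is only one possible bucketing expression (verify it behaves as expected on your Oracle version):

    import java.util.Properties

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // (a) If some numeric column is roughly evenly distributed, let Spark split on it
    val dfA = spark.read.jdbc(url, "MY_TABLE",
      columnName = "SOME_NUMERIC_COL", lowerBound = 1L, upperBound = 200800000L,
      numPartitions = 20, connectionProperties = props)

    // (b) With no such column, hand Spark one WHERE clause per partition
    val numParts = 20
    val predicates = (0 until numParts).map(i => s"MOD(ORA_HASH(ROWID), $numParts) = $i").toArray
    val dfB = spark.read.jdbc(url, "MY_TABLE", predicates, props)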

Spark CSV output

2016-09-10 Thread KhajaAsmath Mohammed
Hi, I am using the package com.databricks.spark.csv to save the dataframe output to an HDFS path. I am able to write the output but there are quotation marks before and after the end of the string. Did anyone resolve this when using the com.databricks.spark.csv package? "An account was successfully

Re: Reading a TSV file

2016-09-10 Thread Muhammad Asif Abbasi
Thanks for responding. I believe I had already given a Scala example as part of my code in the second email. Just looked at the DataFrameReader code, and it appears the following would work in Java: Dataset pricePaidDS = spark.read().option("sep","\t").csv(fileName); Thanks for your help.

java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2016-09-10 Thread 齐忠
Hi all, when using the default compression (snappy), I get an error when Spark is doing a shuffle: 16/09/09 08:33:15 ERROR executor.Executor: Managed memory leak detected; size = 89817648 bytes, TID = 20912 16/09/09 08:33:15 ERROR executor.Executor: Exception in task 63.2 in stage 1.0 (TID 20912)

Re: Reading a TSV file

2016-09-10 Thread Mich Talebzadeh
Read it with header false, not true: val df2 = spark.read.option("header", false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv") Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Reading a TSV file

2016-09-10 Thread Mich Talebzadeh
This should be pretty straightforward. You can create a tab-separated file from any database table and bulk-copy it out (MSSQL, Sybase, etc.): bcp scratchpad..nw_10124772 out nw_10124772.tsv -c -t '\t' -Usa -A16384 Password: Starting copy... 441 rows copied. more nw_10124772.tsv Mar 22 2011

Re: Reading a TSV file

2016-09-10 Thread Mich Talebzadeh
Thanks Jacek. The old stuff with Databricks: scala> val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868") df: org.apache.spark.sql.DataFrame = [Transaction Date: string, Transaction

Re: Reading a TSV file

2016-09-10 Thread Muhammad Asif Abbasi
Thanks for the quick response. Let me rephrase the question, which I admit wasn't clearly worded and was perhaps too abstract. To read a CSV I am using the following code (works perfectly): SparkSession spark = SparkSession.builder() .master("local") .appName("Reading a CSV")

Re: Reading a TSV file

2016-09-10 Thread Jacek Laskowski
Hi Mich, CSV is now one of the 7 formats supported by SQL in 2.0. No need to use "com.databricks.spark.csv" and --packages. A mere format("csv") or csv(path: String) would do it. The options are the same. p.s. Yup, when I read TSV I thought about time series data that I believe got its own file

Re: spark streaming kafka connector questions

2016-09-10 Thread Cody Koeninger
Hard to say without seeing the code, but if you do multiple actions on an RDD without caching, the RDD will be computed multiple times. On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote: After some investigation, the problem I see is likely caused by a filter and union of the
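
A generic illustration of that point (toy data, not the original code): without cache(), each action below would recompute `parsed`, and with a Kafka-backed RDD that means fetching the events again.

    // Placeholders: a toy RDD of strings and a trivial parser
    case class Rec(id: Int, isValid: Boolean)
    val rawRdd = sc.parallelize(Seq("1,true", "2,false", "3,true"))

    val parsed = rawRdd.map { line => val p = line.split(","); Rec(p(0).toInt, p(1).toBoolean) }
    parsed.cache()                               // mark it for reuse across actions

    val good = parsed.filter(r => r.isValid)
    val bad  = parsed.filter(r => !r.isValid)
    good.saveAsTextFile("out/good")              // action 1: computes and caches `parsed`
    bad.saveAsTextFile("out/bad")                // action 2: reuses the cached partitions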

Re: Reading a TSV file

2016-09-10 Thread Mich Talebzadeh
I gather the title should say CSV as opposed to TSV? Also, when the term spark-csv is used, is it a reference to the Databricks package? val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load.. or is it something new in 2.0, like spark-sql

Re: Reading a TSV file

2016-09-10 Thread Jacek Laskowski
Hi, If Spark 2.0 supports a format, use it. For CSV it's csv() or format("csv"). It should be supported by Scala and Java. If the API's broken for Java (but works for Scala), you'd have to create a "bridge" yourself or report an issue in Spark's JIRA @ https://issues.apache.org/jira/browse/SPARK.

Reading a TSV file

2016-09-10 Thread Muhammad Asif Abbasi
Hi, I would like to know the most efficient way of reading a TSV file in Scala, Python and Java with Spark 2.0. I believe with Spark 2.0 CSV is a native source based on the spark-csv module, and we can potentially read a "tsv" file by specifying 1. option("delimiter","\t") in Scala 2. sep

Re: Does it run distributed if class not Serializable

2016-09-10 Thread Yan Facai
I believe that Serializable is necessary for distributed execution. On Fri, Sep 9, 2016 at 7:10 PM, Gourav Sengupta wrote: > And you are using JAVA? > > AND WHY? > > Regards, > Gourav > > On Fri, Sep 9, 2016 at 11:47 AM, Yusuf Can Gürkan > wrote: >

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Mich Talebzadeh
Right, let us simplify this. Can you run the whole thing *once* only and send the DAG execution output from the UI? You can use a snipping tool to take the image. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
Hi Selvam, if your report lines are commented with any character (e.g. #), you can skip them via the comment option [1]. If you are using Spark 1.x, then you might be able to do this by manually skipping lines from the RDD and then turning it into a DataFrame as below. I haven't tested this, but I think this
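
A hedged version of that manual route (the path is a placeholder, and like the suggestion above it is untested): drop the first two report lines by index, then parse the remainder.

    val raw = sc.textFile("hdfs:///path/to/report_then_data.csv")

    val withoutReport = raw
      .zipWithIndex()
      .filter { case (_, idx) => idx >= 2 }   // skip the two leading report lines
      .map { case (line, _) => line }

    // ... split/parse `withoutReport` and build a DataFrame from it ...

If the file is instead read as one record via wholeTextFiles (as elsewhere in this thread), the equivalent is to drop the first two entries from the parsed list of lines.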

Spark CSV skip lines

2016-09-10 Thread Selvam Raman
Hi, I am using spark-csv to read a CSV file. The issue is that my file's first n lines contain some report, followed by the actual data (header and the rest of the data). So how can I skip the first n lines in spark-csv? I don't have any specific comment character in the first byte. Please give me some ideas.

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Rabin Banerjee
Hi, 1. You are doing some analytics I guess? *YES* 2. It is almost impossible to guess what is happening except that you are looping 50 times over the same set of SQL? *I am not looping any SQL; all SQLs are called exactly once, and each requires output from the previous SQL.* 3. Your

SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Bene
Hi, I am having a problem with the SparkR API. I need to subset a distributed data frame so I can extract single values from it on which I can then do calculations. Each row of my df has two integer values; I am creating a vector of new values calculated as a series of sin, cos, and tan functions on these

Re: spark streaming kafka connector questions

2016-09-10 Thread Cheng Yi
After some investigation, the problem I see is likely caused by a filter and union of the DStream. If I just do kafka-stream -- process -- output operator, then there is no problem; one event will be fetched once. If I do kafka-stream -- process(1) -- filter a stream A for later union --|

Re: spark streaming kafka connector questions

2016-09-10 Thread 毅程
Cody, thanks for the message. 1. As you mentioned, I did find the version for Kafka 0.10.1 and I will use that, although it has lots of experimental tags. Thank you. 2. I have done thorough investigation; it is NOT the scenario where the 1st process failed and then a 2nd process was triggered. 3. I do agree the session

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
Based on Ben's helpful error description, I managed to reproduce this bug and found the root cause: There's a bug in MemoryStore's PartiallySerializedBlock class: it doesn't close a serialization stream before attempting to deserialize its serialized values, causing it to miss any data stored in

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Mich Talebzadeh
Hi, 1. You are doing some analytics I guess? 2. It is almost impossible to guess what is happening except that you are looping 50 times over the same set of SQL? 3. Your SQL step n depends on step n-1, so Spark cannot get rid of steps 1 to n. 4. You are not storing anything in