Re: Dataframe schema...

2016-10-21 Thread Koert Kuipers
This rather innocent-looking optimization flag, nullable, has caused a lot of bugs... Makes me wonder if we are better off without it. On Oct 21, 2016 8:37 PM, "Muthu Jayakumar" wrote: > Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0. > > Thanks, > Muthu

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0. Thanks, Muthu On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian wrote: > Yea, confirmed. While analyzing unions, we treat StructTypes with > different field nullabilities as incompatible types and throws this

RDD to Dataset results in fixed number of partitions

2016-10-21 Thread Spark User
Hi All, I'm trying to create a Dataset from an RDD and do a groupBy on the Dataset. The groupBy stage runs with 200 partitions, although the RDD had 5000 partitions. I also seem to have no way to change those 200 partitions on the Dataset to some other, larger number. This seems to be affecting the
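The 200 partitions come from spark.sql.shuffle.partitions, which defaults to 200 and governs Dataset/DataFrame shuffles (groupBy, joins) regardless of how the input RDD was partitioned. A minimal Spark 2.x sketch; the SparkSession name spark, the Dataset ds, and the "key" column are placeholders:

    // Raise the shuffle parallelism used by Dataset/DataFrame operations.
    spark.conf.set("spark.sql.shuffle.partitions", "5000")

    val grouped = ds.groupBy("key").count()   // this shuffle now produces 5000 partitions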

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Chetan Khatri
Hello Cheng, Thank you for the response. I am using Spark 1.6.1, and I am writing around 350 gzip-compressed Parquet part files for a single table, having processed around 180 GB of data using Spark. On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian wrote: > What version of Spark are you using and how

Re: RDD groupBy() then random sort each group ?

2016-10-21 Thread Cheng Lian
I think it would be much easier to use the DataFrame API to do this, by doing a local sort with randn() as the key. For example, in Spark 2.0:

val df = spark.range(100)
val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42))

Replace df with a DataFrame wrapping your RDD, and $"id" % 10
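A hedged expansion of that snippet, assuming an existing RDD[Long] named rdd whose groups are defined by id % 10 (both assumptions are for illustration only):

    import org.apache.spark.sql.functions.randn
    import spark.implicits._

    val df = rdd.toDF("id")                    // wrap the RDD in a DataFrame
    val shuffled = df
      .repartition($"id" % 10)                 // co-locate each group's rows
      .sortWithinPartitions(randn(42))         // random order inside each partition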

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-21 Thread Cheng Lian
Efe - You probably hit this bug: https://issues.apache.org/jira/browse/SPARK-18058 On 10/21/16 2:03 AM, Agraj Mangal wrote: I have seen this error sometimes when the elements in the schema have different nullabilities. Could you print the schema for data and for

Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Cheng Lian
You may either use the SQL functions "array" and "named_struct" or define a case class with the expected field names. Cheng On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote: My expectation is: root |-- tag: vector namely, I want to extract from: [[tagCategory_060, 0.8], [tagCategory_029, 0.7]] to:
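A hedged sketch of the two suggestions; df and the column/field names (tag_a, weight_a, ...) are illustrative assumptions, not the poster's actual schema:

    import spark.implicits._

    // Option 1: build an array-of-struct column with the SQL functions array and named_struct.
    val reshaped = df.selectExpr(
      "array(named_struct('tag', tag_a, 'weight', weight_a), " +
      "named_struct('tag', tag_b, 'weight', weight_b)) AS tags")

    // Option 2: describe the expected shape with case classes and use a typed Dataset.
    case class Tag(tag: String, weight: String)
    case class Post(tags: Seq[Tag])
    val typed = reshaped.as[Post]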

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Yea, confirmed. While analyzing unions, we treat StructTypes with different field nullabilities as incompatible types and throw this error. Opened https://issues.apache.org/jira/browse/SPARK-18058 to track this issue. Thanks for reporting! Cheng On 10/21/16 3:15 PM, Cheng Lian wrote: Hi
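Until a fix lands, one hedged workaround is to force both sides onto schemas with identical nullability before the union. A minimal sketch that relaxes top-level nullability only (nested struct fields would need a recursive variant); d1 and d2 stand for the two DataFrames being unioned:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    def relaxNullability(df: DataFrame): DataFrame = {
      // Mark every top-level field nullable, then re-apply the schema to the same rows.
      val relaxed = StructType(df.schema.fields.map(_.copy(nullable = true)))
      df.sparkSession.createDataFrame(df.rdd, relaxed)
    }

    val unioned = relaxNullability(d1).union(relaxNullability(d2))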

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Hi Muthu, What version of Spark are you using? This seems to be a bug in the analysis phase. Cheng On 10/21/16 12:50 PM, Muthu Jayakumar wrote: Sorry for the late response. Here is what I am seeing... Schema from parquet file. d1.printSchema() root |-- task_id: string (nullable =

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Cheng Lian
What version of Spark are you using and how many output files does the job write out? By default, Spark versions before 1.6 (exclusive) write Parquet summary files when committing the job. This process reads footers from all Parquet files in the destination directory and merges them
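If the summary files turn out to be the cause, one hedged option on those 1.x versions is to disable summary metadata via the Hadoop configuration before writing; sc, df, and the output path below are placeholders:

    // Sketch: skip generation of _metadata/_common_metadata when committing the job.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    df.write.parquet("hdfs:///path/to/output")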

Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Chetan Khatri
Hello Spark Users, I am writing around 10 GB of processed data to Parquet on a Google Cloud machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores. Every time I write to Parquet, the Spark UI shows that the stages succeeded, but the spark shell holds the context in wait mode for almost 10 mins.

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
Sorry for the late response. Here is what I am seeing... Schema from the parquet file, d1.printSchema():
root
 |-- task_id: string (nullable = true)
 |-- task_name: string (nullable = true)
 |-- some_histogram: struct (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |--

About Reading Parquet - failed to read single gz parquet - failed entire transformation

2016-10-21 Thread Chetan Khatri
Hello Spark Users, I am working on historical data processing for a telecom provider. I processed a single job and wrote the output to Parquet in append mode. While reading, I am able to read the Parquet view table because it's just lazy evaluation. But when I apply an action on top of that by joining

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-21 Thread Cody Koeninger
That's a good point... the dstreams package is still on 10.0.1 though. I'll make a ticket to update it. On Fri, Oct 21, 2016 at 1:02 PM, Srikanth wrote: > Kakfa 0.10.1 release separates poll() from heartbeat. So session.timeout.ms > & max.poll.interval.ms can be set

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-21 Thread Srikanth
Kafka 0.10.1 release separates poll() from heartbeat, so session.timeout.ms & max.poll.interval.ms can be set differently. I'll leave it to you on how to add this to docs! On Thu, Oct 20, 2016 at 1:41 PM, Cody Koeninger wrote: > Right on, I put in a PR to make a note of
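For reference, with the direct stream these are ordinary consumer properties passed in kafkaParams. A hedged sketch; the broker address, group id, and timeout values are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"    -> "broker1:9092",
      "key.deserializer"     -> classOf[StringDeserializer],
      "value.deserializer"   -> classOf[StringDeserializer],
      "group.id"             -> "my-consumer-group",
      // With Kafka 0.10.1+ consumers these two can be tuned independently.
      "session.timeout.ms"   -> "30000",
      "max.poll.interval.ms" -> "300000"
    )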

Re: Where condition on columns of Arrays does no longer work in spark 2

2016-10-21 Thread Cheng Lian
Thanks for reporting! It's a bug, just filed a ticket to track it: https://issues.apache.org/jira/browse/SPARK-18053 Cheng On 10/20/16 1:54 AM, filthysocks wrote: I have a Column in a DataFrame that contains Arrays and I want to filter for equality. It works fine in spark 1.6 but not in
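For context, a minimal sketch of the kind of filter being described, comparing an array column against a literal array; df and the "values" column are illustrative assumptions:

    import org.apache.spark.sql.functions.{array, lit}
    import spark.implicits._

    // Keep rows whose "values" array equals Seq(1, 2).
    val matched = df.where($"values" === array(lit(1), lit(2)))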

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Cody Koeninger
DStream checkpoints have all kinds of other difficulties, biggest one being you can't use a checkpoint if your app code has been updated. If you can avoid checkpoints in general, I would. On Fri, Oct 21, 2016 at 11:17 AM, Erwan ALLAIN wrote: > Thanks for the fast answer

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
Thanks for the fast answer! I just feel annoyed and frustrated not to be able to use Spark checkpointing, because I believe their mechanism has been correctly tested. I'm afraid that reinventing the wheel can lead to side effects that I don't see now ... Anyway thanks again, I know what I

Re: [Spark ML] Using GBTClassifier in OneVsRest

2016-10-21 Thread Guo-Xun Yuan
Same questions here. GBTClassifier and MultilayerPerceptronClassifier extend Predictor[_,_] rather than Classifier[_,_]. However, both are classifiers. It looks like the class inheritance hierarchy is not strictly followed. I wonder if the community considers it an issue, and has a plan for the

Plotting decision boundary in non-linear logistic regression

2016-10-21 Thread aditya1702
Hello, I am working with Logistic Regression on non-linear data and I want to plot a decision boundary using the data. I don't know how to do it using a contour plot. Could someone help me out, please? This is the code I have written: from pyspark.ml.classification import LogisticRegression

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
Oh, also you mention 20 partitions. Is that how many you have? How many ratings? It may be worth trying to repartition to a larger number of partitions. On Fri, 21 Oct 2016 at 17:04, Nick Pentreath wrote: > I wonder if you can try with setting different blocks for user

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Cody Koeninger
0. If your processing time is regularly greater than your batch interval you're going to have problems anyway. Investigate this more, set maxRatePerPartition, something.
1. That's personally what I tend to do.
2. Why are you relying on checkpoints if you're storing offset state in the database?
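The rate cap mentioned in point 0 is a Spark configuration; a minimal sketch of setting it (the app name and value are only illustrative):

    import org.apache.spark.SparkConf

    // Limit how many records per Kafka partition per second the direct stream consumes.
    val conf = new SparkConf()
      .setAppName("kafka-direct")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")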

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
I wonder if you can try with setting different blocks for user and item? Are you able to try 2.0 or use Scala for setting it in 1.6? You want your item blocks to be a lot less than user blocks. Items maybe 5-10, users perhaps 250-500? Do you have many "power items" that are connected to almost
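A hedged sketch of the Scala route on 1.6, using the RDD-based API where user and item (product) blocks can be set separately; the block counts and other parameters below are placeholders, not recommendations:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings: RDD[Rating], as used by ALS.trainImplicit.
    val model = new ALS()
      .setImplicitPrefs(true)
      .setRank(20)
      .setIterations(10)
      .setUserBlocks(250)
      .setProductBlocks(10)
      .run(ratings)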

Issues with reading gz files with Spark Streaming

2016-10-21 Thread Nkechi Achara
Hi, I am using Spark 1.5.0 to read gz files with textFileStream, but new files dropped into the specified directory are not picked up. I know this is only the case with gz files, as when I extract the file into the specified directory the files are read on the next window and processed. My code is here:

Re: issue accessing Phoenix table from Spark

2016-10-21 Thread Jörn Franke
Have you verified that this class is in the fat jar? It looks like it is missing some of the HBase libraries ... > On 21 Oct 2016, at 11:45, Mich Talebzadeh wrote: > > Still does not work with Spark 2.0.0 on apache-phoenix-4.8.1-HBase-1.2-bin > > thanks > > Dr Mich

Drop partition with PURGE fail

2016-10-21 Thread bluishpenguin
Hi all, I am using Spark 2.0 and would like to execute the command below to drop a partition with PURGE. sql = "ALTER TABLE db.table1 DROP IF EXISTS PARTITION (pkey='150') PURGE" spark.sql(sql) It throws an exception: Py4JJavaError: An error occurred while calling o45.sql. :

Re: Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Yan Facai
My expectation is: root |-- tag: vector namely, I want to extract from: [[tagCategory_060, 0.8], [tagCategory_029, 0.7]] to: Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7)) I believe it needs two steps: 1. val tag2vec = {tag: Array[Structure] => Vector} 2. mblog_tags.withColumn("vec",
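A hedged sketch of those two steps with a UDF (Spark 2.0 ml vectors). The fixed size of 100, the "tags" column name, and the convention that the trailing digits of the tag name give the vector index are all assumptions taken from the example above:

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, udf}

    // Step 1: map the array-of-struct column to a sparse Vector.
    val tag2vec = udf { tags: Seq[Row] =>
      val elems = tags.map { r =>
        val idx    = r.getString(0).split("_").last.toInt   // "tagCategory_060" -> 60
        val weight = r.getString(1).toDouble                 // weight stored as a string
        (idx, weight)
      }
      Vectors.sparse(100, elems.sortBy(_._1))
    }

    // Step 2: add the vector column.
    val withVec = mblog_tags.withColumn("vec", tag2vec(col("tags")))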

Re: issue accessing Phoenix table from Spark

2016-10-21 Thread Mich Talebzadeh
Still does not work with Spark 2.0.0 on apache-phoenix-4.8.1-HBase-1.2-bin. Thanks, Dr Mich Talebzadeh. LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-21 Thread Agraj Mangal
I have seen this error sometimes when the elements in the schema have different nullabilities. Could you print the schema for data and for someCode.thatReturnsADataset() and see if there is any difference between the two ? On Fri, Oct 21, 2016 at 9:14 AM, Efe Selcuk wrote: >

Re: Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread lk_spark
How about changing the schema from:
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- weight: string (nullable = true)
to:
root
 |-- category: string (nullable = true)
 |-- weight:

Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
Hi, I'm currently implementing an exactly-once mechanism based on the following example: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala The pseudo code is as follows: dstream.transform (store offset in a variable on driver side )
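For readers following the thread, a hedged sketch of the offset-capture part of that pattern, using the Kafka direct stream API (the 0.10 integration has an equivalent HasOffsetRanges in the kafka010 package); dstream is assumed to come from createDirectStream:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]

    dstream.transform { rdd =>
      // transform's closure runs on the driver each batch, so offsets can be captured here.
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      val batch = rdd.collect()   // simplified; real code would usually write per partition
      // Store `batch` and `offsetRanges` in a single database transaction so that,
      // after a failure, processing resumes from the last committed offsets.
    }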

sql.functions partitionby AttributeError: 'NoneType' object has no attribute '_jvm'

2016-10-21 Thread muhammet pakyürek
I work with partitionBy for lead/lag functions and I get the error above. Here is the explanation: jspec = sc._jvm.org.apache.spark.sql.expressions.Window.partitionBy(_to_java_cols(cols))

How to clean the accumulator and broadcast from the driver manually?

2016-10-21 Thread Mungeol Heo
Hello, As I mentioned in the title, I want to know whether it is possible to clean the accumulator/broadcast from the driver manually, since the driver's memory keeps increasing. Someone says that the unpersist method removes them both from memory as well as disk on each executor node. But it stays on the
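For broadcast variables specifically, a hedged sketch of the two calls involved: unpersist only drops the cached copies on executors, while destroy also releases the driver-side data and metadata (after which the broadcast can no longer be used). Accumulators, as far as I know, are cleaned up by the ContextCleaner once they are no longer referenced on the driver. someLargeObject is a placeholder:

    val bc = sc.broadcast(someLargeObject)
    // ... use bc in jobs ...

    bc.unpersist(blocking = true)   // drop executor-side copies; re-broadcast happens lazily if used again
    bc.destroy()                    // release everything, including driver-side state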

Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Yan Facai
I don't know how to construct `array<struct<category:string,weight:string>>`. Could anyone help me? I try to get the array by: scala> mblog_tags.map(_.getSeq[(String, String)](0)) while the result is: res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:

RE: Can we disable parquet logs in Spark?

2016-10-21 Thread Yu, Yucai
I set "log4j.rootCategory=ERROR, console" and am using "-file conf/log4j.properties" to suppress most of the logs, but the org.apache.parquet logs are still there. Any way to disable them as well? Thanks, Yucai From: Yu, Yucai [mailto:yucai...@intel.com] Sent: Friday, October 21, 2016 2:50 PM To:
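One hedged possibility: in some parquet-mr versions these messages go through java.util.logging rather than log4j, which would explain why the log4j root level has no effect on them. If that is the case here, raising the JUL level for the parquet package may quiet them; note this has to run in the JVMs that actually emit the logs (the executors, for the YARN container stdout):

    import java.util.logging.{Level, Logger}

    // Keep a reference: JUL holds loggers weakly, so an unreferenced logger can be GC'd.
    val parquetJulLogger = Logger.getLogger("org.apache.parquet")
    parquetJulLogger.setLevel(Level.SEVERE)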

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
How many nodes are you using in the cluster? On Fri, 21 Oct 2016 at 08:58 Nikhil Mishra wrote: > Thanks Nick. > > So we do partition U x I matrix into BxB matrices, each of size around U/B > and I/B. Is that correct? Do you know whether a single block of the matrix

Re: [Spark ML] Using GBTClassifier in OneVsRest

2016-10-21 Thread Nick Pentreath
Currently no - GBT implements the Predictor interface, not the Classifier interface. It might be possible to wrap it in a wrapper that extends the Classifier trait. Hopefully GBT will support multi-class at some point. But you can use RandomForest, which does support multi-class. On Fri, 21 Oct 2016 at
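A hedged sketch of both routes; the parameter values and default feature/label columns are illustrative:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, RandomForestClassifier}

    // OneVsRest needs an estimator extending Classifier, e.g. LogisticRegression;
    // GBTClassifier extends Predictor and therefore does not qualify here.
    val ovr = new OneVsRest().setClassifier(new LogisticRegression().setMaxIter(50))

    // RandomForestClassifier handles multi-class labels directly.
    val rf = new RandomForestClassifier().setNumTrees(100)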

Can we disable parquet logs in Spark?

2016-10-21 Thread Yu, Yucai
Hi, I see lots of parquet logs in container logs(YARN mode), like below: stdout: Oct 21, 2016 2:27:30 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 8,448B for [ss_promo_sk] INT32: 5,996 values, 8,513B raw, 8,409B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED,

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
The blocks param sets both user and item blocks. Spark 2.0 supports separate user and item blocks for PySpark: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation On Fri, 21 Oct 2016 at 08:12 Nikhil Mishra wrote: > Hi, > > I

Re: spark pi example fail on yarn

2016-10-21 Thread Xi Shen
I see, I had this issue before. I think you are using Java 8, right? The Java 8 JVM requires more bootstrap heap memory. Turning off the memory check is an unsafe way to avoid this issue; I think it is better to increase the memory ratio, like this: yarn.nodemanager.vmem-pmem-ratio

ALS.trainImplicit block sizes

2016-10-21 Thread Nikhil Mishra
Hi, I have a question about the block size to be specified in ALS.trainImplicit() in pyspark (Spark 1.6.1). There is only one block size parameter to be specified, and I want to know whether that results in partitioning both the user and the item axes. For example, I am using the following