Re: Flattening XML in a DataFrame

2016-08-16 Thread Hyukjin Kwon
Sorry for late reply.

Currently, the library only supports loading XML documents as they are.

Do you mind opening an issue with some more explanation here:
https://github.com/databricks/spark-xml/issues?
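
In the meantime, if the goal is just to flatten whatever schema gets inferred
without hard-coding tag names, a rough sketch along these lines may help (this
is not part of spark-xml itself, and it does not handle array fields such as
subordinates.clerk, which still need an explicit explode):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively collect dotted paths to every leaf field of the inferred schema.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.flatMap { f =>
    val name = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case st: StructType => flattenSchema(st, name)
      case _              => Seq(name)
    }
  }

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "emp")
  .load("/tmp/sample.xml")

// Select every leaf field, aliasing "a.b.c" to "a_b_c" to get a flat table.
val flat = df.select(flattenSchema(df.schema).map(c => col(c).as(c.replace(".", "_"))): _*)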




2016-08-17 7:22 GMT+09:00 Sreekanth Jella :

> Hi Experts,
>
>
>
> Please suggest. Thanks in advance.
>
>
>
> Thanks,
>
> Sreekanth
>
>
>
> *From:* Sreekanth Jella [mailto:srikanth.je...@gmail.com]
> *Sent:* Sunday, August 14, 2016 11:46 AM
> *To:* 'Hyukjin Kwon' 
> *Cc:* 'user @spark' 
> *Subject:* Re: Flattening XML in a DataFrame
>
>
>
> Hi Hyukjin Kwon,
>
> Thank you for reply.
>
> There are several types of XML documents with different schemas which need
> to be parsed, and the tag names are not known in advance. All we know is the
> XSD for the given XML.
>
> Is it possible to get the same results even when we do not know the XML tags
> like manager.id and manager.name, or is it possible to read the tag names
> from the XSD and use them?
>
> Thanks,
> Sreekanth
>
>
>
> On Aug 12, 2016 9:58 PM, "Hyukjin Kwon"  wrote:
>
> Hi Sreekanth,
>
>
>
> Assuming you are using Spark 1.x,
>
>
>
> I believe this code below:
>
> sqlContext.read.format("com.databricks.spark.xml")
>   .option("rowTag", "emp")
>   .load("/tmp/sample.xml")
>   .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk")
>   .selectExpr("id", "name", "clerk.cid", "clerk.cname")
>   .show()
>
> would print the results below as you want:
>
> +---+----+---+-----+
> | id|name|cid|cname|
> +---+----+---+-----+
> |  1| foo|  1|  foo|
> |  1| foo|  1|  foo|
> +---+----+---+-----+
>
>
>
>
> I hope this is helpful.
>
>
>
> Thanks!
>
>
>
>
>
>
>
>
>
> 2016-08-13 9:33 GMT+09:00 Sreekanth Jella :
>
> Hi Folks,
>
>
>
> I am trying to flatten a variety of XMLs using DataFrames. I’m using the
> spark-xml package, which automatically infers my schema and creates a
> DataFrame.
>
>
>
> I do not want to hard-code any column names in the DataFrame, as I have many
> varieties of XML documents and each may have much deeper nesting of child
> nodes. I simply want to flatten any type of XML and then write the output
> data to a Hive table. Can you please give some expert advice on this?
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emps>
>  <emp>
>   <manager>
>    <id>1</id>
>    <name>foo</name>
>    <subordinates>
>     <clerk>
>      <cid>1</cid>
>      <cname>foo</cname>
>     </clerk>
>     <clerk>
>      <cid>1</cid>
>      <cname>foo</cname>
>     </clerk>
>    </subordinates>
>   </manager>
>  </emp>
> </emps>
>
>
>
> Expected output:
>
> id, name, clerk.cid, clerk.cname
>
> 1, foo, 2, cname2
>
> 1, foo, 3, cname3
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>
>
>
>


Re: JavaRDD to DataFrame fails with null pointer exception in 1.6.0

2016-08-16 Thread sudhir patil
Tested with Java 7 & 8; same issue on both versions.

On Aug 17, 2016 12:29 PM, "spats"  wrote:

> Cannot convert a JavaRDD to a DataFrame in Spark 1.6.0; it throws a null
> pointer exception with no more details. Can't really figure out what is
> happening. Any pointers to fixes?
>
> //convert JavaRDD to DataFrame
> DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
>
> // exception with no more details
> Exception in thread "main" java.lang.NullPointerException
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/JavaRDD-to-DataFrame-fails-with-null-
> pointer-exception-in-1-6-0-tp27547.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Alexey Svyatkovskiy
Hi Yanbo,

Thanks for your reply. I will keep an eye on that pull request.
For now, I decided to just put my code inside org.apache.spark.ml to be
able to access private classes.
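
For reference, a minimal sketch of that kind of workaround (the package and
object names here are purely illustrative): because VectorUDT is private[spark],
any file compiled into a package under org.apache.spark can see it, so a small
shim can re-export it.

package org.apache.spark.ml.shims

import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.sql.types.DataType

// Expose an instance of the package-private UDT so code in ordinary packages
// can use it, e.g. as the input/buffer type of a UDAF.
object MLVectorUDT {
  val dataType: DataType = new VectorUDT
}

Code in any other package can then refer to MLVectorUDT.dataType when building
the UDAF's input and buffer schemas.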

Thanks,
Alexey

On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang  wrote:

> It seems that VectorUDT is private and cannot be accessed outside of Spark
> currently. It should be public, but we need to do some refactoring before
> making it public. You can refer to the discussion at
> https://github.com/apache/spark/pull/12259 .
>
> Thanks
> Yanbo
>
> 2016-08-16 9:48 GMT-07:00 alexeys :
>
>> I am writing a UDAF to be applied to a data frame column of type Vector
>> (spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not have
>> to
>> go back and forth between dataframe and RDD.
>>
>> Inside the UDAF, I have to specify a data type for the input, buffer, and
>> output (as usual). VectorUDT is what I would use with
>> spark.mllib.linalg.Vector:
>> https://github.com/apache/spark/blob/master/mllib/src/main/
>> scala/org/apache/spark/mllib/linalg/Vectors.scala
>>
>> However, when I try to import it from spark.ml instead: import
>> org.apache.spark.ml.linalg.VectorUDT
>> I get a runtime error (no errors during the build):
>>
>> class VectorUDT in package linalg cannot be accessed in package
>> org.apache.spark.ml.linalg
>>
>> Is it expected/can you suggest a workaround?
>>
>> I am using Spark 2.0.0
>>
>> Thanks,
>> Alexey
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/VectorUDT-with-spark-ml-linalg-Vector-tp27542.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


JavaRDD to DataFrame fails with null pointer exception in 1.6.0

2016-08-16 Thread spats
Cannot convert a JavaRDD to a DataFrame in Spark 1.6.0; it throws a null
pointer exception with no more details. Can't really figure out what is
happening. Any pointers to fixes?

//convert JavaRDD to DataFrame
DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);

// exception with no more details
Exception in thread "main" java.lang.NullPointerException



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-to-DataFrame-fails-with-null-pointer-exception-in-1-6-0-tp27547.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Data frame Performance

2016-08-16 Thread Selvam Raman
Hi Mich,

The input and output are just examples; these are not the exact column names.
ColC is not needed.

The code I shared is working fine, but I need to confirm whether it is the
right approach and how it affects performance.

Thanks,
Selvam R
+91-97877-87724
On Aug 16, 2016 5:18 PM, "Mich Talebzadeh" 
wrote:

> Hi Selvan,
>
> is the table called sel?
>
> And are these assumptions correct?
>
> site -> ColA
> requests -> ColB
>
> I don't think you are using ColC here?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 16 August 2016 at 12:06, Selvam Raman  wrote:
>
>> Hi All,
>>
>> Please suggest me the best approach to achieve result. [ Please comment
>> if the existing logic is fine or not]
>>
>> Input Record :
>>
>> ColA ColB ColC
>> 1 2 56
>> 1 2 46
>> 1 3 45
>> 1 5 34
>> 1 5 90
>> 2 1 89
>> 2 5 45
>> ​
>> Expected Result
>>
>> ResA ResB
>> 1    2:2|3:3|5:5
>> 2    1:1|5:5
>>
>> I followd the below Spark steps
>>
>> (Spark version - 1.5.0)
>>
>> def valsplit(elem: scala.collection.mutable.WrappedArray[String]): String = {
>>   elem.map(e => e + ":" + e).mkString("|")
>> }
>>
>> sqlContext.udf.register("valudf",
>>   valsplit(_: scala.collection.mutable.WrappedArray[String]))
>>
>> val x = sqlContext.sql(
>>   "select site, valudf(collect_set(requests)) as test from sel group by site").first
>>
>>
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>
>


Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread ayan guha
Hi

Thank you for your reply. Yes, I can get the prediction and the original
features together. My question is how to tie them back to other parts of the
data, which were not in the LabeledPoint.

For example, I have a bunch of other dimensions which are not part of
features or label.

Sorry if this is a stupid question.

On Wed, Aug 17, 2016 at 12:57 PM, Yanbo Liang  wrote:

> MLlib keeps the original dataset during transformation; it just appends new
> columns to the existing DataFrame. That is, you can get both the prediction
> value and the original features from the output DataFrame of model.transform.
>
> Thanks
> Yanbo
>
> 2016-08-16 17:48 GMT-07:00 ayan guha :
>
>> Hi
>>
>> I have a dataset as follows:
>>
>> DF:
>> amount:float
>> date_read:date
>> meter_number:string
>>
>> I am trying to predict future amount based on past 3 weeks consumption
>> (and a heaps of weather data related to date).
>>
>> My Labelpoint looks like
>>
>> label (populated from DF.amount)
>> features (populated from a bunch of other stuff)
>>
>> Model.predict output:
>> label
>> prediction
>>
>> Now, I am trying to put together this prediction value back to meter
>> number and date_read from original DF?
>>
>> One way to assume order of records in DF and Model.predict will be
>> exactly same and zip two RDDs. But any other (possibly better) solution?
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark
currently. It should be public, but we need to do some refactoring before
making it public. You can refer to the discussion at
https://github.com/apache/spark/pull/12259 .

Thanks
Yanbo

2016-08-16 9:48 GMT-07:00 alexeys :

> I am writing a UDAF to be applied to a data frame column of type Vector
> (spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not have
> to
> go back and forth between dataframe and RDD.
>
> Inside the UDAF, I have to specify a data type for the input, buffer, and
> output (as usual). VectorUDT is what I would use with
> spark.mllib.linalg.Vector:
> https://github.com/apache/spark/blob/master/mllib/src/
> main/scala/org/apache/spark/mllib/linalg/Vectors.scala
>
> However, when I try to import it from spark.ml instead: import
> org.apache.spark.ml.linalg.VectorUDT
> I get a runtime error (no errors during the build):
>
> class VectorUDT in package linalg cannot be accessed in package
> org.apache.spark.ml.linalg
>
> Is it expected/can you suggest a workaround?
>
> I am using Spark 2.0.0
>
> Thanks,
> Alexey
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/VectorUDT-with-spark-ml-linalg-Vector-tp27542.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new
columns to the existing DataFrame. That is, you can get both the prediction
value and the original features from the output DataFrame of model.transform.
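
For example, with the DataFrame-based spark.ml API this looks roughly like the
following (a sketch only; df, the feature columns and the linear regression
model are illustrative stand-ins for the real pipeline):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Build the features column while keeping meter_number and date_read attached.
val assembler = new VectorAssembler()
  .setInputCols(Array("weather_1", "weather_2", "prev_week_amount"))  // illustrative
  .setOutputCol("features")

val prepared = assembler.transform(df).withColumnRenamed("amount", "label")

val model = new LinearRegression().fit(prepared)

// transform() appends a "prediction" column and keeps every input column.
model.transform(prepared)
  .select("meter_number", "date_read", "label", "prediction")
  .show()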

Thanks
Yanbo

2016-08-16 17:48 GMT-07:00 ayan guha :

> Hi
>
> I have a dataset as follows:
>
> DF:
> amount:float
> date_read:date
> meter_number:string
>
> I am trying to predict future amount based on past 3 weeks consumption
> (and a heaps of weather data related to date).
>
> My Labelpoint looks like
>
> label (populated from DF.amount)
> features (populated from a bunch of other stuff)
>
> Model.predict output:
> label
> prediction
>
> Now, I am trying to put together this prediction value back to meter
> number and date_read from original DF?
>
> One way to assume order of records in DF and Model.predict will be exactly
> same and zip two RDDs. But any other (possibly better) solution?
>
> --
> Best Regards,
> Ayan Guha
>


Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Michael Armbrust
Try running explain on each of these. My guess would be that caching is
broken in some cases.
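
For example, something like this in the shell shows what the planner is
actually doing in each case (paths as in your snippet):

spark.read.csv("people.csv").cache.explain(true)   // extended plan; compare how the cached relation shows up
spark.read.text("people.csv").cache.explain(true)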

On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski  wrote:

> Hi,
>
> Can anyone explain why spark.read.csv("people.csv").cache.show ends up
> with a WARN while spark.read.text("people.csv").cache.show does not?
> It happens in 2.0 and today's build.
>
> scala> sc.version
> res5: String = 2.1.0-SNAPSHOT
>
> scala> spark.read.csv("people.csv").cache.show
> +-+-+---++
> |  _c0|  _c1|_c2| _c3|
> +-+-+---++
> |kolumna 1|kolumna 2|kolumn3|size|
> |Jacek| Warszawa| Polska|  40|
> +-+-+---++
>
> scala> spark.read.csv("people.csv").cache.show
> 16/08/16 18:01:52 WARN CacheManager: Asked to cache already cached data.
> +-+-+---++
> |  _c0|  _c1|_c2| _c3|
> +-+-+---++
> |kolumna 1|kolumna 2|kolumn3|size|
> |Jacek| Warszawa| Polska|  40|
> +-+-+---++
>
> scala> spark.read.text("people.csv").cache.show
> ++
> |   value|
> ++
> |kolumna 1,kolumna...|
> |Jacek,Warszawa,Po...|
> ++
>
> scala> spark.read.text("people.csv").cache.show
> ++
> |   value|
> ++
> |kolumna 1,kolumna...|
> |Jacek,Warszawa,Po...|
> ++
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


create SparkSession without loading defaults for unit tests

2016-08-16 Thread Koert Kuipers
for unit tests i would like to create a SparkSession that does not load
anything from system properties, similar to:

new SQLContext(new SparkContext(new SparkConf(loadDefaults = false)))

how do i go about doing this? i dont see a way.
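
the closest i have come up with (a sketch, not verified -- it relies on the
builder reusing an already-started SparkContext):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf(loadDefaults = false)
  .setMaster("local[2]")      // required, since nothing is loaded from system properties
  .setAppName("unit-test")

val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()   // picks up the existing SparkContext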

thanks! koert


Re: GraphFrames 0.2.0 released

2016-08-16 Thread Jacek Laskowski
Hi Tim,

AWESOME. Thanks a lot for releasing it. That makes me even more eager
to see it in Spark's codebase (and replacing the current RDD-based
API)!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 16, 2016 at 9:32 AM, Tim Hunter  wrote:
> Hello all,
> I have released version 0.2.0 of the GraphFrames package. Apart from a few
> bug fixes, it is the first release published for Spark 2.0 and both scala
> 2.10 and 2.11. Please let us know if you have any comment or questions.
>
> It is available as a Spark package:
> https://spark-packages.org/package/graphframes/graphframes
>
> The source code is available as always at
> https://github.com/graphframes/graphframes
>
>
> What is GraphFrames?
>
> GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
> algorithms available in GraphX, users can write highly expressive queries by
> leveraging the DataFrame API, combined with a new API for motif finding. The
> user also benefits from DataFrame performance optimizations within the Spark
> SQL engine.
>
> Cheers
>
> Tim
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Cody Koeninger
The underlying Kafka consumer.
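
For reference, a SubscribePattern version of the earlier snippet looks roughly
like this (the topic pattern and String deserializer types are illustrative;
ssc and kafkaParams as in the quoted code below):

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

// The pattern is re-checked periodically, so partitions (and topics) added
// later are picked up by the underlying consumer.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("mytopic.*"), kafkaParams))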

On Tue, Aug 16, 2016 at 2:17 PM, Srikanth  wrote:
> Yes, SubscribePattern detects new partitions. Also, it has a comment saying
>
>> Subscribe to all topics matching specified pattern to get dynamically
>> assigned partitions.
>>  * The pattern matching will be done periodically against topics existing
>> at the time of check.
>>  * @param pattern pattern to subscribe to
>>  * @param kafkaParams Kafka
>
>
> Who does the new partition discovery? The underlying Kafka consumer or
> spark-streaming-kafka-0-10-assembly?
>
> Srikanth
>
> On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger  wrote:
>>
>> Hrrm, that's interesting. Did you try with subscribe pattern, out of
>> curiosity?
>>
>> I haven't tested repartitioning on the  underlying new Kafka consumer, so
>> it's possible I misunderstood something.
>>
>> On Aug 12, 2016 2:47 PM, "Srikanth"  wrote:
>>>
>>> I did try a test with spark 2.0 + spark-streaming-kafka-0-10-assembly.
>>> Partition was increased using "bin/kafka-topics.sh --alter" after spark
>>> job was started.
>>> I don't see messages from new partitions in the DStream.
>>>
 KafkaUtils.createDirectStream[Array[Byte], Array[Byte]] (
 ssc, PreferConsistent, Subscribe[Array[Byte],
 Array[Byte]](topics, kafkaParams) )
 .map(r => (r.key(), r.value()))
>>>
>>>
>>> Also, no.of partitions did not increase too.

 dataStream.foreachRDD( (rdd, curTime) => {
  logger.info(s"rdd has ${rdd.getNumPartitions} partitions.")
>>>
>>>
>>> Should I be setting some parameter/config? Is the doc for new integ
>>> available?
>>>
>>> Thanks,
>>> Srikanth
>>>
>>> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger 
>>> wrote:

 No, restarting from a checkpoint won't do it, you need to re-define the
 stream.

 Here's the jira for the 0.10 integration

 https://issues.apache.org/jira/browse/SPARK-12177

 I haven't gotten docs completed yet, but there are examples at

 https://github.com/koeninger/kafka-exactly-once/tree/kafka-0.10

 On Fri, Jul 22, 2016 at 1:05 PM, Srikanth  wrote:
 > In Spark 1.x, if we restart from a checkpoint, will it read from new
 > partitions?
 >
 > If you can, pls point us to some doc/link that talks about Kafka 0.10
 > integ
 > in Spark 2.0.
 >
 > On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger 
 > wrote:
 >>
 >> For the integration for kafka 0.8, you are literally starting a
 >> streaming job against a fixed set of topicapartitions,  It will not
 >> change throughout the job, so you'll need to restart the spark job if
 >> you change kafka partitions.
 >>
 >> For the integration for kafka 0.10 / spark 2.0, if you use subscribe
 >> or subscribepattern, it should pick up new partitions as they are
 >> added.
 >>
 >> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth 
 >> wrote:
 >> > Hello,
 >> >
 >> > I'd like to understand how Spark Streaming(direct) would handle
 >> > Kafka
 >> > partition addition?
 >> > Will a running job be aware of new partitions and read from it?
 >> > Since it uses Kafka APIs to query offsets and offsets are handled
 >> > internally.
 >> >
 >> > Srikanth
 >
 >
>>>
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Jacek Laskowski
Hi,

Can anyone explain why spark.read.csv("people.csv").cache.show ends up
with a WARN while spark.read.text("people.csv").cache.show does not?
It happens in 2.0 and today's build.

scala> sc.version
res5: String = 2.1.0-SNAPSHOT

scala> spark.read.csv("people.csv").cache.show
+-+-+---++
|  _c0|  _c1|_c2| _c3|
+-+-+---++
|kolumna 1|kolumna 2|kolumn3|size|
|Jacek| Warszawa| Polska|  40|
+-+-+---++

scala> spark.read.csv("people.csv").cache.show
16/08/16 18:01:52 WARN CacheManager: Asked to cache already cached data.
+-+-+---++
|  _c0|  _c1|_c2| _c3|
+-+-+---++
|kolumna 1|kolumna 2|kolumn3|size|
|Jacek| Warszawa| Polska|  40|
+-+-+---++

scala> spark.read.text("people.csv").cache.show
++
|   value|
++
|kolumna 1,kolumna...|
|Jacek,Warszawa,Po...|
++

scala> spark.read.text("people.csv").cache.show
++
|   value|
++
|kolumna 1,kolumna...|
|Jacek,Warszawa,Po...|
++

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread ayan guha
Hi

I have a dataset as follows:

DF:
amount:float
date_read:date
meter_number:string

I am trying to predict future amount based on past 3 weeks consumption (and
a heaps of weather data related to date).

My Labelpoint looks like

label (populated from DF.amount)
features (populated from a bunch of other stuff)

Model.predict output:
label
prediction

Now, I am trying to put together this prediction value back to meter number
and date_read from original DF?

One way to assume order of records in DF and Model.predict will be exactly
same and zip two RDDs. But any other (possibly better) solution?

-- 
Best Regards,
Ayan Guha


Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Chanh Le
Hi Michael,

You should use Alluxio instead.
http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html 

It should be easier.
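
The basic usage is just pointing reads and writes at alluxio:// paths (the
host, port and paths below are only illustrative, and the Alluxio client jar
has to be on Spark's classpath as described in that doc):

// Persist data in Alluxio instead of Spark's own replicated off-heap storage.
val data = sc.textFile("alluxio://alluxio-master:19998/input/data.txt")
data.saveAsTextFile("alluxio://alluxio-master:19998/output/data")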


Regards,
Chanh



> On Aug 17, 2016, at 5:45 AM, Michael Allman  wrote:
> 
> Hello,
> 
> A coworker was having a problem with a big Spark job failing after several 
> hours when one of the executors would segfault. That problem aside, I 
> speculated that her job would be more robust against these kinds of executor 
> crashes if she used replicated RDD storage. She's using off heap storage (for 
> good reason), so I asked her to try running her job with the following 
> storage level: `StorageLevel(useDisk = true, useMemory = true, useOffHeap = 
> true, deserialized = false, replication = 2)`. The job would immediately fail 
> with a rather suspicious looking exception. For example:
> 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 
> or
> 
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> 

Can't connect to remote spark standalone cluster: getting WARN TaskSchedulerImpl: Initial job has not accepted any resources

2016-08-16 Thread Andrew Vykhodtsev
Dear all,

I am trying to connect a remote Windows machine to a standalone Spark
cluster (a single VM running on Ubuntu Server with 8 cores and 64GB RAM).
Both client and server have Spark 2.0 software prebuilt for Hadoop 2.6, and
Hadoop 2.7.

I have the following settings on cluster:

export SPARK_WORKER_MEMORY=32G
export SPARK_WORKER_CORES=8

and the following settings on client (spark-defaults.conf)

spark.driver.memory  4g
spark.executor.memory8g
spark.executor.cores  2


When I start pyspark, everything works smoothly. In Spark UI, I see that my
app is running and has 4 executors attached to it, each with 2 cores and 8g
of memory.

However, when I try to read some HDFS files, it hangs and gives me the
following message in a loop.

>>> df = sqlContext.read.parquet('/projects/kaggle-bimbo/dataset_full.pqt')

16/08/17 01:04:34 WARN DomainSocketFactory: The short-circuit local reads
feature cannot be used because UNIX Domain sockets are not available on
Windows.
16/08/17 01:04:52 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources

I go to the Spark UI again and see that it actually tries to start another set
of 4 executors. These executors hang for some time, fail and start again.
So when this app is left alone it generates many executors in the status
"EXITED". Nothing really happens. If I go to the application UI, it just shows
me:
Stages: 0/1 (1 failed) and
Tasks: No tasks have started yet

Is it a bug or am I doing something wrong? Looks like a recurrence of
https://issues.apache.org/jira/browse/SPARK-2260


Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Michael Allman
Hello,

A coworker was having a problem with a big Spark job failing after several 
hours when one of the executors would segfault. That problem aside, I 
speculated that her job would be more robust against these kinds of executor 
crashes if she used replicated RDD storage. She's using off heap storage (for 
good reason), so I asked her to try running her job with the following storage 
level: `StorageLevel(useDisk = true, useMemory = true, useOffHeap = true, 
deserialized = false, replication = 2)`. The job would immediately fail with a 
rather suspicious looking exception. For example:

com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

or

java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at 

RE: Flattening XML in a DataFrame

2016-08-16 Thread Sreekanth Jella
Hi Experts,

 

Please suggest. Thanks in advance.

 

Thanks,

Sreekanth

 

From: Sreekanth Jella [mailto:srikanth.je...@gmail.com] 
Sent: Sunday, August 14, 2016 11:46 AM
To: 'Hyukjin Kwon' 
Cc: 'user @spark' 
Subject: Re: Flattening XML in a DataFrame

 

Hi Hyukjin Kwon,

Thank you for reply.

There are several types of XML documents with different schemas which need to
be parsed, and the tag names are not known in advance. All we know is the XSD
for the given XML.

Is it possible to get the same results even when we do not know the XML tags
like manager.id and manager.name, or is it possible to read the tag names from
the XSD and use them?

Thanks, 
Sreekanth

 

On Aug 12, 2016 9:58 PM, "Hyukjin Kwon"  > wrote:

Hi Sreekanth,

 

Assuming you are using Spark 1.x,

 

I believe this code below:

sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "emp")
  .load("/tmp/sample.xml")
  .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk")
  .selectExpr("id", "name", "clerk.cid", "clerk.cname")
  .show()

would print the results below as you want:

+---+----+---+-----+
| id|name|cid|cname|
+---+----+---+-----+
|  1| foo|  1|  foo|
|  1| foo|  1|  foo|
+---+----+---+-----+


 

I hope this is helpful.

 

Thanks!

 

 

 

 

2016-08-13 9:33 GMT+09:00 Sreekanth Jella  >:

Hi Folks,

 

I am trying to flatten a variety of XMLs using DataFrames. I’m using the
spark-xml package, which automatically infers my schema and creates a DataFrame.

 

I do not want to hard-code any column names in the DataFrame, as I have many
varieties of XML documents and each may have much deeper nesting of child nodes.
I simply want to flatten any type of XML and then write the output data to a
Hive table. Can you please give some expert advice on this?

 

Example XML and expected output is given below.

 

Sample XML:

<emps>
 <emp>
  <manager>
   <id>1</id>
   <name>foo</name>
   <subordinates>
    <clerk>
     <cid>1</cid>
     <cname>foo</cname>
    </clerk>
    <clerk>
     <cid>1</cid>
     <cname>foo</cname>
    </clerk>
   </subordinates>
  </manager>
 </emp>
</emps>
 

Expected output:

id, name, clerk.cid, clerk.cname

1, foo, 2, cname2

1, foo, 3, cname3

 

Thanks,

Sreekanth Jella

 

 



Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Aris
Hello Michael,

I made a JIRA with sample code to reproduce this problem. I set you as the
"shepherd" -- I Hope this is enough, otherwise I can fix it.

https://issues.apache.org/jira/browse/SPARK-17092



On Sun, Aug 14, 2016 at 9:38 AM, Michael Armbrust 
wrote:

> Anytime you see JaninoRuntimeException you are seeing a bug in our code
> generation.  If you can come up with a small example that causes the
> problem it would be very helpful if you could open a JIRA.
>
> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar 
> wrote:
>
>> I see a similar issue being resolved recently: https://issues.apach
>> e.org/jira/browse/SPARK-15285
>>
>> On Fri, Aug 12, 2016 at 3:33 PM, Aris  wrote:
>>
>>> Hello folks,
>>>
>>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>>> cryptic error messages:
>>>
>>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
 "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spa
 rk/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst
 .expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

>>>
>>> Unfortunately I'm not clear on how to even isolate the source of this
>>> problem. I didn't have this problem in Spark 1.6.1.
>>>
>>> Any clues?
>>>
>>
>>
>>
>> --
>> -Dhruve Ashar
>>
>>
>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Aris
My error is specifically:

Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
> of class "org.apache.spark.sql.catalyst.ex
> pressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB


Here is an easy way to reproduce in spark-shell on a cluster!

import org.apache.spark.sql.types.{DoubleType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

//val spark = SparkSession.builder.getOrCreate

val COLMAX: Double = 1000.0
val ROWSIZE: Int = 1000

val intToRow: Int => Row = (i: Int) =>
  Row.fromSeq(Range.Double.inclusive(1.0, COLMAX, 1.0).toSeq)
val schema: StructType = (1 to COLMAX.toInt).foldLeft(new StructType())(
  (s, i) => s.add(i.toString, DoubleType, nullable = true))
val rdds = spark.sparkContext.parallelize((1 to ROWSIZE).map(intToRow))
val df = spark.createDataFrame(rdds, schema)
val Array(left, right) = df.randomSplit(Array(.8,.2))

// This crashes
left.count




On Tue, Aug 16, 2016 at 8:56 AM, Ted Yu  wrote:

> Can you take a look at commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf ?
>
> There was a test:
> SPARK-15285 Generated SpecificSafeProjection.apply method grows beyond 64KB
>
> See if it matches your use case.
>
> On Tue, Aug 16, 2016 at 8:41 AM, Aris  wrote:
>
>> I am still working on making a minimal test that I can share without my
>> work-specific code being in there. However, the problem occurs with a
>> dataframe with several hundred columns being asked to do a random split.
>> Random split works with up to about 350 columns so far. It breaks in my
>> code with 600 columns, but it's a converted dataset of case classes to
>> dataframe. This is deterministically causing the error in Scala 2.11.
>>
>> Once I can get a deterministically breaking test without work code I will
>> try to file a Jira bug.
>>
>> On Tue, Aug 16, 2016, 04:17 Ted Yu  wrote:
>>
>>> I think we should reopen it.
>>>
>>> On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki 
>>> wrote:
>>>
>>> I just realized it since it broken a build with Scala 2.10.
>>> https://github.com/apache/spark/commit/fa244e5a90690d6a31be5
>>> 0f2aa203ae1a2e9a1cf
>>>
>>> I can reproduce the problem in SPARK-15285 with master branch.
>>> Should we reopen SPARK-15285?
>>>
>>> Best Regards,
>>> Kazuaki Ishizaki,
>>>
>>>
>>>
>>> From:Ted Yu 
>>> To:dhruve ashar 
>>> Cc:Aris , "user@spark.apache.org" <
>>> user@spark.apache.org>
>>> Date:2016/08/15 06:19
>>> Subject:Re: Spark 2.0.0 JaninoRuntimeException
>>> --
>>>
>>>
>>>
>>> Looks like the proposed fix was reverted:
>>>
>>> Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply
>>> method grows beyond 64 KB"
>>>
>>> This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.
>>>
>>> Maybe this was fixed in some other JIRA ?
>>>
>>> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar <*dhruveas...@gmail.com*
>>> > wrote:
>>> I see a similar issue being resolved recently:
>>> *https://issues.apache.org/jira/browse/SPARK-15285*
>>> 
>>>
>>> On Fri, Aug 12, 2016 at 3:33 PM, Aris <*arisofala...@gmail.com*
>>> > wrote:
>>> Hello folks,
>>>
>>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>>> cryptic error messages:
>>>
>>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/
>>> spark/sql/catalyst/InternalRow;)I" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
>>> grows beyond 64 KB
>>>
>>> Unfortunately I'm not clear on how to even isolate the source of this
>>> problem. I didn't have this problem in Spark 1.6.1.
>>>
>>> Any clues?
>>>
>>>
>>>
>>> --
>>> -Dhruve Ashar
>>>
>>>
>>>
>>>
>


Re: Sum array values by row in new column

2016-08-16 Thread Javier Rey
Hi, Thanks!!

This works, but I also need the mean :) I am still looking for a way.

Regards.

2016-08-16 5:30 GMT-05:00 ayan guha :

> Here is a more generic way of doing this:
>
> from pyspark.sql import Row
> df = sc.parallelize([[1,2,3,4],[10,20,30]]).map(lambda x:
> Row(numbers=x)).toDF()
> df.show()
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> u = udf(lambda c: sum(c), IntegerType())
> df1 = df.withColumn("s",u(df.numbers))
> df1.show()
>
> On Tue, Aug 16, 2016 at 11:50 AM, Mike Metzger  > wrote:
>
>> Assuming you know the number of elements in the list, this should work:
>>
>> df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) +
>> df["_1"].getItem(2))
>>
>> Mike
>>
>> On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey  wrote:
>>
>>> Hi everyone,
>>>
>>> I have one dataframe with one column; this column is an array of numbers.
>>> How can I sum each array by row and obtain a new column with the sum, in pyspark?
>>>
>>> Example:
>>>
>>> +------------+
>>> |     numbers|
>>> +------------+
>>> |[10, 20, 30]|
>>> |[40, 50, 60]|
>>> |[70, 80, 90]|
>>> +------------+
>>>
>>> The idea is to obtain the same df with a new column with totals:
>>>
>>> +------------+-----+
>>> |     numbers|     |
>>> +------------+-----+
>>> |[10, 20, 30]|   60|
>>> |[40, 50, 60]|  150|
>>> |[70, 80, 90]|  240|
>>> +------------+-----+
>>>
>>> Regards!
>>>
>>> Samir
>>>
>>>
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Srikanth
Yes, SubscribePattern detects new partitions. Also, it has a comment saying

Subscribe to all topics matching specified pattern to get dynamically
> assigned partitions.
>  * The pattern matching will be done periodically against topics existing
> at the time of check.
>  * @param pattern pattern to subscribe to
>  * @param kafkaParams Kafka


Who does the new partition discovery? The underlying Kafka consumer or
spark-streaming-kafka-0-10-assembly?

Srikanth

On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger  wrote:

> Hrrm, that's interesting. Did you try with subscribe pattern, out of
> curiosity?
>
> I haven't tested repartitioning on the  underlying new Kafka consumer, so
> it's possible I misunderstood something.
> On Aug 12, 2016 2:47 PM, "Srikanth"  wrote:
>
>> I did try a test with spark 2.0 + spark-streaming-kafka-0-10-assembly.
>> Partition was increased using "bin/kafka-topics.sh --alter" after spark
>> job was started.
>> I don't see messages from new partitions in the DStream.
>>
>> KafkaUtils.createDirectStream[Array[Byte], Array[Byte]] (
>>> ssc, PreferConsistent, Subscribe[Array[Byte],
>>> Array[Byte]](topics, kafkaParams) )
>>> .map(r => (r.key(), r.value()))
>>
>>
>> Also, no.of partitions did not increase too.
>>
>>> dataStream.foreachRDD( (rdd, curTime) => {
>>>  logger.info(s"rdd has ${rdd.getNumPartitions} partitions.")
>>
>>
>> Should I be setting some parameter/config? Is the doc for new integ
>> available?
>>
>> Thanks,
>> Srikanth
>>
>> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger 
>> wrote:
>>
>>> No, restarting from a checkpoint won't do it, you need to re-define the
>>> stream.
>>>
>>> Here's the jira for the 0.10 integration
>>>
>>> https://issues.apache.org/jira/browse/SPARK-12177
>>>
>>> I haven't gotten docs completed yet, but there are examples at
>>>
>>> https://github.com/koeninger/kafka-exactly-once/tree/kafka-0.10
>>>
>>> On Fri, Jul 22, 2016 at 1:05 PM, Srikanth  wrote:
>>> > In Spark 1.x, if we restart from a checkpoint, will it read from new
>>> > partitions?
>>> >
>>> > If you can, pls point us to some doc/link that talks about Kafka 0.10
>>> integ
>>> > in Spark 2.0.
>>> >
>>> > On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger 
>>> wrote:
>>> >>
>>> >> For the integration for kafka 0.8, you are literally starting a
>>> >> streaming job against a fixed set of topicapartitions,  It will not
>>> >> change throughout the job, so you'll need to restart the spark job if
>>> >> you change kafka partitions.
>>> >>
>>> >> For the integration for kafka 0.10 / spark 2.0, if you use subscribe
>>> >> or subscribepattern, it should pick up new partitions as they are
>>> >> added.
>>> >>
>>> >> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth 
>>> wrote:
>>> >> > Hello,
>>> >> >
>>> >> > I'd like to understand how Spark Streaming(direct) would handle
>>> Kafka
>>> >> > partition addition?
>>> >> > Will a running job be aware of new partitions and read from it?
>>> >> > Since it uses Kafka APIs to query offsets and offsets are handled
>>> >> > internally.
>>> >> >
>>> >> > Srikanth
>>> >
>>> >
>>>
>>
>>


Large where clause StackOverflow 1.5.2

2016-08-16 Thread rachmaninovquartet
Hi,

I'm trying to implement a folding function in Spark. It takes an input k and a
data frame of ids and dates. k=1 will be just the data frame; k=2 will consist
of the min and max date for each id once and the rest twice; k=3 will consist
of min and max once, min+1 and max-1 twice, and the rest three times; etc.

Code in Scala, with variable names changed (df is the input DataFrame, declared
as a var so it can be reassigned below):

  val acctMinDates = df.groupBy("id").agg(min("thedate"))
  val acctMaxDates = df.groupBy("id").agg(max("thedate"))
  val acctDates = acctMinDates.join(acctMaxDates, "id").collect()

  var filterString = ""

  for (i <- 1 to k - 1) {
    if (i == 1) {
      for (aDate <- acctDates) {
        filterString = filterString + "(id = " + aDate(0) + " and thedate > " +
          aDate(1) + " and thedate < " + aDate(2) + ") or "
      }
      filterString = filterString.substring(0, filterString.size - 4)
    }
    df = df.unionAll(df.where(filterString))
  }

Code that is being attempted to translate, from pandas/python:

df = pd.concat([df.groupby('id').apply(lambda x: pd.concat([x.iloc[i: i + k]
for i in range(len(x.index) - k + 1)]))])
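
If the StackOverflow comes from analyzing that very long filter string, one
alternative to try (a sketch only, untested, and it just mirrors the structure
of the loop above) is to keep the per-id bounds in a DataFrame and join instead
of building a SQL string:

import org.apache.spark.sql.functions.{col, max, min}

// One row of (id, mindate, maxdate) per id.
val bounds = df.groupBy("id")
  .agg(min("thedate").as("mindate"), max("thedate").as("maxdate"))

var folded = df
for (i <- 1 to k - 1) {
  // Rows strictly between each id's min and max date, restored to df's columns.
  val interior = folded.join(bounds, "id")
    .where(col("thedate") > col("mindate") && col("thedate") < col("maxdate"))
    .select(df.columns.map(col): _*)
  folded = folded.unionAll(interior)
}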



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Large-where-clause-StackOverflow-1-5-2-tp27544.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Executor Metrics

2016-08-16 Thread Otis Gospodnetić
Hi Muhammad,

You should give people a bit more time to answer/help you (for free). :)

I don't have a direct answer for you, but you can look at SPM for Spark, which
has all the instructions for getting all Spark metrics (Executors, etc.) into
SPM. It doesn't involve the sink.csv stuff.

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Aug 16, 2016 at 11:21 AM, Muhammad Haris <
muhammad.haris.makh...@gmail.com> wrote:

> Still waiting for response, any clue/suggestions?
>
>
> On Tue, Aug 16, 2016 at 4:48 PM, Muhammad Haris <
> muhammad.haris.makh...@gmail.com> wrote:
>
>> Hi,
>> I have been trying to collect driver, master, worker and executors
>> metrics using Spark 2.0 in standalone mode, here is what my metrics
>> configuration file looks like:
>>
>> *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
>> *.sink.csv.period=1
>> *.sink.csv.unit=seconds
>> *.sink.csv.directory=/root/metrics/
>> executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
>> master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
>> worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
>> driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
>>
>> Once the application is complete, I can only see the driver's metrics; I have
>> checked the directories on all the worker nodes as well.
>> Could anybody please help me figure out what I am doing wrong here?
>>
>>
>>
>> Regards
>>
>>
>>
>


Re: DataFrame use case

2016-08-16 Thread Sean Owen
I'd say that Datasets, not DataFrames, are the natural evolution of
RDDs. DataFrames are for inherently tabular data, and most naturally
manipulated by SQL-like operations. Datasets operate on programming
language objects like RDDs.

So, RDDs to DataFrames isn't quite apples-to-apples to begin with.
It's just never true that "X is always faster than Y" in a case like
this. Indeed your case doesn't sound like anything where a tabular
representation would be beneficial. There's overhead to treating it
like that. You're doing almost nothing to the data itself except
counting it, and RDDs have the lowest overhead of the three concepts
because they treat their contents as opaque objects anyway.

The benefit comes when you do things like SQL-like operations on
tabular data in the DataFrame API instead of RDD API. That's where
more optimization can kick in. Dataset brings some of the same
possible optimizations to an RDD-like API because it has more
knowledge of the type and nature of the entire data set.

If you're really only manipulating byte arrays, I don't know if
DataFrame adds anything. I know Dataset has some specialization for
byte[], so I'd expect you could see some storage benefits over RDDs,
maybe.
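
As a rough illustration of the distinction (a sketch in Spark 2.0 style; the
Tile class is made up):

import spark.implicits._

case class Tile(key: String, bytes: Array[Byte])

val rdd = spark.sparkContext.parallelize(Seq(Tile("a", Array[Byte](1, 2, 3))))
val ds  = spark.createDataset(rdd)   // typed: operates on Tile objects
val df  = ds.toDF()                  // untyped rows: tabular, SQL-style

df.groupBy("key").count().show()     // where DataFrame/SQL optimizations help
rdd.map(_.bytes.length).sum()        // plain object manipulation, lowest overhead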

On Tue, Aug 16, 2016 at 6:32 PM, jtgenesis  wrote:
> Hey guys, I've been digging around trying to figure out if I should
> transition from RDDs to DataFrames. I'm currently using RDDs to represent
> tiles of binary imagery data and I'm wondering if representing the data as a
> DataFrame is a better solution.
>
> To get my feet wet, I did a little comparison on a Word Count application,
> on a 1GB file of random text, using an RDD and DataFrame. And I got the
> following results:
>
> RDD Count total: 137733312 Time Elapsed: 44.5675378 s
> DataFrame Count total: 137733312 Time Elapsed: 69.201253448 s
>
> I figured the DataFrame would outperform the RDD, since I've seen many
> sources that state superior speeds with DataFrames. These results could just
> be an implementation issue, unstructured data, or a result of the data
> source. I'm not really sure.
>
> This leads me to take a step back and figure out what applications are
> better suited with DataFrames than RDDs? In my case, while the original
> image file is unstructured, the data is loaded into a pairRDD, where the key
> contains multiple attributes that pertain to the value. The value is a chunk
> of the image represented as an array of bytes. Since, my data will be in a
> structured format, I don't see why I can't benefit from DataFrames. However,
> should I be concerned of any performance issues that pertain to
> processing/moving of byte array (each chunk is uniform size in the KB-MB
> range). I'll potentially be scanning the entire image, select specific image
> tiles and perform some work on them.
>
> If DataFrames are well suited for my use case, how does the data source
> affect my performance? I could always just load data into an RDD and convert
> to DataFrame, or I could convert the image into a parquet file and create
> DataFrames directly. Is one way recommended over the other?
>
> These are a lot of questions, and I'm still trying to ingest and make sense
> of everything. Any feedback would be greatly appreciated.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-use-case-tp27543.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



DataFrame use case

2016-08-16 Thread jtgenesis
Hey guys, I've been digging around trying to figure out if I should
transition from RDDs to DataFrames. I'm currently using RDDs to represent
tiles of binary imagery data and I'm wondering if representing the data as a
DataFrame is a better solution.

To get my feet wet, I did a little comparison on a Word Count application,
on a 1GB file of random text, using an RDD and DataFrame. And I got the
following results:

RDD Count total: 137733312 Time Elapsed: 44.5675378 s
DataFrame Count total: 137733312 Time Elapsed: 69.201253448 s

I figured the DataFrame would outperform the RDD, since I've seen many
sources that state superior speeds with DataFrames. These results could just
be an implementation issue, unstructured data, or a result of the data
source. I'm not really sure. 

This leads me to take a step back and figure out what applications are
better suited with DataFrames than RDDs? In my case, while the original
image file is unstructured, the data is loaded into a pairRDD, where the key
contains multiple attributes that pertain to the value. The value is a chunk
of the image represented as an array of bytes. Since, my data will be in a
structured format, I don't see why I can't benefit from DataFrames. However,
should I be concerned of any performance issues that pertain to
processing/moving of byte array (each chunk is uniform size in the KB-MB
range). I'll potentially be scanning the entire image, select specific image
tiles and perform some work on them.

If DataFrames are well suited for my use case, how does the data source
affect my performance? I could always just load data into an RDD and convert
to DataFrame, or I could convert the image into a parquet file and create
DataFrames directly. Is one way recommended over the other?

These are a lot of questions, and I'm still trying to ingest and make sense
of everything. Any feedback would be greatly appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-use-case-tp27543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread alexeys
I am writing a UDAF to be applied to a data frame column of type Vector
(spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not have to
go back and forth between dataframe and RDD. 

Inside the UDAF, I have to specify a data type for the input, buffer, and
output (as usual). VectorUDT is what I would use with
spark.mllib.linalg.Vector: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

However, when I try to import it from spark.ml instead: import
org.apache.spark.ml.linalg.VectorUDT 
I get a runtime error (no errors during the build): 

class VectorUDT in package linalg cannot be accessed in package
org.apache.spark.ml.linalg 

Is it expected/can you suggest a workaround? 

I am using Spark 2.0.0

Thanks, 
Alexey



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/VectorUDT-with-spark-ml-linalg-Vector-tp27542.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



GraphFrames 0.2.0 released

2016-08-16 Thread Tim Hunter
Hello all,
I have released version 0.2.0 of the GraphFrames package. Apart from a few
bug fixes, it is the first release published for Spark 2.0 and both Scala
2.10 and 2.11. Please let us know if you have any comment or questions.

It is available as a Spark package:
https://spark-packages.org/package/graphframes/graphframes

The source code is available as always at
https://github.com/graphframes/graphframes


What is GraphFrames?

GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
algorithms available in GraphX, users can write highly expressive queries
by leveraging the DataFrame API, combined with a new API for motif finding.
The user also benefits from DataFrame performance optimizations within the
Spark SQL engine.
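
As a small taste of the API (a sketch; it assumes the package above is on the
classpath, e.g. added with --packages):

import org.graphframes.GraphFrame

val vertices = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol"))).toDF("id", "name")
val edges = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "a", "follows"), ("b", "c", "follows"))).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)
g.inDegrees.show()                              // plain DataFrame output
g.find("(x)-[e]->(y); (y)-[e2]->(x)").show()    // motif finding: pairs that follow each other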

Cheers

Tim


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
Can you take a look at commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf ?

There was a test:
SPARK-15285 Generated SpecificSafeProjection.apply method grows beyond 64KB

See if it matches your use case.

On Tue, Aug 16, 2016 at 8:41 AM, Aris  wrote:

> I am still working on making a minimal test that I can share without my
> work-specific code being in there. However, the problem occurs with a
> dataframe with several hundred columns being asked to do a random split.
> Random split works with up to about 350 columns so far. It breaks in my
> code with 600 columns, but it's a converted dataset of case classes to
> dataframe. This is deterministically causing the error in Scala 2.11.
>
> Once I can get a deterministically breaking test without work code I will
> try to file a Jira bug.
>
> On Tue, Aug 16, 2016, 04:17 Ted Yu  wrote:
>
>> I think we should reopen it.
>>
>> On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki 
>> wrote:
>>
>> I just realized it since it broken a build with Scala 2.10.
>> https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203a
>> e1a2e9a1cf
>>
>> I can reproduce the problem in SPARK-15285 with master branch.
>> Should we reopen SPARK-15285?
>>
>> Best Regards,
>> Kazuaki Ishizaki,
>>
>>
>>
>> From:Ted Yu 
>> To:dhruve ashar 
>> Cc:Aris , "user@spark.apache.org" <
>> user@spark.apache.org>
>> Date:2016/08/15 06:19
>> Subject:Re: Spark 2.0.0 JaninoRuntimeException
>> --
>>
>>
>>
>> Looks like the proposed fix was reverted:
>>
>> Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply
>> method grows beyond 64 KB"
>>
>> This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.
>>
>> Maybe this was fixed in some other JIRA ?
>>
>> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar <*dhruveas...@gmail.com*
>> > wrote:
>> I see a similar issue being resolved recently:
>> *https://issues.apache.org/jira/browse/SPARK-15285*
>> 
>>
>> On Fri, Aug 12, 2016 at 3:33 PM, Aris <*arisofala...@gmail.com*
>> > wrote:
>> Hello folks,
>>
>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>> cryptic error messages:
>>
>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/
>> apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.
>> catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
>>
>> Unfortunately I'm not clear on how to even isolate the source of this
>> problem. I didn't have this problem in Spark 1.6.1.
>>
>> Any clues?
>>
>>
>>
>> --
>> -Dhruve Ashar
>>
>>
>>
>>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Aris
I am still working on making a minimal test that I can share without my
work-specific code being in there. However, the problem occurs with a
dataframe with several hundred columns being asked to do a random split.
Random split works with up to about 350 columns so far. It breaks in my
code with 600 columns, but it's a converted dataset of case classes to
dataframe. This is deterministically causing the error in Scala 2.11.

Once I can get a deterministically breaking test without work code I will
try to file a Jira bug.
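
For anyone who wants to poke at it in the meantime, a rough sketch of the shape
that triggers it for me (illustrative only, not the actual work code, and not yet
verified as a standalone repro):

import org.apache.spark.sql.functions.lit

val base = spark.range(1000).toDF("c0")
// Widen the frame to a few hundred columns.
val wide = (1 until 600).foldLeft(base)((df, i) => df.withColumn(s"c$i", lit(i.toDouble)))
val Array(a, b) = wide.randomSplit(Array(0.9, 0.1))
a.count()  // randomSplit sorts on all columns, so the generated ordering covers all 600 of them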

On Tue, Aug 16, 2016, 04:17 Ted Yu  wrote:

> I think we should reopen it.
>
> On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki  wrote:
>
> I just realized it since it broke a build with Scala 2.10.
>
> https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf
>
> I can reproduce the problem in SPARK-15285 with master branch.
> Should we reopen SPARK-15285?
>
> Best Regards,
> Kazuaki Ishizaki,
>
>
>
> From:Ted Yu 
> To:dhruve ashar 
> Cc:Aris , "user@spark.apache.org" <
> user@spark.apache.org>
> Date:2016/08/15 06:19
> Subject:Re: Spark 2.0.0 JaninoRuntimeException
> --
>
>
>
> Looks like the proposed fix was reverted:
>
> Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply
> method grows beyond 64 KB"
>
> This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.
>
> Maybe this was fixed in some other JIRA ?
>
> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar <*dhruveas...@gmail.com*
> > wrote:
> I see a similar issue being resolved recently:
> *https://issues.apache.org/jira/browse/SPARK-15285*
> 
>
> On Fri, Aug 12, 2016 at 3:33 PM, Aris <*arisofala...@gmail.com*
> > wrote:
> Hello folks,
>
> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
> smaller data unit tests work on my laptop, when I'm on a cluster, I get
> cryptic error messages:
>
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
> grows beyond 64 KB
>
> Unfortunately I'm not clear on how to even isolate the source of this
> problem. I didn't have this problem in Spark 1.6.1.
>
> Any clues?
>
>
>
> --
> -Dhruve Ashar
>
>
>
>


Re: Spark Executor Metrics

2016-08-16 Thread Muhammad Haris
Still waiting for a response; any clues or suggestions?


On Tue, Aug 16, 2016 at 4:48 PM, Muhammad Haris <
muhammad.haris.makh...@gmail.com> wrote:

> Hi,
> I have been trying to collect driver, master, worker and executor metrics
> using Spark 2.0 in standalone mode, here is what my metrics configuration
> file looks like:
>
> *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
> *.sink.csv.period=1
> *.sink.csv.unit=seconds
> *.sink.csv.directory=/root/metrics/
> executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
>
> Once the application is complete, I can only see the driver's metrics; I have
> checked the directories on all the worker nodes as well.
> Could anybody please point out what I am doing wrong here?
>
>
>
> Regards
>
>
>


Re: submitting spark job with kerberized Hadoop issue

2016-08-16 Thread Aneela Saleem
Thanks Steve,

I went through this but am still not able to fix the issue.
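
For debugging, this is a quick check that can be run from the driver to see what
the process thinks its Hadoop login is (just a sketch):

import org.apache.hadoop.security.UserGroupInformation

// Prints the current Hadoop login and whether Kerberos security is enabled.
val ugi = UserGroupInformation.getCurrentUser
println("security enabled:         " + UserGroupInformation.isSecurityEnabled)
println("current user:             " + ugi)
println("has Kerberos credentials: " + ugi.hasKerberosCredentials)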

On Mon, Aug 15, 2016 at 2:01 AM, Steve Loughran 
wrote:

> Hi,
>
> Just came across this while going through all emails I'd left unread over
> my vacation.
>
> did you manage to fix this?
>
> 1. There's some notes I've taken on this topic:
> https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
>
>  -look at "Error messages to fear" to see if this one has surfaced;
> otherwise look at "low level secrets" to see how to start debugging things
>
>
> On 5 Aug 2016, at 14:54, Aneela Saleem  wrote:
>
> Hi all,
>
> I'm trying to connect to Kerberized Hadoop cluster using spark job. I have
> kinit'd from command line. When i run the following job i.e.,
>
> *./bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab --principal
> spark/hadoop-master@platalyticsrealm --class
> com.platalytics.example.spark.App --master spark://hadoop-master:7077
> /home/vm6/project-1-jar-with-dependencies.jar
> hdfs://hadoop-master:8020/text*
>
> I get the error:
>
> Caused by: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException:
> Client cannot authenticate via:[TOKEN, KERBEROS]
> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1628)
>
> Following are the contents of *spark-defaults.conf* file:
>
> spark.master spark://hadoop-master:7077
> spark.eventLog.enabled   true
> spark.eventLog.dir   hdfs://hadoop-master:8020/spark/logs
> spark.serializer org.apache.spark.serializer.
> KryoSerializer
> spark.yarn.access.namenodes hdfs://hadoop-master:8020/
> spark.yarn.security.tokens.hbase.enabled true
> spark.yarn.security.tokens.hive.enabled true
> spark.yarn.principal yarn/hadoop-master@platalyticsrealm
> spark.yarn.keytab /etc/hadoop/conf/yarn.keytab
>
>
> Also i have added following in *spark-env.sh* file:
>
> HOSTNAME=`hostname -f`
> export SPARK_HISTORY_OPTS="-Dspark.history.kerberos.enabled=true
> -Dspark.history.kerberos.principal=spark/${HOSTNAME}@platalyticsrealm
> -Dspark.history.kerberos.keytab=/etc/hadoop/conf/spark.keytab"
>
>
> Please guide me, how to trace the issue?
>
> Thanks
>
>
>


Re: long lineage

2016-08-16 Thread Ted Yu
Have you tried periodic checkpoints?

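Something along these lines, as a minimal sketch (directory, sizes and iteration
counts are made up):

// Truncate the lineage every N iterations so the DAG never gets deep enough
// to overflow the stack.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var rdd = sc.parallelize(1 to 1000000)
for (i <- 1 to 300) {
  rdd = rdd.map(_ + 1)
  if (i % 30 == 0) {
    rdd.persist()     // avoid recomputing the chain when the checkpoint materializes
    rdd.checkpoint()  // marks the RDD for checkpointing
    rdd.count()       // an action forces the checkpoint and drops the old lineage
  }
}
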
Cheers

> On Aug 16, 2016, at 5:50 AM, pseudo oduesp  wrote:
> 
> Hi ,
> how can we deal with a StackOverflowError triggered by a long lineage?
> I mean, I have this error; how can I resolve it without creating a new session?
> thanks 
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



long lineage

2016-08-16 Thread pseudo oduesp
Hi ,
how can we deal with a StackOverflowError triggered by a long lineage?
I mean, I have this error; how can I resolve it without creating a new session?
thanks


Spark Executor Metrics

2016-08-16 Thread Muhammad Haris
Hi,
I have been trying to collect driver, master, worker and executor metrics
using Spark 2.0 in standalone mode, here is what my metrics configuration
file looks like:

*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
*.sink.csv.directory=/root/metrics/
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource

Once the application is complete, I can only see the driver's metrics; I have
checked the directories on all the worker nodes as well.
Could anybody please point out what I am doing wrong here?
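
For reference, a minimal sketch of pointing the application at the metrics file
explicitly (the path is an assumption and must exist on every node); my
understanding is that the standalone master/worker daemons only read
metrics.properties from their own conf directory, so this covers just the driver
and executors:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("metrics-test")
  .set("spark.metrics.conf", "/root/metrics.properties")  // illustrative path
val sc = new SparkContext(conf)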



Regards


Re: Data frame Performance

2016-08-16 Thread Mich Talebzadeh
Hi Selvam,

Is the table called sel?

And are these assumptions correct?

site -> ColA
requests -> ColB

I don't think you are using ColC here, correct?

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 August 2016 at 12:06, Selvam Raman  wrote:

> Hi All,
>
> Please suggest the best approach to achieve this result. [Please comment on
> whether the existing logic is fine or not.]
>
> Input Record :
>
> ColA ColB ColC
> 1 2 56
> 1 2 46
> 1 3 45
> 1 5 34
> 1 5 90
> 2 1 89
> 2 5 45
> ​
> Expected Result
>
> ResA  ResB
> 1     2:2|3:3|5:5
> 2     1:1|5:5
>
> I followd the below Spark steps
>
> (Spark version - 1.5.0)
>
> def valsplit(elem :scala.collection.mutable.WrappedArray[String]) :
> String =
> {
>
> elem.map(e => e+":"+e).mkString("|")
> }
>
> sqlContext.udf.register("valudf",valsplit(_:scala.collection.mutable.
> WrappedArray[String]))
>
>
> val x =sqlContext.sql("select site,valudf(collect_set(requests)) as test
> from sel group by site").first
>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: class not found exception Logging while running JavaKMeansExample

2016-08-16 Thread Ted Yu
The class is:
core/src/main/scala/org/apache/spark/internal/Logging.scala

So it is in spark-core.
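
In practice a NoClassDefFoundError on that class usually means a pre-2.0
spark-core is on the classpath next to 2.0 artifacts, since
org.apache.spark.internal.Logging only exists from 2.0 on. A build sketch that
keeps the modules aligned (sbt shown purely as a sketch; the same idea applies
to the pom):

// build.sbt sketch -- all Spark modules pinned to the same 2.0.0 version.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0",
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0"
)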

On Tue, Aug 16, 2016 at 2:33 AM, subash basnet  wrote:

> Hello Yuzhihong,
>
> I didn't understand how to implement what you said in JavaKMeansExample.java,
> as I get the Logging exception while creating the SparkSession:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/internal/Logging
> at com.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.
> main(JavaKMeansExample.java*:43*)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.
> Logging
>
> The exception occurs at the *builder()*:
>
> 42SparkSession spark = SparkSession
> *43 .builder()*
> 44  .appName("JavaKMeansExample")
> 45 .getOrCreate();
>
> I have added all the necessary log4j and sl4j dependencies in pom. I am
> still not getting what dependencies I am missing.
>
> Best Regards,
> Subash Basnet
>
> On Mon, Aug 15, 2016 at 6:50 PM, Ted Yu  wrote:
>
>> Logging has become private in 2.0 release:
>>
>> private[spark] trait Logging {
>>
>> On Mon, Aug 15, 2016 at 9:48 AM, subash basnet 
>> wrote:
>>
>>> Hello all,
>>>
>>> I am trying to run JavaKMeansExample of the spark example project. I am
>>> getting the classnotfound exception error:
>>> *Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/internal/Logging*
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>> at jcom.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.main(Ja
>>> vaKMeansExample.java:43)
>>> *Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.internal.Logging*
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> I have added all the logging related dependencies as below:
>>> org.slf4j:slf4j-api:${slf4j.version}
>>> org.slf4j:slf4j-log4j12:${slf4j.version} (scope: ${hadoop.deps.scope})
>>> org.slf4j:jul-to-slf4j:${slf4j.version}
>>> org.slf4j:jcl-over-slf4j:${slf4j.version}
>>> log4j:log4j:${log4j.version}
>>> commons-logging:commons-logging:1.2
>>>
>>> What dependencies could I be missing, any idea? Regards,
>>> Subash Basnet
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>
>>
>


Unsubscribe

2016-08-16 Thread Martin Serrano

Sent from my Verizon Wireless 4G LTE DROID


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
I think we should reopen it. 

> On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki  wrote:
> 
> I just realized it since it broke a build with Scala 2.10.
> https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf
> 
> I can reproduce the problem in SPARK-15285 with master branch.
> Should we reopen SPARK-15285?
> 
> Best Regards,
> Kazuaki Ishizaki,
> 
> 
> 
> From:Ted Yu 
> To:dhruve ashar 
> Cc:Aris , "user@spark.apache.org" 
> 
> Date:2016/08/15 06:19
> Subject:Re: Spark 2.0.0 JaninoRuntimeException
> 
> 
> 
> Looks like the proposed fix was reverted:
> 
> Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method 
> grows beyond 64 KB"
> 
> This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.
> 
> Maybe this was fixed in some other JIRA ?
> 
> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar  wrote:
> I see a similar issue being resolved recently: 
> https://issues.apache.org/jira/browse/SPARK-15285
> 
> On Fri, Aug 12, 2016 at 3:33 PM, Aris  wrote:
> Hello folks,
> 
> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that smaller 
> data unit tests work on my laptop, when I'm on a cluster, I get cryptic error 
> messages:
> 
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> 
> Unfortunately I'm not clear on how to even isolate the source of this 
> problem. I didn't have this problem in Spark 1.6.1.
> 
> Any clues? 
> 
> 
> 
> -- 
> -Dhruve Ashar
> 
> 
> 


Data frame Performance

2016-08-16 Thread Selvam Raman
Hi All,

Please suggest the best approach to achieve this result. [Please comment on
whether the existing logic is fine or not.]

Input Record :

ColA ColB ColC
1 2 56
1 2 46
1 3 45
1 5 34
1 5 90
2 1 89
2 5 45
​
Expected Result

ResA  ResB
1     2:2|3:3|5:5
2     1:1|5:5

I followd the below Spark steps

(Spark version - 1.5.0)

def valsplit(elem :scala.collection.mutable.WrappedArray[String]) : String
=
{

elem.map(e => e+":"+e).mkString("|")
}

sqlContext.udf.register("valudf",valsplit(_:scala.collection.mutable.WrappedArray[String]))


val x =sqlContext.sql("select site,valudf(collect_set(requests)) as test
from sel group by site").first
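
An alternative sketch that uses only built-in Hive functions instead of the UDF
(untested on 1.5.0, where collect_set still needs a HiveContext):

// Build the "value:value" pairs inside SQL and join them with "|".
// Table and column names (sel, site, requests) are the same as above.
val alt = sqlContext.sql(
  """select site,
    |       concat_ws('|', sort_array(collect_set(
    |           concat(cast(requests as string), ':', cast(requests as string))))) as test
    |from sel
    |group by site""".stripMargin)
alt.show()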



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Sum array values by row in new column

2016-08-16 Thread ayan guha
Here is a more generic way of doing this:

from pyspark.sql import Row
df = sc.parallelize([[1,2,3,4],[10,20,30]]).map(lambda x:
Row(numbers=x)).toDF()
df.show()
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
u = udf(lambda c: sum(c), IntegerType())
df1 = df.withColumn("s",u(df.numbers))
df1.show()
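
If you want to avoid a Python UDF altogether, the same result can be had by
exploding and re-aggregating (a Scala sketch, assuming a 2.0 spark-shell; the
row id is only there to group back onto the original rows):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Seq(10, 20, 30), Seq(40, 50, 60), Seq(70, 80, 90)).toDF("numbers")
val withId = df.withColumn("rid", monotonically_increasing_id())

// Explode the array, sum per row id, then join the totals back on.
val totals = withId
  .select($"rid", explode($"numbers").as("n"))
  .groupBy("rid")
  .agg(sum("n").as("total"))

withId.join(totals, "rid").drop("rid").show()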

On Tue, Aug 16, 2016 at 11:50 AM, Mike Metzger 
wrote:

> Assuming you know the number of elements in the list, this should work:
>
> df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) +
> df["_1"].getItem(2))
>
> Mike
>
> On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey  wrote:
>
>> Hi everyone,
>>
>> I have a dataframe with one column; this column is an array of numbers.
>> How can I sum each array by row to obtain a new column with the sum, in PySpark?
>>
>> Example:
>>
>> +------------+
>> |     numbers|
>> +------------+
>> |[10, 20, 30]|
>> |[40, 50, 60]|
>> |[70, 80, 90]|
>> +------------+
>>
>> The idea is to obtain the same df with a new column with the totals:
>>
>> +------------+-----+
>> |     numbers|     |
>> +------------+-----+
>> |[10, 20, 30]|60   |
>> |[40, 50, 60]|150  |
>> |[70, 80, 90]|240  |
>> +------------+-----+
>>
>> Regards!
>>
>> Samir
>>
>>
>>
>>
>


-- 
Best Regards,
Ayan Guha


Re: Accessing HBase through Spark with Security enabled

2016-08-16 Thread Aneela Saleem
Thanks Steve,

I have gone through its documentation, but I did not get any idea of how to
install it. Can you help me?

On Mon, Aug 15, 2016 at 4:23 PM, Steve Loughran 
wrote:

>
> On 15 Aug 2016, at 08:29, Aneela Saleem  wrote:
>
> Thanks Jacek!
>
> I have already set the hbase.security.authentication property to
> kerberos, since HBase with Kerberos is working fine.
>
> I tested again after correcting the typo but got same error. Following is
> the code, Please have a look:
>
> System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
> System.setProperty("java.security.auth.login.config",
> "/etc/hbase/conf/zk-jaas.conf");
> val hconf = HBaseConfiguration.create()
> val tableName = "emp"
> hconf.set("hbase.zookeeper.quorum", "hadoop-master")
> hconf.set(TableInputFormat.INPUT_TABLE, tableName)
> hconf.set("hbase.zookeeper.property.clientPort", "2181")
> hconf.set("hbase.master", "hadoop-master:6")
> hconf.set("hadoop.security.authentication", "kerberos")
> hconf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
> hconf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
>
>
> spark should be automatically picking those up from the classpath; adding
> them to your  own hconf isn't going to have any effect on the hbase config
> used to extract the hbase token on Yarn app launch. That all needs to be
> set up at the time the Spark cluster/app is launched. If you are running
>
> There's a little diagnostics tool, kdiag, which will be in future Hadoop
> versions —It's available as a standalone JAR for others to use
>
> https://github.com/steveloughran/kdiag
>
> This may help verify things like your keytab/login details
>
>
> val conf = new SparkConf()
> conf.set("spark.yarn.security.tokens.hbase.enabled", "true")
> conf.set("spark.authenticate", "true")
> conf.set("spark.authenticate.secret","None")
> val sc = new SparkContext(conf)
> val hBaseRDD = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
> classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
> classOf[org.apache.hadoop.hbase.client.Result])
>
> val count = hBaseRDD.count()
> print("HBase RDD count:" + count)
>
>
>
>
> On Sat, Aug 13, 2016 at 8:36 PM, Jacek Laskowski  wrote:
>
>> Hi Aneela,
>>
>> My (little to no) understanding of how to make it work is to use
>> hbase.security.authentication property set to kerberos (see [1]).
>>
>>
> Nobody understands kerberos; you are not alone. And the more you
> understand of Kerberos, the less you want to.
>
> Spark on YARN uses it to get the tokens for Hive, HBase et al (see
>> [2]). It happens when Client starts conversation to YARN RM (see [3]).
>>
>> You should not do that yourself (and BTW you've got a typo in
>> spark.yarn.security.tokens.habse.enabled setting). I think that the
>> entire code you pasted matches the code Spark's doing itself before
>> requesting resources from YARN.
>>
>> Give it a shot and report back since I've never worked in such a
>> configuration and would love improving in this (security) area.
>> Thanks!
>>
>> [1] http://www.cloudera.com/documentation/enterprise/5-5-x/
>> topics/cdh_sg_hbase_authentication.html#concept_zyz_vg5_nt__
>> section_s1l_nwv_ls
>> [2] https://github.com/apache/spark/blob/master/yarn/src/main/
>> scala/org/apache/spark/deploy/yarn/security/HBaseCredentialP
>> rovider.scala#L58
>> [3] https://github.com/apache/spark/blob/master/yarn/src/main/
>> scala/org/apache/spark/deploy/yarn/Client.scala#L396
>>
>>
>
> [2] is the code from last week; SPARK-14743. The predecessor code was
> pretty similar though: make an RPC call to HBase to ask for an HBase
> delegation token to be handed off to the YARN app; it requires the use to
> be Kerberos authenticated first.
>
>
> Pozdrawiam,
>> Jacek Laskowski
>>
>> >> > 2016-08-07 20:43:57,617 WARN
>> >> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
>> ipc.RpcClientImpl:
>> >> > Exception encountered while connecting to the server :
>> >> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> >> > GSSException: No valid credentials provided (Mechanism level: Failed
>> to
>> >> > find
>> >> > any Kerberos tgt)]
>> >> > 2016-08-07 20:43:57,619 ERROR
>> >> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
>> ipc.RpcClientImpl:
>> >> > SASL
>> >> > authentication failed. The most likely cause is missing or invalid
>> >> > credentials. Consider 'kinit'.
>> >> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> >> > GSSException: No valid credentials provided (Mechanism level: Failed
>> to
>> >> > find
>> >> > any Kerberos tgt)]
>> >> >   at
>> >> >
>> >> > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChalleng
>> e(GssKrb5Client.java:212)
>> >> >   at
>> >> >
>> >> > org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConn
>> ect(HBaseSaslRpcClient.java:179)
>> >> >   at
>> >> >
>> >> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSa
>> 

Re: class not found exception Logging while running JavaKMeansExample

2016-08-16 Thread subash basnet
Hello Yuzhihong,

I didn't understand how to implement what you said in JavaKMeansExample.java,
as I get the Logging exception while creating the SparkSession:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/internal/Logging
at
com.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.main(JavaKMeansExample.java
*:43*)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.internal.Logging

The exception occurs at the *builder()*:

42SparkSession spark = SparkSession
*43 .builder()*
44  .appName("JavaKMeansExample")
45 .getOrCreate();

I have added all the necessary log4j and slf4j dependencies in the pom. I am
still not sure what dependencies I am missing.

Best Regards,
Subash Basnet

On Mon, Aug 15, 2016 at 6:50 PM, Ted Yu  wrote:

> Logging has become private in 2.0 release:
>
> private[spark] trait Logging {
>
> On Mon, Aug 15, 2016 at 9:48 AM, subash basnet  wrote:
>
>> Hello all,
>>
>> I am trying to run JavaKMeansExample of the spark example project. I am
>> getting the classnotfound exception error:
>> *Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/internal/Logging*
>> at java.lang.ClassLoader.defineClass1(Native Method)
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>> at jcom.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.main(
>> JavaKMeansExample.java:43)
>> *Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.internal.Logging*
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> I have added all the logging related dependencies as below:
>> org.slf4j:slf4j-api:${slf4j.version}
>> org.slf4j:slf4j-log4j12:${slf4j.version} (scope: ${hadoop.deps.scope})
>> org.slf4j:jul-to-slf4j:${slf4j.version}
>> org.slf4j:jcl-over-slf4j:${slf4j.version}
>> log4j:log4j:${log4j.version}
>> commons-logging:commons-logging:1.2
>>
>> What dependencies could I be missing, any idea? Regards,
>> Subash Basnet
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>


Re: Linear regression, weights constraint

2016-08-16 Thread Taras Lehinevych
Thank you a lot for the answer.

Have a nice day.

On Tue, Aug 16, 2016 at 10:55 AM Yanbo Liang  wrote:

> Spark MLlib does not support boxed constraints on model coefficients
> currently.
>
> Thanks
> Yanbo
>
> 2016-08-15 3:53 GMT-07:00 letaiv :
>
>> Hi all,
>>
> Is there any approach to add constraints on the weights in linear regression?
>> What I need is least squares regression with non-negative constraints on
>> the
>> coefficients/weights.
>>
>> Thanks in advance.
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Linear-regression-weights-constraint-tp27535.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR runs? Does
your program output the same model between different attempts?
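
You can also read this directly off the fitted model instead of the logs; a
sketch below (Scala shown; in recent releases the PySpark accessor
lrModel.summary.totalIterations is analogous, with lrModel being the model
fitted in your snippet):

// Sketch: inspect the training summary of a fitted LogisticRegressionModel.
val summary = lrModel.summary
println(s"iterations run:    ${summary.totalIterations}")
println(s"objective history: ${summary.objectiveHistory.mkString(", ")}")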

Thanks
Yanbo

2016-08-12 3:08 GMT-07:00 olivierjeunen :

> I'm using pyspark ML's logistic regression implementation to do some
> classification on an AWS EMR Yarn cluster.
>
> The cluster consists of 10 m3.xlarge nodes and is set up as follows:
> spark.driver.memory 10g, spark.driver.cores  3 , spark.executor.memory 10g,
> spark.executor-cores 4.
>
> I enabled yarn's dynamic allocation abilities.
>
> The problem is that my results are way unstable. Sometimes my application
> finishes using 13 executors total, sometimes all of them seem to die and
> the
> application ends up using anywhere between 100 and 200...
>
> Any insight on what could cause this stochastic behaviour would be greatly
> appreciated.
>
> The code used to run the logistic regression:
>
> data = spark.read.parquet(storage_path).repartition(80)
> lr = LogisticRegression()
> lr.setMaxIter(50)
> lr.setRegParam(0.063)
> evaluator = BinaryClassificationEvaluator()
> lrModel = lr.fit(data.filter(data.test == 0))
> predictions = lrModel.transform(data.filter(data.test == 1))
> auROC = evaluator.evaluate(predictions)
> print "auROC on test set: ", auROC
> Data is a dataframe of roughly 2.8GB
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-s-Logistic-Regression-runs-
> unstable-on-Yarn-cluster-tp27520.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread Sumit Khanna
This is just the stack trace, but where are you calling the UDF?

Regards,
Sumit

On 16-Aug-2016 2:20 pm, "pseudo oduesp"  wrote:

> hi,
> I create new columns with a UDF; then, when I try to filter on these columns,
> I get this error. Why?
>
> : java.lang.UnsupportedOperationException: Cannot evaluate expression:
> fun_nm(input[0, string, true])
> at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.
> eval(Expression.scala:221)
> at org.apache.spark.sql.execution.python.PythonUDF.
> eval(PythonUDF.scala:27)
> at org.apache.spark.sql.catalyst.expressions.BinaryExpression.
> eval(Expression.scala:408)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.
> org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$
> canFilterOutNull(Optimizer.scala:1234)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$
> anonfun$55.apply(Optimizer.scala:1248)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$
> anonfun$55.apply(Optimizer.scala:1248)
> at scala.collection.LinearSeqOptimized$class.
> exists(LinearSeqOptimized.scala:93)
> at scala.collection.immutable.List.exists(List.scala:84)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.
> org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$
> buildNewJoinType(Optimizer.scala:1248)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$
> anonfun$apply$30.applyOrElse(Optimizer.scala:1264)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$
> anonfun$apply$30.applyOrElse(Optimizer.scala:1262)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.
> apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.
> apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.
> withOrigin(TreeNode.scala:69)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
> TreeNode.scala:278)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.
> apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.
> mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.
> transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
> TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.
> apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.
> mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.
> transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
> TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.
> apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.
> mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.
> transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
> TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(
> TreeNode.scala:268)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.
> apply(Optimizer.scala:1262)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.
> apply(Optimizer.scala:1225)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$
> execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$
> execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at scala.collection.IndexedSeqOptimized$class.
> foldl(IndexedSeqOptimized.scala:57)
> at scala.collection.IndexedSeqOptimized$class.
> foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(
> WrappedArray.scala:35)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$
> execute$1.apply(RuleExecutor.scala:82)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$
> execute$1.apply(RuleExecutor.scala:74)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(
> RuleExecutor.scala:74)
> at 

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not support boxed constraints on model coefficients
currently.

Thanks
Yanbo

2016-08-15 3:53 GMT-07:00 letaiv :

> Hi all,
>
> Is there any approach to add constraints on the weights in linear regression?
> What I need is least squares regression with non-negative constraints on
> the
> coefficients/weights.
>
> Thanks in advance.
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Linear-regression-weights-constraint-tp27535.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread pseudo oduesp
hi,
I create new columns with a UDF; then, when I try to filter on these columns,
I get this error. Why?

: java.lang.UnsupportedOperationException: Cannot evaluate expression:
fun_nm(input[0, string, true])
at
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:221)
at
org.apache.spark.sql.execution.python.PythonUDF.eval(PythonUDF.scala:27)
at
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:408)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(Optimizer.scala:1234)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$55.apply(Optimizer.scala:1248)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$55.apply(Optimizer.scala:1248)
at
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(Optimizer.scala:1248)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$30.applyOrElse(Optimizer.scala:1264)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$30.applyOrElse(Optimizer.scala:1262)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.apply(Optimizer.scala:1262)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.apply(Optimizer.scala:1225)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Kazuaki Ishizaki
I just realized it since it broke a build with Scala 2.10.
https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf

I can reproduce the problem in SPARK-15285 with master branch.
Should we reopen SPARK-15285?

Best Regards,
Kazuaki Ishizaki,



From:   Ted Yu 
To: dhruve ashar 
Cc: Aris , "user@spark.apache.org" 

Date:   2016/08/15 06:19
Subject:Re: Spark 2.0.0 JaninoRuntimeException



Looks like the proposed fix was reverted:

Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply 
method grows beyond 64 KB"

This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.

Maybe this was fixed in some other JIRA ?

On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar  
wrote:
I see a similar issue being resolved recently: 
https://issues.apache.org/jira/browse/SPARK-15285

On Fri, Aug 12, 2016 at 3:33 PM, Aris  wrote:
Hello folks,

I'm on Spark 2.0.0 working with Datasets -- and despite the fact that 
smaller data unit tests work on my laptop, when I'm on a cluster, I get 
cryptic error messages:

Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 
of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB

Unfortunately I'm not clear on how to even isolate the source of this 
problem. I didn't have this problem in Spark 1.6.1.

Any clues? 



-- 
-Dhruve Ashar






MLIB and R results do not match for SVD

2016-08-16 Thread roni
Hi All,
Some time back I asked a question about PCA results not matching between R
and MLlib. It was suggested that I use svd.V instead of PCA to match the
uncentered PCA.
But the results of MLlib and R for SVD do not match. I can understand the
numbers not matching exactly, but the distribution of the data, when plotted,
does not match either.
Is there something I can do to fix this?
I mostly have tall-skinny matrix data with 60+ rows and up to 100+
columns.
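
For reference, this is roughly what I run on the Spark side (tiny made-up matrix
here; the real rows are read from files):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tiny stand-in for the tall-skinny data.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(3, computeU = false)

println(svd.s)  // singular values
println(svd.V)  // right singular vectors; this is what I compare with R's svd(x)$v
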
Any help is appreciated .
Thanks
_R