Re: SPARK Storagelevel issues

2017-07-27 Thread 周康
Maybe the StorageLevel in
testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER) should
change. And check your config "spark.memory.storageFraction", whose default
value is 0.5.
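
For example, a minimal sketch of what that could look like (only an
illustration; whether MEMORY_AND_DISK is the right level depends on your data,
and "tablename" is just the table from your mail):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

sparkSession = (SparkSession.builder
    .config("spark.rdd.compress", "true")
    # 0.5 is the default value of spark.memory.storageFraction
    .config("spark.memory.storageFraction", "0.5")
    .appName("test").enableHiveSupport().getOrCreate())

testdf = sparkSession.sql("select * from tablename")
# MEMORY_AND_DISK writes blocks that do not fit in storage memory to disk
# instead of keeping everything in memory
testdf.persist(StorageLevel.MEMORY_AND_DISK)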

2017-07-28 3:04 GMT+08:00 Gourav Sengupta :

> Hi,
>
> I cached a table in a large EMR cluster and it has a size of 62 MB.
> Therefore I know the size of the table while cached.
>
> But when I try to cache the table in a smaller cluster, which still has a
> total of 3 GB of driver memory and two executors with close to 2.5 GB of
> memory, the job keeps failing with JVM out-of-memory errors.
>
> Is there something that I am missing?
>
> CODE:
> =
> sparkSession = spark.builder \
>     .config("spark.rdd.compress", "true") \
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>     .config("spark.executor.extraJavaOptions",
>             "-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
>     .appName("test").enableHiveSupport().getOrCreate()
>
> testdf = sparkSession.sql("select * from tablename")
> testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
> =
>
> This causes a JVM out-of-memory error.
>
>
> Regards,
> Gourav Sengupta
>


Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-27 Thread Tathagata Das
For built-in SQL functions, it does not matter which language you use, as the
engine will execute the most optimized JVM code. However, in your case, you
are asking for foreach in Python. My interpretation was that you want to
specify your own Python function that processes the rows in Python. This is
different from the built-in functions, as the engine will have to invoke your
function inside a Python VM.
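
As a rough illustration of that difference (a sketch only, assuming a DataFrame
df with a string column named "name", and using a UDF as the simplest stand-in
for user Python code):

from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

# built-in function: the whole expression is planned and executed in the JVM
jvm_only = df.select(upper(df["name"]))

# user-defined Python function: each row has to be serialized, shipped to a
# Python worker process, processed there, and sent back
py_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
crosses_boundary = df.select(py_upper(df["name"]))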

On Wed, Jul 26, 2017 at 12:54 PM, ayan guha  wrote:

> Hi TD
>
> I thought structured streaming provides a similar concept to DataFrames,
> where it does not matter which language I use to invoke the APIs, with the
> exception of UDFs.
>
> So, when I think of supporting a foreach sink in Python, I think of it as
> just a wrapper API, with the data remaining in the JVM only. Similar to, for
> example, a Hive writer or HDFS writer in the DataFrame API.
>
> Am I oversimplifying? Or is it just early days for structured streaming?
> Happy to learn of any mistakes in my thinking and understanding.
>
> Best
> Ayan
>
> On Thu, 27 Jul 2017 at 4:49 am, Priyank Shrivastava <
> priy...@asperasoft.com> wrote:
>
>> Thanks TD.  I am going to try the Python-Scala hybrid approach, using Scala
>> only for the custom Redis sink and Python for the rest of the app.  I
>> understand it might not be as efficient as writing the app purely in Scala,
>> but unfortunately I am constrained on Scala resources.  Have you come
>> across other use cases where people have resorted to such a Python-Scala
>> hybrid approach?
>>
>> Regards,
>> Priyank
>>
>>
>>
>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Hello Priyank
>>>
>>> Writing something purely in Scala/Java would be the most efficient. Even
>>> if we expose Python APIs that allow writing custom sinks in pure Python, it
>>> won't be as efficient as a Scala/Java foreach, as the data would have to
>>> cross the JVM/PVM boundary, which has significant overheads. So Scala/Java
>>> foreach is always going to be the best option.
>>>
>>> TD
>>>
>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <
>>> priy...@asperasoft.com> wrote:
>>>
 I am trying to write key-value pairs to Redis through a DataStreamWriter
 object, using the PySpark structured streaming APIs. I am using Spark 2.2.

 Since the Foreach sink is not supported for Python, I am trying to find out
 some alternatives.

 One alternative is to write a separate Scala module just to push data into
 Redis using foreach; ForeachWriter is supported in Scala. But this doesn't
 seem like an efficient approach, and it adds deployment overhead because now
 I will have to support Scala in my app.

 Another approach is obviously to use Scala instead of Python, which is fine,
 but I want to make sure that I absolutely cannot use Python for this problem
 before I take that path.

 Would appreciate some feedback and alternative design approaches for
 this problem.

 Thanks.




>>>
>> --
> Best Regards,
> Ayan Guha
>


unsubscribe

2017-07-27 Thread Tao Lu
unsubscribe


Reply: Re: A tool to generate simulation data

2017-07-27 Thread luohui20001

Thank you Suzen, I had a try and generated 1 billion records within 1.5 min. It
is fast, and I will go on to try some other cases.



 

Thanks & best regards!
San.Luo

----- Original Message -----
From: "Suzen, Mehmet" 
To: luohui20...@sina.com
Cc: user 
Subject: Re: A tool to generate simulation data
Date: 2017-07-28 01:18


I suggest the RandomRDDs API. It provides nice tools. If you write
wrappers around it, that might be good.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


SPARK Storagelevel issues

2017-07-27 Thread Gourav Sengupta
Hi,

I cached a table in a large EMR cluster and it has a size of 62 MB.
Therefore I know the size of the table while cached.

But when I try to cache the table in a smaller cluster, which still has a
total of 3 GB of driver memory and two executors with close to 2.5 GB of
memory, the job keeps failing with JVM out-of-memory errors.

Is there something that I am missing?

CODE:
=
import pyspark.storagelevel
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder \
    .config("spark.rdd.compress", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
    .appName("test").enableHiveSupport().getOrCreate()

testdf = sparkSession.sql("select * from tablename")
testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
=

This causes a JVM out-of-memory error.


Regards,
Gourav Sengupta


Re: Spark2.1 installation issue

2017-07-27 Thread Vikash Kumar
Hi,

 I have posted to the Cloudera community also. But since it is a Spark 2
installation issue, I thought I might get some pointers here.

Thank you

On Thu, Jul 27, 2017, 11:29 PM Marcelo Vanzin  wrote:

> Hello,
>
> This is a CDH-specific issue, please use the Cloudera forums / support
> line instead of the Apache group.
>
> On Thu, Jul 27, 2017 at 10:54 AM, Vikash Kumar
>  wrote:
> > I have installed the spark2 parcel through Cloudera CDH 12.0. I see some
> > issue there. Looks like it didn't get configured properly.
> >
> > $ spark2-shell
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/apache/hadoop/fs/FSDataInputStream
> > at
> >
> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
> > at
> >
> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
> > at scala.Option.getOrElse(Option.scala:121)
> > at
> >
> org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:118)
> > at
> >
> > org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:104)
> > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> > Caused by: java.lang.ClassNotFoundException:
> > org.apache.hadoop.fs.FSDataInputStream
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> >
> >  I have Hadoop version:
> >
> > $ hadoop version
> > Hadoop 2.6.0-cdh5.12.0
> > Subversion http://github.com/cloudera/hadoop -r
> > dba647c5a8bc5e09b572d76a8d29481c78d1a0dd
> > Compiled by jenkins on 2017-06-29T11:31Z
> > Compiled with protoc 2.5.0
> > From source with checksum 7c45ae7a4592ce5af86bc4598c5b4
> > This command was run using
> >
> /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/jars/hadoop-common-2.6.0-cdh5.12.0.jar
> >
> > also ,
> >
> > $ ls /etc/spark/conf shows :
> >
> > classpath.txt  __cloudera_metadata__
> > navigator.lineage.client.properties  spark-env.sh
> > __cloudera_generation__  log4j.properties   spark-defaults.conf
> > yarn-conf
> >
> >
> > while, /etc/spark2/conf is empty .
> >
> >
> > How should I fix this ? Do I need to do any manual configuration ?
> >
> >
> >
> > Regards,
> > Vikash
>
>
>
> --
> Marcelo
>


Re: Spark2.1 installation issue

2017-07-27 Thread Marcelo Vanzin
Hello,

This is a CDH-specific issue, please use the Cloudera forums / support
line instead of the Apache group.

On Thu, Jul 27, 2017 at 10:54 AM, Vikash Kumar
 wrote:
> I have installed the spark2 parcel through Cloudera CDH 12.0. I see some
> issue there. Looks like it didn't get configured properly.
>
> $ spark2-shell
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/FSDataInputStream
> at
> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
> at
> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at
> org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:118)
> at
> org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:104)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.fs.FSDataInputStream
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
>  I have Hadoop version:
>
> $ hadoop version
> Hadoop 2.6.0-cdh5.12.0
> Subversion http://github.com/cloudera/hadoop -r
> dba647c5a8bc5e09b572d76a8d29481c78d1a0dd
> Compiled by jenkins on 2017-06-29T11:31Z
> Compiled with protoc 2.5.0
> From source with checksum 7c45ae7a4592ce5af86bc4598c5b4
> This command was run using
> /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/jars/hadoop-common-2.6.0-cdh5.12.0.jar
>
> also ,
>
> $ ls /etc/spark/conf shows :
>
> classpath.txt  __cloudera_metadata__
> navigator.lineage.client.properties  spark-env.sh
> __cloudera_generation__  log4j.properties   spark-defaults.conf
> yarn-conf
>
>
> while, /etc/spark2/conf is empty .
>
>
> How should I fix this ? Do I need to do any manual configuration ?
>
>
>
> Regards,
> Vikash



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark2.1 installation issue

2017-07-27 Thread Vikash Kumar
I have installed the spark2 parcel through Cloudera CDH 12.0. I see some issue
there. Looks like it didn't get configured properly.

$ spark2-shell
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/FSDataInputStream
at
org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
at
org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:118)
at
org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:104)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

 I have Hadoop version:

$ hadoop version
Hadoop 2.6.0-cdh5.12.0
Subversion http://github.com/cloudera/hadoop -r
dba647c5a8bc5e09b572d76a8d29481c78d1a0dd
Compiled by jenkins on 2017-06-29T11:31Z
Compiled with protoc 2.5.0
From source with checksum 7c45ae7a4592ce5af86bc4598c5b4
This command was run using
/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/jars/hadoop-common-2.6.0-cdh5.12.0.jar

also ,

$ ls /etc/spark/conf shows :

classpath.txt  __cloudera_metadata__
navigator.lineage.client.properties  spark-env.sh
__cloudera_generation__  log4j.properties
spark-defaults.conf  yarn-conf


while, /etc/spark2/conf is empty .


How should I fix this ? Do I need to do any manual configuration ?



Regards,
Vikash


Re: A tool to generate simulation data

2017-07-27 Thread Suzen, Mehmet
I suggest the RandomRDDs API. It provides nice tools. If you write
wrappers around it, that might be good.

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$
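
For instance, from PySpark (a small sketch; the sizes, partition counts, output
path, and sc, the SparkContext, are placeholders):

from pyspark.mllib.random import RandomRDDs

# one billion standard-normal values spread over 100 partitions
values = RandomRDDs.normalRDD(sc, size=10**9, numPartitions=100, seed=42)

# rows of 5 uniform features each, written out as simple CSV lines
vectors = RandomRDDs.uniformVectorRDD(sc, numRows=10**7, numCols=5,
                                      numPartitions=100, seed=42)
vectors.map(lambda v: ",".join(str(x) for x in v)).saveAsTextFile("/tmp/sim_csv")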

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark application compiled with 1.6 on spark 2.1 cluster

2017-07-27 Thread 周康
From Spark 2.x, the package of the Logging class has changed.

2017-07-27 23:45 GMT+08:00 Marcelo Vanzin :

> On Wed, Jul 26, 2017 at 10:45 PM, satishl  wrote:
> > is this a supported scenario - i.e., can I run an app compiled with Spark
> > 1.6 on a 2.+ Spark cluster?
>
> In general, no.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: running spark application compiled with 1.6 on spark 2.1 cluster

2017-07-27 Thread Marcelo Vanzin
On Wed, Jul 26, 2017 at 10:45 PM, satishl  wrote:
> is this a supported scenario - i.e., can I run an app compiled with Spark 1.6
> on a 2.+ Spark cluster?

In general, no.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



guava not compatible to hadoop version 2.6.5

2017-07-27 Thread Markus.Breuer
After upgrading from Apache Spark 2.1.1 to 2.2.0, our integration tests fail
with an exception:

java.lang.IllegalAccessError: tried to access method
com.google.common.base.Stopwatch.<init>()V from class
org.apache.hadoop.mapred.FileInputFormat
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:312)
at 
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)

It seems there is an incompatibility between the Hadoop artifacts and Guava;
more details are described here:

https://stackoverflow.com/questions/36427291/illegalaccesserror-to-guavas-stopwatch-from-org-apache-hadoop-mapreduce-lib-inp

Advising Maven to use more recent versions of the affected artifacts solved
this issue.

Questions: Is this a reasonable workaround for deployment into production? Or
should we stay on 2.1.1 until these dependencies are fixed in Spark? Is there
a ticket referencing this issue?





How does Spark handle timestamps during Pandas dataframe conversion

2017-07-27 Thread saatvikshah1994
I've summarized this question in detail in this StackOverflow question, with
code snippets and logs:
https://stackoverflow.com/questions/45308406/how-does-spark-handle-timestamp-types-during-pandas-dataframe-conversion/.
I'm looking for efficient solutions to this.
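
For reference, a minimal way to observe the conversion I am asking about (a
sketch only; spark is the usual SparkSession, and the full code and logs are
in the StackOverflow post):

import datetime
from pyspark.sql import Row

df = spark.createDataFrame([Row(ts=datetime.datetime(2017, 7, 27, 12, 0, 0))])
df.printSchema()    # the ts column is inferred as a Spark timestamp type

pdf = df.toPandas()
print(pdf.dtypes)   # shows which pandas dtype the timestamp column maps to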



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-handle-timestamps-during-Pandas-dataframe-conversion-tp29004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Please unsubscribe

2017-07-27 Thread babita Sancheti


Sent from my iPhone


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



A tool to generate simulation data

2017-07-27 Thread luohui20001
Hello guys,

Is there a tool or an open source project that can mock a large amount of data
quickly and support the below:
1. transaction data
2. time series data
3. data in a specified format, like CSV files or JSON files
4. data generated at a changing speed
5. distributed data generation



 

Thanks & best regards!
San.Luo


Re: Complex types projection handling with Spark 2 SQL and Parquet

2017-07-27 Thread Patrick
Hi,

I am having the same issue. Has anyone found a solution to this?

When I convert the nested JSON to Parquet, I don't see the projection working
correctly. It still reads all of the nested structure columns, even though
Parquet does support nested column projection.

Does Spark 2 SQL provide column projection for nested fields? Does predicate
push-down work for nested columns?

How can we optimize things in this case? Am I using the wrong API?
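
For what it's worth, the only way I have found to reach fields nested inside
arrays of structs is to explode the arrays first (a sketch against the
trip_legs example quoted below; spark is the usual SparkSession, the path is a
placeholder, and as far as I can tell this does not make Parquet prune the
unread leaf columns):

from pyspark.sql.functions import col, explode

df = spark.read.parquet("/path/to/trips")

# trip_legs is a top-level column, so it can be selected directly
legs = df.select(explode(col("trip_legs")).alias("leg"))

# after the explode, the struct's fields (and nested arrays of structs)
# can be reached with ordinary column expressions
taxes = legs.select(col("leg").getField("from").alias("leg_from"),
                    explode(col("leg").getField("taxes")).alias("tax"))
taxes.select("leg_from", "tax.type", "tax.amount").show()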

Thanks in advance.




On Sun, Jan 29, 2017 at 12:39 AM, Antoine HOM  wrote:

> Hello everybody,
>
> I have been trying to use complex types (stored in parquet) with spark
> SQL and ended up having an issue that I can't seem to be able to solve
> cleanly.
> I was hoping, through this mail, to get some insights from the
> community, maybe I'm just missing something obvious in the way I'm
> using spark :)
>
> It seems that spark only push down projections for columns at the root
> level of the records.
> This is a big I/O issue depending on how much you use complex types;
> in my samples I ended up reading 100 GB of data when using only a
> single field out of a struct (it should most likely have read only
> 1 GB).
>
> I already saw this PR which sounds promising:
> https://github.com/apache/spark/pull/16578
>
> However, it seems that it won't be applicable if you have multiple
> array nesting levels; the main reason is that I can't seem to find how
> to reference fields deeply nested in arrays in a Column expression.
> I can do everything within lambdas but I think the optimizer won't
> drill into it to understand that I'm only accessing a few fields.
>
> If I take the following (simplified) example:
>
> {
>   trip_legs: [{
>     from: "LHR",
>     to: "NYC",
>     taxes: [{
>       type: "gov",
>       amount: 12,
>       currency: "USD"
>     }]
>   }]
> }
>
> col(trip_legs.from) will return an Array of all the from fields for
> each trip_leg object.
> col(trip_legs.taxes.type) will throw an exception.
>
> So my questions are:
>   * Is there a way to reference these deeply nested fields without
> having to specify an array index with a Column expression?
>   * If not, is there an API to force the projection of a given set of
> fields so that parquet only read this specific set of columns?
>
> In addition, regarding the handling of arrays of struct within spark sql:
>   * Has it been discussed to have a way to "reshape" an array of
> structs without using lambdas? (Like the $map/$filter/etc.. operators
> of mongodb for example)
>   * If not, I'm willing to dedicate time to code these features. Could
> someone familiar with the code base tell me how disruptive this would
> be? And whether this would be a welcome change or not? (Most likely
> more appropriate for the dev mailing list, though.)
>
> Regards,
> Antoine
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>