Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
I tried the options below.

1) Increased executor memory, up to the maximum possible (14GB).  Same error.
2) Tried the new version - spark-xml_2.10:0.4.1.  Same error.
3) Tried a lower-level rowTag.  That worked and returned 16,000 rows.

Are there any other workarounds for this issue?  I tried playing with
spark.memory.fraction and spark.memory.storageFraction, but it did not help.
Appreciate your help on this!
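
For reference, the lower-level rowTag workaround looks roughly like the sketch
below (run in the pyspark shell; 'GGLOffer' and the selected field name are
placeholders - the point is to pick a tag that repeats many times in the file,
so no single record approaches the ~2GB array limit that XmlRecordReader's
byte buffer runs into in the stack trace):

  df = (sqlContext.read.format('com.databricks.spark.xml')
        .options(rowTag='GGLOffer')        # hypothetical lower-level tag
        .load('GGL_1.2G.xml')
        .select('OfferSetIdentifier'))     # prune early to the fields needed (name illustrative)
  df.count()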



On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:

> Thanks for the quick response.
>
> It's a single XML file and I am using a top-level rowTag.  So, it creates
> only one row in a DataFrame with 5 columns. One of these columns will
> contain most of the data as a StructType.  Is there a limit on how much
> data can be stored in a single cell of a DataFrame?
>
> I will check with the new version, try different rowTags, and
> increase executor-memory tomorrow. I will open a new issue as well.
>
>
>
> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Arun,
>>
>>
>> I have a few questions.
>>
>> Does your XML file have a few huge documents? If a single row is very
>> large (like 500MB), it would consume a lot of memory, because, if I
>> remember correctly, at least one whole row has to be held in memory while
>> iterating. I remember this happened to me before while processing a huge
>> record for test purposes.
>>
>>
>> How about trying to increase --executor-memory?
>>
>>
>> Also, if you don't mind, could you try selecting only a few fields with
>> the latest version to prune the data, just to be doubly sure?
>>
>>
>> Lastly, do you mind opening an issue at
>> https://github.com/databricks/spark-xml/issues if you still face this
>> problem?
>>
>> I will do my best to take a look.
>>
>>
>> Thank you.
>>
>>
>> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>>
>>> I am trying to read an XML file which is 1GB in size.  I am getting an
>>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>>> throws 'java.lang.OutOfMemoryError: Java heap space' error after
>>> reading 3 partitions.
>>>
>>> Any suggestion?
>>>
>>> PySpark Shell Command:   pyspark --master local[4] --driver-memory 3G
>>> --jars /tmp/spark-xml_2.10-0.3.3.jar
>>>
>>>
>>>
>>> Dataframe Creation Command:   df = sqlContext.read.format(
>>> 'com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>
>>>
>>>
>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>>> (TID 1) in 25978 ms on localhost (1/10)
>>>
>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>>
>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>>
>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>>> (TID 2) in 51001 ms on localhost (2/10)
>>>
>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>>
>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>>
>>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
>>> (TID 3) in 24336 ms on localhost (3/10)
>>>
>>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>>
>>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
>>> (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:28:40 INFO Executor: Running task 

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
Thanks for the quick response.

It's a single XML file and I am using a top-level rowTag.  So, it creates
only one row in a DataFrame with 5 columns. One of these columns will
contain most of the data as a StructType.  Is there a limit on how much data
can be stored in a single cell of a DataFrame?

I will check with the new version, try different rowTags, and increase
executor-memory tomorrow. I will open a new issue as well.



On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Arun,
>
>
> I have a few questions.
>
> Does your XML file have a few huge documents? If a single row is very
> large (like 500MB), it would consume a lot of memory, because, if I
> remember correctly, at least one whole row has to be held in memory while
> iterating. I remember this happened to me before while processing a huge
> record for test purposes.
>
>
> How about trying to increase --executor-memory?
>
>
> Also, if you don't mind, could you try selecting only a few fields with
> the latest version to prune the data, just to be doubly sure?
>
>
> Lastly, do you mind opening an issue at
> https://github.com/databricks/spark-xml/issues if you still face this problem?
>
> I will do my best to take a look.
>
>
> Thank you.
>
>
> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>
>> I am trying to read an XML file which is 1GB in size.  I am getting an
>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>> throws 'java.lang.OutOfMemoryError: Java heap space' error after reading
>> 3 partitions.
>>
>> Any suggestion?
>>
>> PySpark Shell Command:   pyspark --master local[4] --driver-memory 3G
>> --jars /tmp/spark-xml_2.10-0.3.3.jar
>>
>>
>>
>> Dataframe Creation Command:   df = sqlContext.read.format(
>> 'com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>
>>
>>
>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>> (TID 1) in 25978 ms on localhost (1/10)
>>
>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>
>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>
>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>
>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>> (TID 2) in 51001 ms on localhost (2/10)
>>
>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>
>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>
>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>
>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
>> (TID 3) in 24336 ms on localhost (3/10)
>>
>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>
>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
>> (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>
>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>
>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0
>> (TID 4) in 20895 ms on localhost (4/10)
>>
>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>
>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0
>> (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>
>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>
>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0
>> (TID 5) in 20793 ms on localhost (5/10)
>>
>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728

Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
I am trying to read an XML file which is 1GB in size.  I am getting an
error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
after reading 7 partitions in local mode.  In Yarn mode, it
throws 'java.lang.OutOfMemoryError: Java heap space' error after reading 3
partitions.

Any suggestion?

PySpark Shell Command:   pyspark --master local[4] --driver-memory 3G
--jars /tmp/spark-xml_2.10-0.3.3.jar



Dataframe Creation Command:   df = sqlContext.read.format(
'com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')



16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
1) in 25978 ms on localhost (1/10)

16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728

16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
2309 bytes result sent to driver

16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
3, localhost, partition 3,ANY, 2266 bytes)

16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)

16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID
2) in 51001 ms on localhost (2/10)

16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728

16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
2309 bytes result sent to driver

16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID
4, localhost, partition 4,ANY, 2266 bytes)

16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)

16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID
3) in 24336 ms on localhost (3/10)

16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728

16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
2309 bytes result sent to driver

16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID
5, localhost, partition 5,ANY, 2266 bytes)

16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)

16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID
4) in 20895 ms on localhost (4/10)

16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728

16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
2309 bytes result sent to driver

16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID
6, localhost, partition 6,ANY, 2266 bytes)

16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)

16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID
5) in 20793 ms on localhost (5/10)

16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728

16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6).
2309 bytes result sent to driver

16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID
7, localhost, partition 7,ANY, 2266 bytes)

16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)

16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID
6) in 21306 ms on localhost (6/10)

16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728

16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
2309 bytes result sent to driver

16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID
8, localhost, partition 8,ANY, 2266 bytes)

16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)

16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID
7) in 21130 ms on localhost (7/10)

16/11/15 18:29:43 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728

16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

at java.util.Arrays.copyOf(Arrays.java:2271)

at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.
java:113)

at java.io.ByteArrayOutputStream.ensureCapacity(
ByteArrayOutputStream.java:93)

at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.
java:122)

at java.io.DataOutputStream.write(DataOutputStream.java:88)

at com.databricks.spark.xml.XmlRecordReader.
readUntilMatch(XmlInputFormat.scala:188)

at com.databricks.spark.xml.XmlRecordReader.next(
XmlInputFormat.scala:156)

at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(
XmlInputFormat.scala:141)

at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(
NewHadoopRDD.scala:168)

at org.apache.spark.InterruptibleIterator.hasNext(
InterruptibleIterator.scala:39)

at 

Spark XML ignore namespaces

2016-11-03 Thread Arun Patel
I see that the 'ignoring namespaces' issue is resolved.

https://github.com/databricks/spark-xml/pull/75

How do we enable this option and ignore namespace prefixes?

- Arun


Re: Check if a nested column exists in DataFrame

2016-09-13 Thread Arun Patel
Is there a way to check whether a nested column exists from the schema in PySpark?

http://stackoverflow.com/questions/37471346/automatically-and-elegantly-flatten-dataframe-in-spark-sql
shows how to get the list of nested columns in Scala.  But, can this be
done in PySpark?

Please help.
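
A rough sketch of the kind of check I have in mind, assuming PySpark's
StructType/StructField API (walking df.schema for a dotted path; 'df' is the
already-exploded DataFrame):

  from pyspark.sql.types import StructType

  def has_nested_column(schema, dotted_path):
      """Return True if e.g. 'emp.mgr.col' exists in the given schema."""
      current = schema
      for part in dotted_path.split('.'):
          if not isinstance(current, StructType):
              return False
          matches = [f for f in current.fields if f.name == part]
          if not matches:
              return False
          current = matches[0].dataType
      return True

  has_nested_column(df.schema, "emp.mgr.col")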

On Mon, Sep 12, 2016 at 5:28 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:

> I'm trying to analyze XML documents using spark-xml package.  Since all
> XML columns are optional, some columns may or may not exist. When I
> register the DataFrame as a table, how do I check whether a nested column
> exists? My column name is "emp", which is already exploded, and I am
> trying to check if the nested column "emp.mgr.col" exists or not.  If it
> exists, I need to use it.  If it does not exist, I should set it to null.
> Is there a way to achieve this?
>
> Please note I am not able to use the .columns method because it does not show
> the nested columns.
>
> Also, note that I cannot manually specify the schema because of my
> requirements.
>
> I'm trying this in Pyspark.
>
> Thank you.
>


Check if a nested column exists in DataFrame

2016-09-12 Thread Arun Patel
I'm trying to analyze XML documents using spark-xml package.  Since all XML
columns are optional, some columns may or may not exist. When I register
the DataFrame as a table, how do I check whether a nested column exists?
My column name is "emp", which is already exploded, and I am trying to
check if the nested column "emp.mgr.col" exists or not.  If it exists, I
need to use it.  If it does not exist, I should set it to null.  Is there a
way to achieve this?

Please note I am not able to use the .columns method because it does not show
the nested columns.

Also, note that I cannot manually specify the schema because of my
requirements.

I'm trying this in Pyspark.

Thank you.


Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-09 Thread Arun Patel
Thank you Yong.  I just looked at it.

There was a pull request (#73
<https://github.com/databricks/spark-avro/pull/73>) as well.  Anything
wrong with that fix?  Can I use a similar fix?
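
In case it helps, a small diagnostic sketch (not part of spark-xml or
spark-avro; based only on the error message) that lists struct-typed field
names occurring more than once in a schema - the names the Avro SchemaBuilder
appears to refuse to redefine:

  from collections import Counter
  from pyspark.sql.types import ArrayType, StructType

  def record_names(dt, names=None):
      # Collect names of fields whose type is a struct or an array of structs.
      names = [] if names is None else names
      if isinstance(dt, ArrayType):
          record_names(dt.elementType, names)
      elif isinstance(dt, StructType):
          for f in dt.fields:
              if isinstance(f.dataType, (StructType, ArrayType)):
                  names.append(f.name)
              record_names(f.dataType, names)
      return names

  duplicated = [n for n, c in Counter(record_names(df.schema)).items() if c > 1]
  print(duplicated)   # for the schema below this should include 'Rules'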



On Thu, Sep 8, 2016 at 8:53 PM, Yong Zhang <java8...@hotmail.com> wrote:

> Did you take a look at this ->
> https://github.com/databricks/spark-avro/issues/54
>
>
>
> Yong
>
>
>
> --
> *From:* Arun Patel <arunp.bigd...@gmail.com>
> *Sent:* Thursday, September 8, 2016 5:31 PM
> *To:* user
> *Subject:* spark-xml to avro - SchemaParseException: Can't redefine
>
> I'm trying to convert XML to Avro, but I am getting a SchemaParseException
> for 'Rules', which exists in two separate containers.  Any
> thoughts?
>
> XML is attached.
>
>  df = sqlContext.read.format('com.databricks.spark.xml').
> options(rowTag='GGLResponse',attributePrefix='').load('GGL.xml')
>  df.show()
>  +----------------+--------------------+---+----------------+
>  | ResponseDataset|      ResponseHeader|ns2|           xmlns|
>  +----------------+--------------------+---+----------------+
>  |[1,1],[SD2000...|[2016-07-26T16:28...|GGL|http://www..c...|
>  +----------------+--------------------+---+----------------+
>
>  >>> df.printSchema()
>  root
>  |-- ResponseDataset: struct (nullable = true)
>  |    |-- ResponseFileGGL: struct (nullable = true)
>  |    |    |-- OfferSets: struct (nullable = true)
>  |    |    |    |-- OfferSet: struct (nullable = true)
>  |    |    |    |    |-- OfferSetHeader: struct (nullable = true)
>  |    |    |    |    |    |-- OfferSetIdentifier: long (nullable = true)
>  |    |    |    |    |    |-- TotalOffersProcessed: long (nullable = true)
>  |    |    |    |    |-- Offers: struct (nullable = true)
>  |    |    |    |    |    |-- Identifier: string (nullable = true)
>  |    |    |    |    |    |-- Offer: struct (nullable = true)
>  |    |    |    |    |    |    |-- Rules: struct (nullable = true)
>  |    |    |    |    |    |    |    |-- Rule: array (nullable = true)
>  |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |    |    |    |    |-- BorrowerIdentifier: long (nullable = true)
>  |    |    |    |    |    |    |    |    |    |-- RuleIdentifier: long (nullable = true)
>  |    |    |    |    |    |-- PartyRoleIdentifier: long (nullable = true)
>  |    |    |    |    |    |-- SuffixIdentifier: string (nullable = true)
>  |    |    |    |    |    |-- UCP: string (nullable = true)
>  |    |    |    |    |-- Pool: struct (nullable = true)
>  |    |    |    |    |    |-- Identifier: string (nullable = true)
>  |    |    |    |    |    |-- PartyRoleIdentifier: long (nullable = true)
>  |    |    |    |    |    |-- Rules: struct (nullable = true)
>  |    |    |    |    |    |    |-- Rule: array (nullable = true)
>  |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |    |    |    |-- BIdentifier: long (nullable = true)
>  |    |    |    |    |    |    |    |    |-- RIdentifier: long (nullable = true)
>  |    |    |    |    |    |-- SuffixIdentifier: string (nullable = true)
>  |    |    |    |    |    |-- UCP: string (nullable = true)
>  |    |    |-- ResultHeader: struct (nullable = true)
>  |    |    |    |-- RequestDateTime: string (nullable = true)
>  |    |    |    |-- ResultDateTime: string (nullable = true)
>  |    |-- ResponseFileUUID: string (nullable = true)
>  |    |-- ResponseFileVersion: double (nullable = true)
>  |-- ResponseHeader: struct (nullable = true)
>  |    |-- ResponseDateTime: string (nullable = true)
>  |    |-- SessionIdentifier: string (nullable = true)
>  |-- ns2: string (nullable = true)
>  |-- xmlns: string (nullable = true)
>
>
> df.write.format('com.databricks.spark.avro').save('ggl_avro')
>
> 16/09/08 17:07:20 INFO MemoryStore: Block broadcast_73 stored as values in
> memory (estimated size 233.5 KB, free 772.4 KB)
> 16/09/08 17:07:20 INFO MemoryStore: Block broadcast_73_piece0 stored as
> bytes in memory (estimated size 28.2 KB, free 800.6 KB)
> 16/09/08 17:07:20 INFO BlockManagerInfo: Added broadcast_73_piece0 in
> memory on localhost:29785 (size: 28.2 KB, free: 511.4 MB)
> 16/09/08 17:

spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Arun Patel
I'm trying to convert XML to Avro, but I am getting a SchemaParseException
for 'Rules', which exists in two separate containers.  Any
thoughts?

XML is attached.

 df =
sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGLResponse',attributePrefix='').load('GGL.xml')
 df.show()
 +----------------+--------------------+---+----------------+
 | ResponseDataset|      ResponseHeader|ns2|           xmlns|
 +----------------+--------------------+---+----------------+
 |[1,1],[SD2000...|[2016-07-26T16:28...|GGL|http://www..c...|
 +----------------+--------------------+---+----------------+

 >>> df.printSchema()
 root
 |-- ResponseDataset: struct (nullable = true)
 |    |-- ResponseFileGGL: struct (nullable = true)
 |    |    |-- OfferSets: struct (nullable = true)
 |    |    |    |-- OfferSet: struct (nullable = true)
 |    |    |    |    |-- OfferSetHeader: struct (nullable = true)
 |    |    |    |    |    |-- OfferSetIdentifier: long (nullable = true)
 |    |    |    |    |    |-- TotalOffersProcessed: long (nullable = true)
 |    |    |    |    |-- Offers: struct (nullable = true)
 |    |    |    |    |    |-- Identifier: string (nullable = true)
 |    |    |    |    |    |-- Offer: struct (nullable = true)
 |    |    |    |    |    |    |-- Rules: struct (nullable = true)
 |    |    |    |    |    |    |    |-- Rule: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- BorrowerIdentifier: long (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- RuleIdentifier: long (nullable = true)
 |    |    |    |    |    |-- PartyRoleIdentifier: long (nullable = true)
 |    |    |    |    |    |-- SuffixIdentifier: string (nullable = true)
 |    |    |    |    |    |-- UCP: string (nullable = true)
 |    |    |    |    |-- Pool: struct (nullable = true)
 |    |    |    |    |    |-- Identifier: string (nullable = true)
 |    |    |    |    |    |-- PartyRoleIdentifier: long (nullable = true)
 |    |    |    |    |    |-- Rules: struct (nullable = true)
 |    |    |    |    |    |    |-- Rule: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- BIdentifier: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- RIdentifier: long (nullable = true)
 |    |    |    |    |    |-- SuffixIdentifier: string (nullable = true)
 |    |    |    |    |    |-- UCP: string (nullable = true)
 |    |    |-- ResultHeader: struct (nullable = true)
 |    |    |    |-- RequestDateTime: string (nullable = true)
 |    |    |    |-- ResultDateTime: string (nullable = true)
 |    |-- ResponseFileUUID: string (nullable = true)
 |    |-- ResponseFileVersion: double (nullable = true)
 |-- ResponseHeader: struct (nullable = true)
 |    |-- ResponseDateTime: string (nullable = true)
 |    |-- SessionIdentifier: string (nullable = true)
 |-- ns2: string (nullable = true)
 |-- xmlns: string (nullable = true)


df.write.format('com.databricks.spark.avro').save('ggl_avro')

16/09/08 17:07:20 INFO MemoryStore: Block broadcast_73 stored as values in
memory (estimated size 233.5 KB, free 772.4 KB)
16/09/08 17:07:20 INFO MemoryStore: Block broadcast_73_piece0 stored as
bytes in memory (estimated size 28.2 KB, free 800.6 KB)
16/09/08 17:07:20 INFO BlockManagerInfo: Added broadcast_73_piece0 in
memory on localhost:29785 (size: 28.2 KB, free: 511.4 MB)
16/09/08 17:07:20 INFO SparkContext: Created broadcast 73 from
newAPIHadoopFile at XmlFile.scala:39
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/readwriter.py", line
397, in save
self._jwrite.save(path)
  File
"/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in __call__
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 45,
in deco
return f(*a, **kw)
  File
"/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o426.save.
: org.apache.avro.SchemaParseException: Can't redefine: Rules
at
org.apache.avro.SchemaBuilder$NameContext.put(SchemaBuilder.java:936)
at
org.apache.avro.SchemaBuilder$NameContext.access$600(SchemaBuilder.java:884)
at
org.apache.avro.SchemaBuilder$NamespacedBuilder.completeSchema(SchemaBuilder.java:470)
at
org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1734)

[Attached GGL.xml sample omitted: the mail archive stripped the XML markup, leaving only bare text values.]

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
Thanks for the response. However, I am not able to use any output mode.  Does
that mean that with the parquet sink there should not be any aggregations?

scala> val query =
 
streamingCountsDF.writeStream.format("parquet").option("path","parq").option("checkpointLocation","chkpnt").outputMode("complete").start()
java.lang.IllegalArgumentException: Data source parquet does not support
Complete output mode
  at
org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:267)
  at
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:291)
  ... 54 elided

scala> val query =
 
streamingCountsDF.writeStream.format("parquet").option("path","parq").option("checkpointLocation","chkpnt").outputMode("append").start()
org.apache.spark.sql.AnalysisException: Append output mode not supported
when there are streaming aggregations on streaming DataFrames/DataSets;
  at
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:173)
  at
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:60)
  at
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:236)
  at
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:287)
  ... 54 elided

scala> val query =
 
streamingCountsDF.writeStream.format("parquet").option("path","parq").option("checkpointLocation","chkpnt").outputMode("complete").start()
java.lang.IllegalArgumentException: Data source parquet does not support
Complete output mode
  at
org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:267)
  at
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:291)
  ... 54 elided
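
Based on the two errors above, the file sink only takes append mode and append
rejects streaming aggregations, so a minimal PySpark sketch of what should go
through (Spark 2.0; source and paths are illustrative) is writing the
un-aggregated stream and aggregating over the parquet files in a separate
batch query:

  raw = (spark.readStream
         .format("socket")                  # illustrative source
         .option("host", "localhost")
         .option("port", 9999)
         .load())

  query = (raw.writeStream
           .format("parquet")
           .option("path", "parq")
           .option("checkpointLocation", "chkpnt")
           .outputMode("append")            # the only mode the file sink accepts
           .start())

  # later, as a separate batch query over the written files:
  # spark.read.parquet("parq").groupBy("value").count().show()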


On Sat, Jul 30, 2016 at 5:59 PM, Tathagata Das <t...@databricks.com> wrote:

> Correction, the two options are.
>
> - writeStream.format("parquet").option("path", "...").start()
> - writestream.parquet("...").start()
>
> There is no start with param.
>
> On Jul 30, 2016 11:22 AM, "Jacek Laskowski" <ja...@japila.pl> wrote:
>
>> Hi Arun,
>>
>> > As per documentation, parquet is the only available file sink.
>>
>> The following sinks are currently available in Spark:
>>
>> * ConsoleSink for console format.
>> * FileStreamSink for parquet format.
>> * ForeachSink used in foreach operator.
>> * MemorySink for memory format.
>>
>> You can create your own streaming format implementing StreamSinkProvider.
>>
>> > I am getting an error like 'path' is not specified.
>> > Any idea how to write this to parquet file?
>>
>> There are two ways to specify "path":
>>
>> 1. Using option method
>> 2. start(path: String): StreamingQuery
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sat, Jul 30, 2016 at 2:50 PM, Arun Patel <arunp.bigd...@gmail.com>
>> wrote:
>> > I am trying out Structured streaming parquet sink.  As per
>> documentation,
>> > parquet is the only available file sink.
>> >
>> > I am getting an error like 'path' is not specified.
>> >
>> > scala> val query =
>> streamingCountsDF.writeStream.format("parquet").start()
>> > java.lang.IllegalArgumentException: 'path' is not specified
>> >   at
>> >
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:264)
>> >   at
>> >
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:264)
>> >   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>> >   at
>> >
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.getOrElse(ddl.scala:117)
>> >   at
>> >
>> org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:263)
>> >   at
>> >
>> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:291)
>> >   ... 60 elided
>> >
>> > But, I don't see path or parquet in DataStreamWriter.
>> >
>> > scala> val query = streamingCountsDF.writeStream.
>> > foreach   format   option   options   outputMode   partitionBy
>>  queryName
>> > start   trigger
>> >
>> > Any idea how to write this to parquet file?
>> >
>> > - Arun
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
I am trying out Structured streaming parquet sink.  As per documentation,
parquet is the only available file sink.

I am getting an error like 'path' is not specified.

scala> val query = streamingCountsDF.writeStream.format("parquet").start()
java.lang.IllegalArgumentException: 'path' is not specified
  at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:264)
  at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:264)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.getOrElse(ddl.scala:117)
  at
org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:263)
  at
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:291)
  ... 60 elided

But, I don't see path or parquet in DataStreamWriter.

scala> val query = streamingCountsDF.writeStream.
foreach   format   option   options   outputMode   partitionBy   queryName
  start   trigger

Any idea how to write this to parquet file?

- Arun


Re: Graphframe Error

2016-07-07 Thread Arun Patel
I have tried this already.  It does not work.

What version of Python is needed for this package?
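
A hedged sketch of the usual workaround (the jar path is an assumption - use
wherever --packages cached it, e.g. under ~/.ivy2): extract only the .py
sources from the graphframes jar, since the "Bad magic number" error comes
from .pyc files compiled for a different Python, then ship the sources
explicitly.

  import zipfile

  jar_path = "graphframes-0.1.0-spark1.6.jar"   # illustrative location
  with zipfile.ZipFile(jar_path) as jar:
      py_files = [m for m in jar.namelist()
                  if m.startswith("graphframes/") and m.endswith(".py")]
      jar.extractall(path=".", members=py_files)

  # then launch pyspark from this directory (local mode), or zip the extracted
  # graphframes/ folder and pass it with --py-files for cluster mode, together
  # with --jars for the JVM side.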

On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> This could be the workaround:
>
> http://stackoverflow.com/a/36419857
>
>
>
>
> On Tue, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" <
> arunp.bigd...@gmail.com> wrote:
>
> Thanks Yanbo and Felix.
>
> I tried these commands on CDH Quickstart VM and also on "Spark 1.6
> pre-built for Hadoop" version.  I am still not able to get it working.  Not
> sure what I am missing.  Attaching the logs.
>
>
>
>
> On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> It looks like either the extracted Python code is corrupted or there is a
>> Python version mismatch. Are you using Python 3?
>>
>>
>> stackoverflow.com/questions/514371/whats-the-bad-magic-number-error
>>
>>
>>
>>
>>
>> On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" <yblia...@gmail.com>
>> wrote:
>>
>> Hi Arun,
>>
>> The command
>>
>> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>>
>> will automatically load the required graphframes jar file from the Maven
>> repository; it is not affected by where the jar file was placed. Your
>> example works well on my laptop.
>>
>> Or you can try
>>
>> bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar
>>
>> to launch PySpark with graphframes enabled. You should set the "--py-files"
>> and "--jars" options to the path where you saved graphframes.jar.
>>
>> Thanks
>> Yanbo
>>
>>
>> 2016-07-03 15:48 GMT-07:00 Arun Patel <arunp.bigd...@gmail.com>:
>>
>>> I started my pyspark shell with command  (I am using spark 1.6).
>>>
>>> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>>>
>>> I have copied
>>> http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
>>> to the lib directory of Spark as well.
>>>
>>> I was getting below error
>>>
>>> >>> from graphframes import *
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> zipimport.ZipImportError: can't find module 'graphframes'
>>> >>>
>>>
>>> So, as per suggestions from similar questions, I have extracted the
>>> graphframes python directory and copied it to the local directory where I am
>>> running pyspark.
>>>
>>> >>> from graphframes import *
>>>
>>> But, not able to create the GraphFrame
>>>
>>> >>> g = GraphFrame(v, e)
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> NameError: name 'GraphFrame' is not defined
>>>
>>> Also, I am getting below error.
>>> >>> from graphframes.examples import Graphs
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> ImportError: Bad magic number in graphframes/examples.pyc
>>>
>>> Any help will be highly appreciated.
>>>
>>> - Arun
>>>
>>
>>
>


Re: Graphframe Error

2016-07-05 Thread Arun Patel
Thanks Yanbo and Felix.

I tried these commands on CDH Quickstart VM and also on "Spark 1.6
pre-built for Hadoop" version.  I am still not able to get it working.  Not
sure what I am missing.  Attaching the logs.




On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> It looks like either the extracted Python code is corrupted or there is a
> Python version mismatch. Are you using Python 3?
>
>
> stackoverflow.com/questions/514371/whats-the-bad-magic-number-error
>
>
>
>
>
> On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" <yblia...@gmail.com>
> wrote:
>
> Hi Arun,
>
> The command
>
> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>
> will automatically load the required graphframes jar file from the Maven
> repository; it is not affected by where the jar file was placed. Your
> example works well on my laptop.
>
> Or you can try
>
> bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar
>
> to launch PySpark with graphframes enabled. You should set the "--py-files"
> and "--jars" options to the path where you saved graphframes.jar.
>
> Thanks
> Yanbo
>
>
> 2016-07-03 15:48 GMT-07:00 Arun Patel <arunp.bigd...@gmail.com>:
>
>> I started my pyspark shell with command  (I am using spark 1.6).
>>
>> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>>
>> I have copied
>> http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
>> to the lib directory of Spark as well.
>>
>> I was getting below error
>>
>> >>> from graphframes import *
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> zipimport.ZipImportError: can't find module 'graphframes'
>> >>>
>>
>> So, as per suggestions from similar questions, I have extracted the
>> graphframes python directory and copied it to the local directory where I am
>> running pyspark.
>>
>> >>> from graphframes import *
>>
>> But, not able to create the GraphFrame
>>
>> >>> g = GraphFrame(v, e)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'GraphFrame' is not defined
>>
>> Also, I am getting below error.
>> >>> from graphframes.examples import Graphs
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> ImportError: Bad magic number in graphframes/examples.pyc
>>
>> Any help will be highly appreciated.
>>
>> - Arun
>>
>
>


Graphframes-packages.log
Description: Binary data


Graphframes-pyfiles.log
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Graphframe Error

2016-07-03 Thread Arun Patel
I started my pyspark shell with command  (I am using spark 1.6).

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

I have copied
http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
to the lib directory of Spark as well.

I was getting below error

>>> from graphframes import *
Traceback (most recent call last):
  File "", line 1, in 
zipimport.ZipImportError: can't find module 'graphframes'
>>>

So, as per suggestions from similar questions, I have extracted the
graphframes python directory and copied it to the local directory where I am
running pyspark.

>>> from graphframes import *

But, not able to create the GraphFrame

>>> g = GraphFrame(v, e)
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'GraphFrame' is not defined

Also, I am getting below error.
>>> from graphframes.examples import Graphs
Traceback (most recent call last):
  File "", line 1, in 
ImportError: Bad magic number in graphframes/examples.pyc

Any help will be highly appreciated.

- Arun


Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Arun Patel
Can anyone answer these questions, please?



On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:

> Thanks Michael.
>
> I went thru these slides already and could not find answers for these
> specific questions.
>
> I created a Dataset and converted it to DataFrame in 1.6 and 2.0.  I don't
> see any difference in 1.6 vs 2.0.  So, I really got confused and asked
> these questions about unification.
>
> Appreciate if you can answer these specific questions.  Thank you very
> much!
>
> On Mon, Jun 13, 2016 at 2:55 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Here's a talk I gave on the topic:
>>
>> https://www.youtube.com/watch?v=i7l3JQRx7Qw
>>
>> http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust
>>
>> On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <arunp.bigd...@gmail.com>
>> wrote:
>>
>>> In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply
>>> an alias for a Dataset of type Row.  I have a few questions.
>>>
>>> 1) What does this really mean to an Application developer?
>>> 2) Why this unification was needed in Spark 2.0?
>>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>>> 4) Compile time safety will be there for DataFrames too?
>>> 5) Python API is supported for Datasets in 2.0?
>>>
>>> Thanks
>>> Arun
>>>
>>
>>
>


Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
Thanks Michael.

I went thru these slides already and could not find answers for these
specific questions.

I created a Dataset and converted it to DataFrame in 1.6 and 2.0.  I don't
see any difference in 1.6 vs 2.0.  So, I really got confused and asked
these questions about unification.

Appreciate if you can answer these specific questions.  Thank you very much!

On Mon, Jun 13, 2016 at 2:55 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Here's a talk I gave on the topic:
>
> https://www.youtube.com/watch?v=i7l3JQRx7Qw
>
> http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust
>
> On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <arunp.bigd...@gmail.com>
> wrote:
>
>> In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an
>> alias for a Dataset of type Row.  I have a few questions.
>>
>> 1) What does this really mean to an Application developer?
>> 2) Why this unification was needed in Spark 2.0?
>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>> 4) Compile time safety will be there for DataFrames too?
>> 5) Python API is supported for Datasets in 2.0?
>>
>> Thanks
>> Arun
>>
>
>


Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an
alias for a Dataset of type Row.  I have a few questions.

1) What does this really mean to an Application developer?
2) Why this unification was needed in Spark 2.0?
3) What changes can be observed in Spark 2.0 vs Spark 1.6?
4) Compile time safety will be there for DataFrames too?
5) Python API is supported for Datasets in 2.0?

Thanks
Arun
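
For reference on question 5, a minimal PySpark sketch: in Python the
user-facing type is still DataFrame in 2.0 (there is no typed Dataset API in
Python), and the most visible change is the new SparkSession entry point.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("unify-demo").getOrCreate()

  df = spark.read.json("examples/src/main/resources/people.json")
  print(type(df))     # pyspark.sql.dataframe.DataFrame - same class as in 1.6
  df.printSchema()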


Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Thanks Sean and Jacek.

Do we have any updated documentation for 2.0 somewhere?

On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski  wrote:

> On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen  wrote:
> > That's not any kind of authoritative statement, just my opinion and
> guess.
>
> Oh, come on. You're not **a** Sean but **the** Sean (= a PMC member
> and the JIRA/PRs keeper) so what you say **is** kinda official. Sorry.
> But don't worry the PMC (the group) can decide whatever it wants to
> decide so any date is a good date.
>
> > Reynold mentioned the idea of releasing monthly milestone releases for
> > the latest branch. That's an interesting idea for the future.
>
> +1
>
> > I know there was concern that publishing a preview release, which is
> like an alpha,
> > might leave alpha-quality code out there too long as the latest
> > release. Hence, probably support for publishing some kind of preview 2
> > or beta or whatever
>
> The issue seems to have been sorted out when Matei and Reynold agreed
> to push the preview out (which is a good thing!), and I'm sure
> there'll be little to no concern to do it again and again. 2.0 is
> certainly taking far too long (as if there were some magic in 2.0).
>
> p.s. It's so frustrating to tell people about the latest and greatest
> of 2.0 and then switch to 1.6.1 or even older in projects :( So
> frustrating...
>
> Jacek
>


Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Do we have any further updates on release date?

Also, is there updated documentation for 2.0 somewhere?

Thanks
Arun

On Thu, Apr 28, 2016 at 4:50 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Arun,
>
> My bet is...https://spark-summit.org/2016 :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel <arunp.bigd...@gmail.com>
> wrote:
> > A small request.
> >
> > Would you mind providing an approximate date for the Spark 2.0 release?  Is it
> > early May, mid-May, or end of May?
> >
> > Thanks,
> > Arun
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark 2.0 Release Date

2016-04-28 Thread Arun Patel
A small request.

Would you mind providing an approximate date for the Spark 2.0 release?  Is it
early May, mid-May, or end of May?

Thanks,
Arun


Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
So, does that mean only one RDD is created across all receivers?



On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao <sai.sai.s...@gmail.com>
wrote:

> Normally there will be one RDD in each batch.
>
> You could refer to the implementation of DStream#getOrCompute.
>
>
> On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel <arunp.bigd...@gmail.com>
> wrote:
>
>> It may be a simple question... but I am struggling to understand this.
>>
>> DStream is a sequence of RDDs created in a batch window.  So, how do I
>> know how many RDDs are created in a batch?
>>
>> I am clear about the number of partitions created which is
>>
>> Number of Partitions =  (Batch Interval / spark.streaming.blockInterval)
>> * number of receivers
>>
>> Is it one RDD per receiver, or multiple RDDs per receiver? What is
>> the easiest way to find out?
>>
>> Arun
>>
>
>


Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Arun Patel
It may be a simple question... but I am struggling to understand this.

DStream is a sequence of RDDs created in a batch window.  So, how do I know
how many RDDs are created in a batch?

I am clear about the number of partitions created which is

Number of Partitions =  (Batch Interval / spark.streaming.blockInterval) *
number of receivers

Is it one RDD per receiver, or multiple RDDs per receiver? What is the
easiest way to find out?

Arun
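
For reference, a minimal sketch (receiver-based API; host/port are
placeholders) that makes this visible: each DStream yields one RDD per batch
interval, and getNumPartitions() shows how the receiver blocks were grouped
into partitions.

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext(appName="rdds-per-batch")
  ssc = StreamingContext(sc, batchDuration=10)    # 10-second batches

  lines = ssc.socketTextStream("localhost", 9999)

  def inspect(rdd):
      # called once per batch: one RDD per DStream per batch interval
      print("partitions in this batch's RDD: %d" % rdd.getNumPartitions())

  lines.foreachRDD(inspect)

  ssc.start()
  ssc.awaitTermination()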


Scheduling across applications - Need suggestion

2015-04-22 Thread Arun Patel
I believe we can use properties like --executor-memory and
--total-executor-cores to configure the resources allocated to each
application.  But in a multi-user environment, shells and applications are
being submitted by multiple users at the same time, each requesting resources
with different properties.  At times, some users do not get any resources
from the cluster.


How can resource usage be controlled in this case?  Please share any best
practices you follow.


As per my understanding, the fair scheduler can be used for scheduling tasks
within an application, but not across multiple applications.  Is this
correct?


Regards,

Arun
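
For reference, a hedged sketch of the per-application caps that are usually
combined with cluster-manager queues (the property names are standard Spark
configuration keys; the values are illustrative). Across applications the
arbitration is done by the cluster manager (e.g. YARN queues or the standalone
master), while fair scheduler pools only apply within a single application.

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("capped-app")
          .set("spark.cores.max", "8")                      # standalone/Mesos: cap cores per app
          .set("spark.executor.memory", "4g")
          .set("spark.dynamicAllocation.enabled", "true")   # give executors back when idle
          .set("spark.shuffle.service.enabled", "true"))    # required by dynamic allocation

  sc = SparkContext(conf=conf)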


mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
What is the difference between mapPartitions and foreachPartition?

When to use these?

Thanks,
Arun


Re: mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
So mapPartitions is a transformation and foreachPartition is an action?

Thanks
Arun
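
A minimal sketch of that difference, assuming an existing SparkContext sc:

  rdd = sc.parallelize(range(10), 2)

  # mapPartitions: transformation - takes an iterator, must return an iterator,
  # is evaluated lazily and yields a new RDD.
  doubled = rdd.mapPartitions(lambda it: (x * 2 for x in it))
  print(doubled.collect())

  # foreachPartition: action - takes an iterator, returns nothing; used for
  # side effects such as opening one connection per partition.
  def write_partition(it):
      for x in it:
          pass    # e.g. send x to an external system

  rdd.foreachPartition(write_partition)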

On Mon, Apr 20, 2015 at 4:38 AM, Archit Thakur archit279tha...@gmail.com
wrote:

 The same difference as between map and foreach: map takes an iterator and
 returns an iterator; foreach takes an iterator and returns Unit.

 On Mon, Apr 20, 2015 at 4:05 PM, Arun Patel arunp.bigd...@gmail.com
 wrote:

  What is the difference between mapPartitions and foreachPartition?

 When to use these?

 Thanks,
 Arun





Code Deployment tools in Production

2015-04-19 Thread Arun Patel
Generally, what tools are used to schedule Spark jobs in production?

How is Spark Streaming code deployed?

I am interested in knowing which tools are used - cron, Oozie, etc.

Thanks,
Arun


Re: Dataframes Question

2015-04-19 Thread Arun Patel
Thanks Ted.

So, whatever operations I am performing now are on DataFrames and not
SchemaRDDs?  Is that right?

Regards,
Venkat

On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. SchemaRDD is not existing in 1.3?

 That's right.

 See this thread for more background:

 http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame



 On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh 
 abhis...@tetrationanalytics.com wrote:

 I am no expert myself, but from what I understand DataFrame is
 grandfathering SchemaRDD. This was done for API stability as spark sql
 matured out of alpha as part of 1.3.0 release.

 It is forward looking and brings (dataframe like) syntax that was not
 available with the older schema RDD.

 On Apr 18, 2015, at 4:43 PM, Arun Patel arunp.bigd...@gmail.com wrote:

  Experts,
 
  I have a few basic questions on DataFrames vs Spark SQL.  My confusion is
 more with DataFrames.
 
  1)  What is the difference between Spark SQL and DataFrames?  Are they
 the same?
  2)  Documentation says SchemaRDD was renamed to DataFrame. Does this mean
 SchemaRDD no longer exists in 1.3?
  3)  As per the documentation, it looks like creating a DataFrame is no
 different than creating a SchemaRDD -  df =
 sqlContext.jsonFile("examples/src/main/resources/people.json").
  So, my question is: what is the difference?
 
  Thanks for your help.
 
  Arun


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Dataframes Question

2015-04-18 Thread Arun Patel
Experts,

I have a few basic questions on DataFrames vs Spark SQL.  My confusion is
more with DataFrames.

1)  What is the difference between Spark SQL and DataFrames?  Are they the same?
2)  Documentation says SchemaRDD was renamed to DataFrame. Does this mean
SchemaRDD no longer exists in 1.3?
3)  As per the documentation, it looks like creating a DataFrame is no different
than creating a SchemaRDD -  df =
sqlContext.jsonFile("examples/src/main/resources/people.json").
So, my question is: what is the difference?

Thanks for your help.

Arun
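
For reference, a minimal sketch of points 2 and 3, assuming Spark 1.3+ where
the same call now returns a DataFrame (previously a SchemaRDD), and showing
that the DataFrame API and Spark SQL operate on the same object:

  df = sqlContext.jsonFile("examples/src/main/resources/people.json")
  print(type(df))                  # pyspark.sql.dataframe.DataFrame

  df.select("name").show()         # DataFrame API

  df.registerTempTable("people")   # the same data queried through Spark SQL
  sqlContext.sql("SELECT name FROM people WHERE age > 21").show()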