Re: Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Muthu Jayakumar
Hello Dong Meng,

Thanks for the tip. But I do have code in place that looks like this...

StructField(columnName, getSparkDataType(dataType), nullable = true)

Maybe I am missing something else. The same code works fine with Spark
1.6.2 though. On a side note, I could be using SparkSession, but I don't
know how to split and map the rows elegantly, hence I am using an RDD.

Thanks,
Muthu
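
A minimal sketch (not from the original thread) of one way around this, assuming rawFileRDD is an RDD[Row] whose cells are Option values: unwrap each Option into a plain value or null before calling createDataFrame, since Spark 2.0's RowEncoder rejects scala.Some/None as external types for primitive schema fields. The helper name below is made up.

import org.apache.spark.sql.Row

// Hypothetical helper: unwrap Option cells so nullable StructFields see
// plain values or null instead of scala.Some/None.
def unwrapOptions(row: Row): Row =
  Row.fromSeq(row.toSeq.map {
    case Some(v) => v    // keep the boxed value
    case None    => null // null is fine for a nullable StructField
    case v       => v    // already a plain value
  })

// val df = sqlContext.createDataFrame(rawFileRDD.map(unwrapOptions), schema)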


On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng  wrote:

> you can specify nullable in StructField
>
> On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar 
> wrote:
>
>> Hello there,
>>
>> I am using Spark 2.0.0 to create a parquet file using a text file with
>> Scala. I am trying to read a text file with bunch of values of type string
>> and long (mostly). And all the occurrences can be null. In order to support
>> nulls, all the values are boxed with Option (ex:- Option[String],
>> Option[Long]).
>> The schema for the parquet file is based on some external metadata file,
>> so I use 'StructField' to create a schema programmatically and perform some
>> code snippet like below...
>>
>> sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
>>   convertToRawColumns(line, schemaSeq)
>> }
>>
>> ...
>>
>> val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.
>>
>> On a side note, the same code used to work fine with Spark 1.6.2.
>>
>> Here is the error from Spark 2.0.0.
>>
>> Jul 28, 2016 8:27:10 PM INFO:
>> org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
>> Jul 28, 2016 8:27:10 PM INFO:
>> org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size Jul-28
>> 20:27:10 [Executor task launch worker-483] ERROR   Utils Aborting task
>> java.lang.RuntimeException: Error while encoding:
>> java.lang.RuntimeException: scala.Some is not a valid external type for
>> schema of string
>> if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
>> object).isNullAt) null else staticinvoke(class
>> org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
>> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
>> org.apache.spark.sql.Row, true], top level row object), 0, host),
>> StringType), true) AS host#37315
>> +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
>> row object).isNullAt) null else staticinvoke(class
>> org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
>> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
>> org.apache.spark.sql.Row, true], top level row object), 0, host),
>> StringType), true)
>>:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
>> row object).isNullAt
>>:  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top
>> level row object)
>>:  :  +- input[0, org.apache.spark.sql.Row, true]
>>:  +- 0
>>:- null
>>+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
>> StringType, fromString,
>> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
>> org.apache.spark.sql.Row, true], top level row object), 0, host),
>> StringType), true)
>>   +- validateexternaltype(getexternalrowfield(assertnotnull(input[0,
>> org.apache.spark.sql.Row, true], top level row object), 0, host),
>> StringType)
>>  +- getexternalrowfield(assertnotnull(input[0,
>> org.apache.spark.sql.Row, true], top level row object), 0, host)
>> +- assertnotnull(input[0, org.apache.spark.sql.Row, true],
>> top level row object)
>>+- input[0, org.apache.spark.sql.Row, true]
>>
>>
>> Let me know if you would like me try to create a more simplified
>> reproducer to this problem. Perhaps I should not be using Option[T] for
>> nullable schema values?
>>
>> Please advice,
>> Muthu
>>
>
>


Re: Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Dong Meng
you can specify nullable in StructField
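
For reference, a small sketch (field names are illustrative, not taken from the original schema code) of what specifying nullable looks like when building the schema programmatically:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// The third argument marks each column as nullable.
val schema = StructType(Seq(
  StructField("host", StringType, nullable = true),
  StructField("bytes", LongType, nullable = true)
))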

On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar  wrote:

> Hello there,
>
> I am using Spark 2.0.0 to create a parquet file using a text file with
> Scala. I am trying to read a text file with bunch of values of type string
> and long (mostly). And all the occurrences can be null. In order to support
> nulls, all the values are boxed with Option (ex:- Option[String],
> Option[Long]).
> The schema for the parquet file is based on some external metadata file,
> so I use 'StructField' to create a schema programmatically and perform some
> code snippet like below...
>
> sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
>   convertToRawColumns(line, schemaSeq)
> }
>
> ...
>
> val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.
>
> On a side note, the same code used to work fine with Spark 1.6.2.
>
> Here is the error from Spark 2.0.0.
>
> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Jul 28, 2016 8:27:10 PM INFO:
> org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size Jul-28
> 20:27:10 [Executor task launch worker-483] ERROR   Utils Aborting task
> java.lang.RuntimeException: Error while encoding:
> java.lang.RuntimeException: scala.Some is not a valid external type for
> schema of string
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
> object).isNullAt) null else staticinvoke(class
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType), true) AS host#37315
> +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
> row object).isNullAt) null else staticinvoke(class
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType), true)
>:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
> row object).isNullAt
>:  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
> row object)
>:  :  +- input[0, org.apache.spark.sql.Row, true]
>:  +- 0
>:- null
>+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
> StringType, fromString,
> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType), true)
>   +- validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType)
>  +- getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host)
> +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top
> level row object)
>+- input[0, org.apache.spark.sql.Row, true]
>
>
> Let me know if you would like me try to create a more simplified
> reproducer to this problem. Perhaps I should not be using Option[T] for
> nullable schema values?
>
> Please advice,
> Muthu
>


Re: Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
I just ran

wget https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom

and can get it without issue.

On Fri, Jul 29, 2016 at 1:44 PM, Ascot Moss  wrote:

> Hi thanks!
>
> mvn dependency:tree
>
> [INFO] Scanning for projects...
>
> Downloading:
> https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom
>
> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
>
> [FATAL] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target and 'parent.relativePath'
> points at wrong local POM @ line 22, column 11
>
>  @
>
> [ERROR] The build could not read 1 project -> [Help 1]
>
> [ERROR]
>
> [ERROR]   The project org.apache.spark:spark-parent_2.11:2.0.0
> (/edh_all_sources/edh_2.6.0/spark-2.0.0/pom.xml) has 1 error
>
> [ERROR] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target and 'parent.relativePath'
> points at wrong local POM @ line 22, column 11 -> [Help 2]
>
> [ERROR]
>
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
>
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>
> [ERROR]
>
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
>
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
>
> [ERROR] [Help 2]
> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>
>
>
>
>
>
> On Fri, Jul 29, 2016 at 1:34 PM, Dong Meng  wrote:
>
>> Before build, first do a "mvn dependency:tree" to make sure the
>> dependency is right
>>
>> On Thu, Jul 28, 2016 at 10:18 PM, Ascot Moss 
>> wrote:
>>
>>> Thanks for your reply.
>>>
>>> Is there a way to find the correct Hadoop profile name?
>>>
>>> On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen  wrote:
>>>
 You have at least two problems here: wrong Hadoop profile name, and
 some kind of firewall interrupting access to the Maven repo. It's not
 related to Spark.

 On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss 
 wrote:
 > Hi,
 >
 > I tried to build spark,
 >
 > (try 1)
 > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
 > -Phive-thriftserver -DskipTests clean package
 >
 > [INFO] Spark Project Parent POM ... FAILURE
 [  0.658
 > s]
 >
 > [INFO] Spark Project Tags . SKIPPED
 >
 > [INFO] Spark Project Sketch ... SKIPPED
 >
 > [INFO] Spark Project Networking ... SKIPPED
 >
 > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
 >
 > [INFO] Spark Project Unsafe ... SKIPPED
 >
 > [INFO] Spark Project Launcher . SKIPPED
 >
 > [INFO] Spark Project Core . SKIPPED
 >
 > [INFO] Spark Project GraphX ... SKIPPED
 >
 > [INFO] Spark Project Streaming  SKIPPED
 >
 > [INFO] Spark Project Catalyst . SKIPPED
 >
 > [INFO] Spark Project SQL .. SKIPPED
 >
 > [INFO] Spark Project ML Local Library . SKIPPED
 >
 > [INFO] Spark Project ML Library ... SKIPPED
 >
 > [INFO] Spark Project Tools  SKIPPED
 >
 > [INFO] Spark Project Hive . SKIPPED
 >
 > [INFO] Spark Project REPL . SKIPPED
 >
 > [INFO] Spark Project YARN Shuffle Service . SKIPPED
 >
 > [INFO] Spark Project YARN . SKIPPED
 >
 > [INFO] Spark Project Hive Thrift Server ... SKIPPED
 >
 > [INFO] Spark Project Assembly . SKIPPED
 >
 > [INFO] Spark Project External Flume Sink .. SKIPPED
 >
 > [INFO] Spark Project External Flume ... SKIPPED
 >
 > [INFO] Spark Project External Flume Assembly .. SKIPPED
 >
 > [INFO] Spark 

Re: Custom Image RDD and Sequence Files

2016-07-28 Thread Jörn Franke
Why don't you write your own Hadoop FileInputFormat? It can be used by Spark...
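
A minimal sketch (not from the thread) of how a Hadoop input format and Writables plug into Spark, using the LongWritable/BytesWritable layout described in the question. The HDFS path is a placeholder, and a hand-written FileInputFormat subclass would be passed to newAPIHadoopFile in exactly the same way; the Kryo registration addresses point 4 of the question.

import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hadoop Writables are not Java-serializable, so switch to Kryo and register
// them (a custom Writable key type would be registered the same way).
val conf = new SparkConf()
  .setAppName("image-seqfile")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LongWritable], classOf[BytesWritable]))
val sc = new SparkContext(conf)

// Read through the new-API input format; a custom FileInputFormat would be
// dropped in here instead of SequenceFileInputFormat.
val tiles = sc.newAPIHadoopFile(
  "hdfs:///data/images.seq", // placeholder path
  classOf[SequenceFileInputFormat[LongWritable, BytesWritable]],
  classOf[LongWritable],
  classOf[BytesWritable])

// Hadoop reuses Writable instances, so copy values out before caching or shuffling.
val byKey = tiles.map { case (k, v) => (k.get, v.copyBytes()) }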

> On 28 Jul 2016, at 20:04, jtgenesis  wrote:
> 
> Hey all,
> 
> I was wondering what the best course of action is for processing an image
> that has an involved internal structure (file headers, sub-headers, image
> data, more sub-headers, more kinds of data etc). I was hoping to get some
> insight on the approach I'm using and whether there is a better, more Spark
> way of handling it.
> 
> I'm coming from a Hadoop approach where I convert the image to a sequence
> file. Now, i'm new to both Spark and Hadoop, but I have a deeper
> understanding of Hadoop, which is why I went with the sequence files. The
> sequence file is chopped into key/value pairs that contain file and image
> meta-data and separate key/value pairs that contain the raw image data. I
> currently use a LongWritable for the key and a BytesWritable for the value.
> This is a naive approach, but I plan to create custom Writable key type that
> contain pertinent information to the corresponding image data. The idea is
> to create a custom Spark Partitioner, taking advantage of the key structure,
> to reduce inter-cluster communication. Example. store all image tiles with
> the same key.id property on the same node.
> 
> 1.) Is converting the image to a Sequence File superfluous? Is it better to
> do this pre-processing and creating a custom key/value type another way.
> Would it be through Spark or Hadoop's Writable? It seems like Spark just
> uses different flavors of Hadoop's InputFormat under the hood.
> 
> I see that Spark does have support for SequenceFiles, but I'm still not
> fully clear on the extent of it.
> 
> 2.)  When you read in a .seq file through sc.sequenceFIle(), it's using
> SequenceFileInputFormat. This means that the number of partitions will be
> determined by the number of splits, specified in the
> SequenceFileInputFormat.getSplits. Do the input splits happen on key/value
> boundaries? 
> 
> 3.) The RDD created from Sequence Files will have the translated Scala
> key/value type, but if I use a custom Hadoop Writable, will I have to do
> anything on Spark/Scala side to understand it?
> 
> 4.) Since I'm using a custom Hadoop Writable, is it best to register my
> writable types with Kryo?
> 
> Thanks for any help!
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Image-RDD-and-Sequence-Files-tp27426.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 




Re: Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
Hi thanks!

mvn dependency:tree

[INFO] Scanning for projects...

Downloading:
https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom

[ERROR] [ERROR] Some problems were encountered while processing the POMs:

[FATAL] Non-resolvable parent POM for
org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target and 'parent.relativePath'
points at wrong local POM @ line 22, column 11

 @

[ERROR] The build could not read 1 project -> [Help 1]

[ERROR]

[ERROR]   The project org.apache.spark:spark-parent_2.11:2.0.0
(/edh_all_sources/edh_2.6.0/spark-2.0.0/pom.xml) has 1 error

[ERROR] Non-resolvable parent POM for
org.apache.spark:spark-parent_2.11:2.0.0: Could not transfer artifact
org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target and 'parent.relativePath'
points at wrong local POM @ line 22, column 11 -> [Help 2]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions,
please read the following articles:

[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

[ERROR] [Help 2]
http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException






On Fri, Jul 29, 2016 at 1:34 PM, Dong Meng  wrote:

> Before build, first do a "mvn dependency:tree" to make sure the
> dependency is right
>
> On Thu, Jul 28, 2016 at 10:18 PM, Ascot Moss  wrote:
>
>> Thanks for your reply.
>>
>> Is there a way to find the correct Hadoop profile name?
>>
>> On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen  wrote:
>>
>>> You have at least two problems here: wrong Hadoop profile name, and
>>> some kind of firewall interrupting access to the Maven repo. It's not
>>> related to Spark.
>>>
>>> On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss 
>>> wrote:
>>> > Hi,
>>> >
>>> > I tried to build spark,
>>> >
>>> > (try 1)
>>> > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
>>> > -Phive-thriftserver -DskipTests clean package
>>> >
>>> > [INFO] Spark Project Parent POM ... FAILURE [
>>> 0.658
>>> > s]
>>> >
>>> > [INFO] Spark Project Tags . SKIPPED
>>> >
>>> > [INFO] Spark Project Sketch ... SKIPPED
>>> >
>>> > [INFO] Spark Project Networking ... SKIPPED
>>> >
>>> > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
>>> >
>>> > [INFO] Spark Project Unsafe ... SKIPPED
>>> >
>>> > [INFO] Spark Project Launcher . SKIPPED
>>> >
>>> > [INFO] Spark Project Core . SKIPPED
>>> >
>>> > [INFO] Spark Project GraphX ... SKIPPED
>>> >
>>> > [INFO] Spark Project Streaming  SKIPPED
>>> >
>>> > [INFO] Spark Project Catalyst . SKIPPED
>>> >
>>> > [INFO] Spark Project SQL .. SKIPPED
>>> >
>>> > [INFO] Spark Project ML Local Library . SKIPPED
>>> >
>>> > [INFO] Spark Project ML Library ... SKIPPED
>>> >
>>> > [INFO] Spark Project Tools  SKIPPED
>>> >
>>> > [INFO] Spark Project Hive . SKIPPED
>>> >
>>> > [INFO] Spark Project REPL . SKIPPED
>>> >
>>> > [INFO] Spark Project YARN Shuffle Service . SKIPPED
>>> >
>>> > [INFO] Spark Project YARN . SKIPPED
>>> >
>>> > [INFO] Spark Project Hive Thrift Server ... SKIPPED
>>> >
>>> > [INFO] Spark Project Assembly . SKIPPED
>>> >
>>> > [INFO] Spark Project External Flume Sink .. SKIPPED
>>> >
>>> > [INFO] Spark Project External Flume ... SKIPPED
>>> >
>>> > [INFO] Spark Project External Flume Assembly .. SKIPPED
>>> >
>>> > [INFO] Spark Integration for Kafka 0.8  SKIPPED
>>> >
>>> > [INFO] Spark Project Examples . SKIPPED
>>> >
>>> > [INFO] Spark Project External Kafka Assembly .. SKIPPED
>>> >
>>> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
>>> >
>>> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED

Re: Spark 2.0 Build Failed

2016-07-28 Thread Dong Meng
Before build, first do a "mvn dependency:tree" to make sure the dependency
is right

On Thu, Jul 28, 2016 at 10:18 PM, Ascot Moss  wrote:

> Thanks for your reply.
>
> Is there a way to find the correct Hadoop profile name?
>
> On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen  wrote:
>
>> You have at least two problems here: wrong Hadoop profile name, and
>> some kind of firewall interrupting access to the Maven repo. It's not
>> related to Spark.
>>
>> On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss  wrote:
>> > Hi,
>> >
>> > I tried to build spark,
>> >
>> > (try 1)
>> > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
>> > -Phive-thriftserver -DskipTests clean package
>> >
>> > [INFO] Spark Project Parent POM ... FAILURE [
>> 0.658
>> > s]
>> >
>> > [INFO] Spark Project Tags . SKIPPED
>> >
>> > [INFO] Spark Project Sketch ... SKIPPED
>> >
>> > [INFO] Spark Project Networking ... SKIPPED
>> >
>> > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
>> >
>> > [INFO] Spark Project Unsafe ... SKIPPED
>> >
>> > [INFO] Spark Project Launcher . SKIPPED
>> >
>> > [INFO] Spark Project Core . SKIPPED
>> >
>> > [INFO] Spark Project GraphX ... SKIPPED
>> >
>> > [INFO] Spark Project Streaming  SKIPPED
>> >
>> > [INFO] Spark Project Catalyst . SKIPPED
>> >
>> > [INFO] Spark Project SQL .. SKIPPED
>> >
>> > [INFO] Spark Project ML Local Library . SKIPPED
>> >
>> > [INFO] Spark Project ML Library ... SKIPPED
>> >
>> > [INFO] Spark Project Tools  SKIPPED
>> >
>> > [INFO] Spark Project Hive . SKIPPED
>> >
>> > [INFO] Spark Project REPL . SKIPPED
>> >
>> > [INFO] Spark Project YARN Shuffle Service . SKIPPED
>> >
>> > [INFO] Spark Project YARN . SKIPPED
>> >
>> > [INFO] Spark Project Hive Thrift Server ... SKIPPED
>> >
>> > [INFO] Spark Project Assembly . SKIPPED
>> >
>> > [INFO] Spark Project External Flume Sink .. SKIPPED
>> >
>> > [INFO] Spark Project External Flume ... SKIPPED
>> >
>> > [INFO] Spark Project External Flume Assembly .. SKIPPED
>> >
>> > [INFO] Spark Integration for Kafka 0.8  SKIPPED
>> >
>> > [INFO] Spark Project Examples . SKIPPED
>> >
>> > [INFO] Spark Project External Kafka Assembly .. SKIPPED
>> >
>> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
>> >
>> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
>> >
>> > [INFO] Spark Project Java 8 Tests . SKIPPED
>> >
>> > [INFO]
>> > 
>> >
>> > [INFO] BUILD FAILURE
>> >
>> > [INFO]
>> > 
>> >
>> > [INFO] Total time: 1.090 s
>> >
>> > [INFO] Finished at: 2016-07-29T07:01:57+08:00
>> >
>> > [INFO] Final Memory: 30M/605M
>> >
>> > [INFO]
>> > 
>> >
>> > [WARNING] The requested profile "hadoop-2.7.0" could not be activated
>> > because it does not exist.
>> >
>> > [ERROR] Plugin org.apache.maven.plugins:maven-site-plugin:3.3 or one of
>> its
>> > dependencies could not be resolved: Failed to read artifact descriptor
>> for
>> > org.apache.maven.plugins:maven-site-plugin:jar:3.3: Could not transfer
>> > artifact org.apache.maven.plugins:maven-site-plugin:pom:3.3 from/to
>> central
>> > (https://repo1.maven.org/maven2):
>> sun.security.validator.ValidatorException:
>> > PKIX path building failed:
>> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
>> find
>> > valid certification path to requested target -> [Help 1]
>> >
>> > [ERROR]
>> >
>> > [ERROR] To see the full stack trace of the errors, re-run Maven with
>> the -e
>> > switch.
>> >
>> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> >
>> > [ERROR]
>> >
>> > [ERROR] For more information about the errors and possible solutions,
>> please
>> > read the following articles:
>> >
>> > [ERROR] [Help 1]
>> >
>> http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
>> >
>> >
>> > (try 2)
>> >
>> > ./build/mvn -DskipTests clean package
>> >
>> > [INFO]
>> > 
>> >
>> > [INFO] Reactor Summary:
>> >
>> > [INFO]
>> >
>> > [INFO] Spark Project Parent POM 

Re: Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
Thanks for your reply.

Is there a way to find the correct Hadoop profile name?

On Fri, Jul 29, 2016 at 7:06 AM, Sean Owen  wrote:

> You have at least two problems here: wrong Hadoop profile name, and
> some kind of firewall interrupting access to the Maven repo. It's not
> related to Spark.
>
> On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss  wrote:
> > Hi,
> >
> > I tried to build spark,
> >
> > (try 1)
> > mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
> > -Phive-thriftserver -DskipTests clean package
> >
> > [INFO] Spark Project Parent POM ... FAILURE [
> 0.658
> > s]
> >
> > [INFO] Spark Project Tags . SKIPPED
> >
> > [INFO] Spark Project Sketch ... SKIPPED
> >
> > [INFO] Spark Project Networking ... SKIPPED
> >
> > [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> >
> > [INFO] Spark Project Unsafe ... SKIPPED
> >
> > [INFO] Spark Project Launcher . SKIPPED
> >
> > [INFO] Spark Project Core . SKIPPED
> >
> > [INFO] Spark Project GraphX ... SKIPPED
> >
> > [INFO] Spark Project Streaming  SKIPPED
> >
> > [INFO] Spark Project Catalyst . SKIPPED
> >
> > [INFO] Spark Project SQL .. SKIPPED
> >
> > [INFO] Spark Project ML Local Library . SKIPPED
> >
> > [INFO] Spark Project ML Library ... SKIPPED
> >
> > [INFO] Spark Project Tools  SKIPPED
> >
> > [INFO] Spark Project Hive . SKIPPED
> >
> > [INFO] Spark Project REPL . SKIPPED
> >
> > [INFO] Spark Project YARN Shuffle Service . SKIPPED
> >
> > [INFO] Spark Project YARN . SKIPPED
> >
> > [INFO] Spark Project Hive Thrift Server ... SKIPPED
> >
> > [INFO] Spark Project Assembly . SKIPPED
> >
> > [INFO] Spark Project External Flume Sink .. SKIPPED
> >
> > [INFO] Spark Project External Flume ... SKIPPED
> >
> > [INFO] Spark Project External Flume Assembly .. SKIPPED
> >
> > [INFO] Spark Integration for Kafka 0.8  SKIPPED
> >
> > [INFO] Spark Project Examples . SKIPPED
> >
> > [INFO] Spark Project External Kafka Assembly .. SKIPPED
> >
> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> >
> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> >
> > [INFO] Spark Project Java 8 Tests . SKIPPED
> >
> > [INFO]
> > 
> >
> > [INFO] BUILD FAILURE
> >
> > [INFO]
> > 
> >
> > [INFO] Total time: 1.090 s
> >
> > [INFO] Finished at: 2016-07-29T07:01:57+08:00
> >
> > [INFO] Final Memory: 30M/605M
> >
> > [INFO]
> > 
> >
> > [WARNING] The requested profile "hadoop-2.7.0" could not be activated
> > because it does not exist.
> >
> > [ERROR] Plugin org.apache.maven.plugins:maven-site-plugin:3.3 or one of
> its
> > dependencies could not be resolved: Failed to read artifact descriptor
> for
> > org.apache.maven.plugins:maven-site-plugin:jar:3.3: Could not transfer
> > artifact org.apache.maven.plugins:maven-site-plugin:pom:3.3 from/to
> central
> > (https://repo1.maven.org/maven2):
> sun.security.validator.ValidatorException:
> > PKIX path building failed:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find
> > valid certification path to requested target -> [Help 1]
> >
> > [ERROR]
> >
> > [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e
> > switch.
> >
> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> >
> > [ERROR]
> >
> > [ERROR] For more information about the errors and possible solutions,
> please
> > read the following articles:
> >
> > [ERROR] [Help 1]
> >
> http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
> >
> >
> > (try 2)
> >
> > ./build/mvn -DskipTests clean package
> >
> > [INFO]
> > 
> >
> > [INFO] Reactor Summary:
> >
> > [INFO]
> >
> > [INFO] Spark Project Parent POM ... FAILURE [
> 0.653
> > s]
> >
> > [INFO] Spark Project Tags . SKIPPED
> >
> > [INFO] Spark Project Sketch ... SKIPPED
> >
> > [INFO] Spark Project Networking ... SKIPPED
> >
> > [INFO] Spark Project Shuffle Streaming 

Question / issue while creating a parquet file using a text file with spark 2.0...

2016-07-28 Thread Muthu Jayakumar
Hello there,

I am using Spark 2.0.0 with Scala to create a Parquet file from a text file.
The text file contains a bunch of values, mostly of type string and long, and
any of them can be null. In order to support nulls, all the values are boxed
with Option (e.g. Option[String], Option[Long]).
The schema for the Parquet file is based on an external metadata file, so I
use 'StructField' to build the schema programmatically and run a code snippet
like the one below...

sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
  convertToRawColumns(line, schemaSeq)
}

...

val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.

On a side note, the same code used to work fine with Spark 1.6.2.

Here is the error from Spark 2.0.0.

Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
Compression: SNAPPY
Jul 28, 2016 8:27:10 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size Jul-28
20:27:10 [Executor task launch worker-483] ERROR   Utils Aborting task
java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.Some is not a valid external type for
schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
object).isNullAt) null else staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true) AS host#37315
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
row object).isNullAt) null else staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
   +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true)
  +- validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType)
 +- getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top
level row object)
   +- input[0, org.apache.spark.sql.Row, true]


Let me know if you would like me to create a simpler reproducer for this
problem. Perhaps I should not be using Option[T] for nullable schema values?

Please advise,
Muthu


Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-28 Thread Phuong LE-HONG
Hi,

I've developed a simple ML estimator (in Java) that implements a
conditional Markov model for sequence labelling in the Vitk toolkit. You
can check it out here:

https://github.com/phuonglh/vn.vitk/blob/master/src/main/java/vn/vitk/tag/CMM.java

Phuong Le-Hong
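
For the custom-Transformer half of the question, a minimal Scala sketch (class name and behaviour are illustrative, patterned on the built-in Tokenizer) of a UnaryTransformer-based transformer:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Illustrative custom transformer: lower-cases a string column and splits it
// on whitespace into an array of tokens.
class SimpleTokenizer(override val uid: String)
  extends UnaryTransformer[String, Seq[String], SimpleTokenizer] {

  def this() = this(Identifiable.randomUID("simpleTok"))

  override protected def createTransformFunc: String => Seq[String] =
    _.toLowerCase.split("\\s+").toSeq

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = false)

  override def copy(extra: ParamMap): SimpleTokenizer = defaultCopy(extra)
}

// Usage: new SimpleTokenizer().setInputCol("text").setOutputCol("words").transform(df)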

On Fri, Jul 29, 2016 at 9:01 AM, janardhan shetty
 wrote:
> Thanks Steve.
>
> Any pointers to custom estimators development as well ?
>
> On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe  wrote:
>>
>> You can see the source for my transformer configurable bridge to Lucene
>> analysis components here, in my company Lucidworks’ spark-solr project:
>> .
>>
>> Here’s a blog I wrote about using this transformer, as well as
>> non-ML-context use in Spark of the underlying analysis component, here:
>> .
>>
>> --
>> Steve
>> www.lucidworks.com
>>
>> > On Jul 27, 2016, at 1:31 PM, janardhan shetty 
>> > wrote:
>> >
>> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
>> >
>> > 2. Any links or blogs to develop custom estimators ? ex: any ml
>> > algorithm
>>
>




Re: spark run shell On yarn

2016-07-28 Thread Marcelo Vanzin
Well, it's more of an unfortunate incompatibility caused by dependency
hell. There's a YARN issue to make this better by avoiding that code
path when it's not needed, but I'm not sure what the status of that is.

On Thu, Jul 28, 2016 at 6:54 PM, censj  wrote:
> ok !  solved !!
> But this is a bug?
> ===
> Name: cen sujun
> Mobile: 13067874572
> Mail: ce...@lotuseed.com
>
> 在 2016年7月29日,08:19,Marcelo Vanzin  写道:
>
> spark.hadoop.yarn.timeline-service.enabled=false
>
>



-- 
Marcelo




Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-28 Thread janardhan shetty
Thanks Steve.

Any pointers to custom estimator development as well?

On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe  wrote:

> You can see the source for my transformer configurable bridge to Lucene
> analysis components here, in my company Lucidworks’ spark-solr project: <
> https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/ml/feature/LuceneTextAnalyzerTransformer.scala
> >.
>
> Here’s a blog I wrote about using this transformer, as well as
> non-ML-context use in Spark of the underlying analysis component, here: <
> https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/>.
>
> --
> Steve
> www.lucidworks.com
>
> > On Jul 27, 2016, at 1:31 PM, janardhan shetty 
> wrote:
> >
> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
> >
> > 2. Any links or blogs to develop custom estimators ? ex: any ml algorithm
>
>


Re: spark run shell On yarn

2016-07-28 Thread Marcelo Vanzin
You can probably do that in Spark's conf too:

spark.hadoop.yarn.timeline-service.enabled=false
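
A short sketch (the app name is illustrative) of setting the same property programmatically; any "spark.hadoop.*" key is copied into the Hadoop Configuration, which is what disables the timeline client that triggers the jersey ClassNotFoundException:

import org.apache.spark.SparkConf

// Equivalent to putting the line above in conf/spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("yarn-shell-workaround") // illustrative
  .set("spark.hadoop.yarn.timeline-service.enabled", "false")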

On Thu, Jul 28, 2016 at 5:13 PM, Jeff Zhang  wrote:
> One workaround is disable timeline in yarn-site,
>
> set yarn.timeline-service.enabled as false in yarn-site.xml
>
> On Thu, Jul 28, 2016 at 5:31 PM, censj  wrote:
>>
>> 16/07/28 17:07:34 WARN shortcircuit.DomainSocketFactory: The short-circuit
>> local reads feature cannot be used because libhadoop cannot be loaded.
>> java.lang.NoClassDefFoundError:
>> com/sun/jersey/api/client/config/ClientConfig
>>   at
>> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
>>   at
>> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
>>   at
>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>   at
>> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
>>   at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>>   at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>>   at org.apache.spark.SparkContext.(SparkContext.scala:500)
>>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
>>   at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>>   at
>> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at
>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>>   ... 47 elided
>> Caused by: java.lang.ClassNotFoundException:
>> com.sun.jersey.api.client.config.ClientConfig
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>   ... 60 more
>> :14: error: not found: value spark
>>import spark.implicits._
>>   ^
>> :14: error: not found: value spark
>>import spark.sql
>>   ^
>> Welcome to
>>
>>
>>
>>
>> hi:
>> I use spark 2.0,but when I run
>> "/etc/spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn” , appear this
>> Error.
>>
>> /etc/spark-2.0.0-bin-hadoop2.6/bin/spark-submit
>> export YARN_CONF_DIR=/etc/hadoop/conf
>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>> export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6
>>
>>
>> how I to update?
>>
>>
>>
>>
>>
>> ===
>> Name: cen sujun
>> Mobile: 13067874572
>> Mail: ce...@lotuseed.com
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang



-- 
Marcelo




Re: spark run shell On yarn

2016-07-28 Thread Jeff Zhang
One workaround is to disable the timeline service in yarn-site:

set yarn.timeline-service.enabled to false in yarn-site.xml

On Thu, Jul 28, 2016 at 5:31 PM, censj  wrote:

> 16/07/28 17:07:34 WARN shortcircuit.DomainSocketFactory: The short-circuit
> local reads feature cannot be used because libhadoop cannot be loaded.
> java.lang.NoClassDefFoundError:
> com/sun/jersey/api/client/config/ClientConfig
>   at
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
>   at
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
>   at
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
>   at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:500)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
>   at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.ClassNotFoundException:
> com.sun.jersey.api.client.config.ClientConfig
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 60 more
> :14: error: not found: value spark
>import spark.implicits._
>   ^
> :14: error: not found: value spark
>import spark.sql
>   ^
> Welcome to
>
>
>
>
> hi:
> I use spark 2.0,but when I run
>  "/etc/spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn” , appear
> this Error.
>
> /etc/spark-2.0.0-bin-hadoop2.6/bin/spark-submit
> export YARN_CONF_DIR=/etc/hadoop/conf
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export SPARK_HOME=/etc/spark-2.0.0-bin-hadoop2.6
>
>
> how I to update?
>
>
>
>
>
> ===
> Name: cen sujun
> Mobile: 13067874572
> Mail: ce...@lotuseed.com
>
>


-- 
Best Regards

Jeff Zhang


Re: Role-based S3 access outside of EMR

2016-07-28 Thread Everett Anderson
Hey,

Just wrapping this up --

I ended up following the instructions to build a custom Spark release with
Hadoop 2.7.2, stealing a bit from Steve's SPARK-7481 PR, in order to get
Spark 1.6.2 + Hadoop 2.7.2 + the hadoop-aws library (which pulls in the
proper AWS Java SDK dependency).

Now that there's an official Spark 2.0 + Hadoop 2.7.x release, this is
probably no longer necessary, but I haven't tried it, yet.

With the custom release, s3a paths work fine with EC2 role credentials
without doing anything special. The only thing I had to do was to add this
extra --conf flag to spark-submit in order to write to encrypted S3 buckets
--

--conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
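
The same setting can also be applied in code on the Hadoop configuration (a sketch, not part of the build instructions; it assumes an existing SparkContext named sc):

// Programmatic equivalent of the --conf flag above; applies to subsequent
// s3a:// reads and writes made through this SparkContext.
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")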

Full instructions for building on Mac are here:

1) Download the Spark 1.6.2 source from
https://spark.apache.org/downloads.html

2) Install R

brew tap homebrew/science
brew install r

3) Set JAVA_HOME and the MAVEN_OPTS as in the instructions

4) Modify the root pom.xml to add a hadoop-2.7 profile (mostly stolen from
Spark 2.0)


<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.2</hadoop.version>
    <jets3t.version>0.9.3</jets3t.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <curator.version>2.6.0</curator.version>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
        <scope>${hadoop.deps.scope}</scope>
        <exclusions>
          <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
          </exclusion>
          <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
          </exclusion>
        </exclusions>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>


5) Modify core/pom.xml to include the corresponding hadoop-aws and AWS SDK
libs


<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
</dependency>

6) Build with

./make-distribution.sh --name custom-hadoop-2.7-2-aws-s3a --tgz -Psparkr
-Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn







On Sat, Jul 23, 2016 at 4:11 AM, Steve Loughran 
wrote:

>
>
> Amazon S3 has stronger consistency guarantees than the ASF s3 clients, it
> uses dynamo to do this.
>
> there is some work underway to do something similar atop S3a, S3guard, see
> https://issues.apache.org/jira/browse/HADOOP-13345  .
>
> Regarding IAM support in Spark, The latest version of S3A, which will ship
> in Hadoop 2.8, adds: IAM, temporary credential, direct env var pickup —and
> the ability to add your own.
>
> Regarding getting the relevant binaries into your app, you need a version
> of the hadoop-aws library consistent with the rest of hadoop, and the
> version of the amazon AWS SDKs that hadoop was built against. APIs in the
> SDK have changed and attempting to upgrade the amazon JAR will fail.
>
> There's a PR attached to SPARK-7481 which does the bundling and adds a
> suite of tests...it's designed to work with Hadoop 2.7+ builds. if you are
> building Spark locally, please try it and provide feedback on the PR
>
> finally, don't try an use s3a  on hadoop-2.6...that was really in preview
> state, and it let bugs surface which were fixed in 2.7.
>
> -Steve
>
> ps: More on S3a in Hadoop 2.8. Things will be way better!
> http://slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
>
>
> On 21 Jul 2016, at 17:23, Ewan Leith  wrote:
>
> If you use S3A rather than S3N, it supports IAM roles.
>
> I think you can make s3a used for s3:// style URLs so it’s consistent with
> your EMR paths by adding this to your Hadoop config, probably in
> core-site.xml:
>
> fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
> fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
>
> And make sure the s3a jars are in your classpath
>
> Thanks,
> Ewan
>
> *From:* Everett Anderson [mailto:ever...@nuna.com.INVALID
> ]
> *Sent:* 21 July 2016 17:01
> *To:* Gourav Sengupta 
> *Cc:* Teng Qiu ; Andy Davidson <
> a...@santacruzintegration.com>; user 
> *Subject:* Re: Role-based S3 access outside of EMR
>
> Hey,
>
> FWIW, we are using EMR, actually, in production.
>
> The main case I have for wanting to access S3 with Spark outside of EMR is
> that during development, our developers tend to run EC2 sandbox instances
> that have all the rest of our code and access to some of the input data on
> S3. It'd be nice if S3 access "just worked" on these without storing the
> access keys in an exposed manner.
>
> Teng -- when you say you use EMRFS, does that mean you copied AWS's EMRFS
> JAR from an EMR cluster and are using it outside? My impression is that AWS
> hasn't released the EMRFS implementation as part of the aws-java-sdk, so
> I'm wary 

Re: Spark 2.0 Build Failed

2016-07-28 Thread Sean Owen
You have at least two problems here: wrong Hadoop profile name, and
some kind of firewall interrupting access to the Maven repo. It's not
related to Spark.

On Thu, Jul 28, 2016 at 4:04 PM, Ascot Moss  wrote:
> Hi,
>
> I tried to build spark,
>
> (try 1)
> mvn -Pyarn -Phadoop-2.7.0 -Dscala-2.11 -Dhadoop.version=2.7.0 -Phive
> -Phive-thriftserver -DskipTests clean package
>
> [INFO] Spark Project Parent POM ... FAILURE [  0.658
> s]
>
> [INFO] Spark Project Tags . SKIPPED
>
> [INFO] Spark Project Sketch ... SKIPPED
>
> [INFO] Spark Project Networking ... SKIPPED
>
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
>
> [INFO] Spark Project Unsafe ... SKIPPED
>
> [INFO] Spark Project Launcher . SKIPPED
>
> [INFO] Spark Project Core . SKIPPED
>
> [INFO] Spark Project GraphX ... SKIPPED
>
> [INFO] Spark Project Streaming  SKIPPED
>
> [INFO] Spark Project Catalyst . SKIPPED
>
> [INFO] Spark Project SQL .. SKIPPED
>
> [INFO] Spark Project ML Local Library . SKIPPED
>
> [INFO] Spark Project ML Library ... SKIPPED
>
> [INFO] Spark Project Tools  SKIPPED
>
> [INFO] Spark Project Hive . SKIPPED
>
> [INFO] Spark Project REPL . SKIPPED
>
> [INFO] Spark Project YARN Shuffle Service . SKIPPED
>
> [INFO] Spark Project YARN . SKIPPED
>
> [INFO] Spark Project Hive Thrift Server ... SKIPPED
>
> [INFO] Spark Project Assembly . SKIPPED
>
> [INFO] Spark Project External Flume Sink .. SKIPPED
>
> [INFO] Spark Project External Flume ... SKIPPED
>
> [INFO] Spark Project External Flume Assembly .. SKIPPED
>
> [INFO] Spark Integration for Kafka 0.8  SKIPPED
>
> [INFO] Spark Project Examples . SKIPPED
>
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
>
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
>
> [INFO] Spark Project Java 8 Tests . SKIPPED
>
> [INFO]
> 
>
> [INFO] BUILD FAILURE
>
> [INFO]
> 
>
> [INFO] Total time: 1.090 s
>
> [INFO] Finished at: 2016-07-29T07:01:57+08:00
>
> [INFO] Final Memory: 30M/605M
>
> [INFO]
> 
>
> [WARNING] The requested profile "hadoop-2.7.0" could not be activated
> because it does not exist.
>
> [ERROR] Plugin org.apache.maven.plugins:maven-site-plugin:3.3 or one of its
> dependencies could not be resolved: Failed to read artifact descriptor for
> org.apache.maven.plugins:maven-site-plugin:jar:3.3: Could not transfer
> artifact org.apache.maven.plugins:maven-site-plugin:pom:3.3 from/to central
> (https://repo1.maven.org/maven2): sun.security.validator.ValidatorException:
> PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target -> [Help 1]
>
> [ERROR]
>
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
>
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>
> [ERROR]
>
> [ERROR] For more information about the errors and possible solutions, please
> read the following articles:
>
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
>
>
> (try 2)
>
> ./build/mvn -DskipTests clean package
>
> [INFO]
> 
>
> [INFO] Reactor Summary:
>
> [INFO]
>
> [INFO] Spark Project Parent POM ... FAILURE [  0.653
> s]
>
> [INFO] Spark Project Tags . SKIPPED
>
> [INFO] Spark Project Sketch ... SKIPPED
>
> [INFO] Spark Project Networking ... SKIPPED
>
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
>
> [INFO] Spark Project Unsafe ... SKIPPED
>
> [INFO] Spark Project Launcher . SKIPPED
>
> [INFO] Spark Project Core . SKIPPED
>
> [INFO] Spark Project GraphX ... SKIPPED
>
> [INFO] Spark Project Streaming  SKIPPED
>
> [INFO] Spark Project Catalyst 

Spark 2.0 Build Failed

2016-07-28 Thread Ascot Moss
Hi,

I tried to build spark,

(try 1)
mvn -Pyarn *-Phadoop-2.7.0* *-Dscala-2.11* -Dhadoop.version=2.7.0 -Phive
-Phive-thriftserver -DskipTests clean package

[INFO] Spark Project Parent POM ... FAILURE [
0.658 s]

[INFO] Spark Project Tags . SKIPPED

[INFO] Spark Project Sketch ... SKIPPED

[INFO] Spark Project Networking ... SKIPPED

[INFO] Spark Project Shuffle Streaming Service  SKIPPED

[INFO] Spark Project Unsafe ... SKIPPED

[INFO] Spark Project Launcher . SKIPPED

[INFO] Spark Project Core . SKIPPED

[INFO] Spark Project GraphX ... SKIPPED

[INFO] Spark Project Streaming  SKIPPED

[INFO] Spark Project Catalyst . SKIPPED

[INFO] Spark Project SQL .. SKIPPED

[INFO] Spark Project ML Local Library . SKIPPED

[INFO] Spark Project ML Library ... SKIPPED

[INFO] Spark Project Tools  SKIPPED

[INFO] Spark Project Hive . SKIPPED

[INFO] Spark Project REPL . SKIPPED

[INFO] Spark Project YARN Shuffle Service . SKIPPED

[INFO] Spark Project YARN . SKIPPED

[INFO] Spark Project Hive Thrift Server ... SKIPPED

[INFO] Spark Project Assembly . SKIPPED

[INFO] Spark Project External Flume Sink .. SKIPPED

[INFO] Spark Project External Flume ... SKIPPED

[INFO] Spark Project External Flume Assembly .. SKIPPED

[INFO] Spark Integration for Kafka 0.8  SKIPPED

[INFO] Spark Project Examples . SKIPPED

[INFO] Spark Project External Kafka Assembly .. SKIPPED

[INFO] Spark Integration for Kafka 0.10 ... SKIPPED

[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED

[INFO] Spark Project Java 8 Tests . SKIPPED

[INFO]


[INFO] BUILD FAILURE

[INFO]


[INFO] Total time: 1.090 s

[INFO] Finished at: 2016-07-29T07:01:57+08:00

[INFO] Final Memory: 30M/605M

[INFO]


[WARNING] The requested profile "hadoop-2.7.0" could not be activated
because it does not exist.

[ERROR] Plugin org.apache.maven.plugins:maven-site-plugin:3.3 or one of its
dependencies could not be resolved: Failed to read artifact descriptor for
org.apache.maven.plugins:maven-site-plugin:jar:3.3: Could not transfer
artifact org.apache.maven.plugins:maven-site-plugin:pom:3.3 from/to central
(https://repo1.maven.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target -> [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions,
please read the following articles:

[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException


(try 2)

./build/mvn -DskipTests clean package

[INFO]


[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... FAILURE [
0.653 s]

[INFO] Spark Project Tags . SKIPPED

[INFO] Spark Project Sketch ... SKIPPED

[INFO] Spark Project Networking ... SKIPPED

[INFO] Spark Project Shuffle Streaming Service  SKIPPED

[INFO] Spark Project Unsafe ... SKIPPED

[INFO] Spark Project Launcher . SKIPPED

[INFO] Spark Project Core . SKIPPED

[INFO] Spark Project GraphX ... SKIPPED

[INFO] Spark Project Streaming  SKIPPED

[INFO] Spark Project Catalyst . SKIPPED

[INFO] Spark Project SQL .. SKIPPED

[INFO] Spark Project ML Local Library . SKIPPED

[INFO] Spark Project ML Library ... SKIPPED

[INFO] Spark Project Tools  SKIPPED

[INFO] Spark Project Hive . SKIPPED

[INFO] Spark Project REPL . SKIPPED

[INFO] Spark Project Assembly 

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
I tried the following; it still failed the same way. It ran on YARN, CDH 5.8.0.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "...")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "...")

myRdd =
sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")

count = myRdd.count()
print "The count is", count



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417p27427.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Alexander Pivovarov
Found 0 matching posts for *ORC v/s Parquet for Spark 2.0* in Apache Spark
User List 
http://apache-spark-user-list.1001560.n3.nabble.com/

Does anyone have a link to this discussion? I want to share it with my colleagues.

On Thu, Jul 28, 2016 at 2:35 PM, Mich Talebzadeh 
wrote:

> As far as I know Spark still lacks the ability to handle Updates or
> deletes vis-à-vis ORC transactional tables. As you may know in Hive an ORC
> transactional table can handle updates and deletes. Transactional support
> was added to Hive for ORC tables. No transactional support with Spark SQL
> on ORC tables yet. Locking and concurrency (as used by Hive) with Spark
> app running a Hive context. I am not convinced this works actually. Case in
> point, you can test it for yourself in Spark and see whether locks are
> applied in Hive metastore . In my opinion Spark value comes as a query tool
> for faster query processing (DAG + IM capability)
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 18:46, Ofir Manor  wrote:
>
>> BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
>> personally think both are great at this point).
>> But the original question was about Spark 2.0. Anyone has some insights
>> about Parquet-specific optimizations / limitations vs. ORC-specific
>> optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
>> beginning of the thread regarding Structured Streaming, but there was a
>> general claim that pre-2.0 Spark was missing many ORC optimizations, and
>> that some (all?) were added in 2.0.
>> I saw that a lot of related tickets closed in 2.0, but it would great if
>> someone close to the details can explain.
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Like anything else your mileage varies.
>>>
>>> ORC with Vectorised query execution
>>> 
>>>  is
>>> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
>>> with columnar indexes. To me that is cool. Parquet has been around and has
>>> its use case as well.
>>>
>>> I guess there is no hard and fast rule which one to use all the time.
>>> Use the one that provides best fit for the condition.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 28 July 2016 at 09:18, Jörn Franke  wrote:
>>>
 I see it more as a process of innovation and thus competition is good.
 Companies just should not follow these religious arguments but try
 themselves what suits them. There is more than software when using software
 ;)

 On 28 Jul 2016, at 01:44, Mich Talebzadeh 
 wrote:

 And frankly this is becoming some sort of religious arguments now



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 28 July 2016 at 00:01, Sudhir Babu Pothineni 
 wrote:

> It depends 

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Mich Talebzadeh
As far as I know, Spark still lacks the ability to handle updates or deletes
vis-à-vis ORC transactional tables. As you may know, in Hive an ORC
transactional table can handle updates and deletes; transactional support
was added to Hive for ORC tables. There is no transactional support with
Spark SQL on ORC tables yet, nor locking and concurrency (as used by Hive)
when a Spark app runs a Hive context. I am not convinced this works,
actually; as a case in point, you can test it for yourself in Spark and see
whether locks are applied in the Hive metastore. In my opinion, Spark's value
comes as a query tool for faster query processing (DAG + in-memory capability).

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 18:46, Ofir Manor  wrote:

> BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
> personally think both are great at this point).
> But the original question was about Spark 2.0. Anyone has some insights
> about Parquet-specific optimizations / limitations vs. ORC-specific
> optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
> beginning of the thread regarding Structured Streaming, but there was a
> general claim that pre-2.0 Spark was missing many ORC optimizations, and
> that some (all?) were added in 2.0.
> I saw that a lot of related tickets closed in 2.0, but it would great if
> someone close to the details can explain.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Like anything else your mileage varies.
>>
>> ORC with Vectorised query execution
>> 
>>  is
>> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
>> with columnar indexes. To me that is cool. Parquet has been around and has
>> its use case as well.
>>
>> I guess there is no hard and fast rule which one to use all the time. Use
>> the one that provides best fit for the condition.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 July 2016 at 09:18, Jörn Franke  wrote:
>>
>>> I see it more as a process of innovation and thus competition is good.
>>> Companies just should not follow these religious arguments but try
>>> themselves what suits them. There is more than software when using software
>>> ;)
>>>
>>> On 28 Jul 2016, at 01:44, Mich Talebzadeh 
>>> wrote:
>>>
>>> And frankly this is becoming some sort of religious arguments now
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni 
>>> wrote:
>>>
 It depends on what you are doing; here is a recent comparison of ORC and
 Parquet


 https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet

 Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as
 the System of Record on our Cloudera HDFS cluster, and our experience so far is
 good.

 Parquet is backed by Cloudera, which has more Hadoop installations.
 ORC is backed by Hortonworks, so the file format battle continues...

 Sent from my 

Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Rohit Chaddha
After looking at the comments, I am not sure what the proposed fix is.

On Fri, Jul 29, 2016 at 12:47 AM, Sean Owen  wrote:

> Ah, right. This wasn't actually resolved. Yeah your input on 15899
> would be welcome. See if the proposed fix helps.
>
> On Thu, Jul 28, 2016 at 11:52 AM, Rohit Chaddha
>  wrote:
> > Sean,
> >
> > I saw some JIRA tickets and looks like this is still an open bug (rather
> > than an improvement as marked in JIRA).
> >
> > https://issues.apache.org/jira/browse/SPARK-15893
> > https://issues.apache.org/jira/browse/SPARK-15899
> >
> > I am experimenting, but do you know of any solution off the top of your head?
> >
> >
> >
> > On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha <
> rohitchaddha1...@gmail.com>
> > wrote:
> >>
> >> I am simply trying to do
> >> session.read().json("file:///C:/data/a.json");
> >>
> >> in 2.0.0-preview it was working fine with
> >> sqlContext.read().json("C:/data/a.json");
> >>
> >>
> >> -Rohit
> >>
> >> On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen  wrote:
> >>>
> >>> Hm, file:///C:/... doesn't work? that should certainly be an absolute
> >>> URI with an absolute path. What exactly is your input value for this
> >>> property?
> >>>
> >>> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
> >>>  wrote:
> >>> > Hello Sean,
> >>> >
> >>> > I have tried both  file:/  and file:///
> >>> > But it does not work and gives the same error
> >>> >
> >>> > -Rohit
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen 
> wrote:
> >>> >>
> >>> >> IIRC that was fixed, in that this is actually an invalid URI. Use
> >>> >> file:/C:/... I think.
> >>> >>
> >>> >> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
> >>> >>  wrote:
> >>> >> > I upgraded from 2.0.0-preview to 2.0.0
> >>> >> > and I started getting the following error
> >>> >> >
> >>> >> > Caused by: java.net.URISyntaxException: Relative path in absolute
> >>> >> > URI:
> >>> >> > file:C:/ibm/spark-warehouse
> >>> >> >
> >>> >> > Any ideas how to fix this
> >>> >> >
> >>> >> > -Rohit
> >>> >
> >>> >
> >>
> >>
> >
>


Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Sean Owen
Ah, right. This wasn't actually resolved. Yeah your input on 15899
would be welcome. See if the proposed fix helps.

On Thu, Jul 28, 2016 at 11:52 AM, Rohit Chaddha
 wrote:
> Sean,
>
> I saw some JIRA tickets and looks like this is still an open bug (rather
> than an improvement as marked in JIRA).
>
> https://issues.apache.org/jira/browse/SPARK-15893
> https://issues.apache.org/jira/browse/SPARK-15899
>
> I am experimenting, but do you know of any solution off the top of your head?
>
>
>
> On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha 
> wrote:
>>
>> I am simply trying to do
>> session.read().json("file:///C:/data/a.json");
>>
>> in 2.0.0-preview it was working fine with
>> sqlContext.read().json("C:/data/a.json");
>>
>>
>> -Rohit
>>
>> On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen  wrote:
>>>
>>> Hm, file:///C:/... doesn't work? that should certainly be an absolute
>>> URI with an absolute path. What exactly is your input value for this
>>> property?
>>>
>>> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
>>>  wrote:
>>> > Hello Sean,
>>> >
>>> > I have tried both  file:/  and file:///
>>> > But it does not work and gives the same error
>>> >
>>> > -Rohit
>>> >
>>> >
>>> >
>>> > On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen  wrote:
>>> >>
>>> >> IIRC that was fixed, in that this is actually an invalid URI. Use
>>> >> file:/C:/... I think.
>>> >>
>>> >> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
>>> >>  wrote:
>>> >> > I upgraded from 2.0.0-preview to 2.0.0
>>> >> > and I started getting the following error
>>> >> >
>>> >> > Caused by: java.net.URISyntaxException: Relative path in absolute
>>> >> > URI:
>>> >> > file:C:/ibm/spark-warehouse
>>> >> >
>>> >> > Any ideas how to fix this
>>> >> >
>>> >> > -Rohit
>>> >
>>> >
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Rohit Chaddha
Sean,

I saw some JIRA tickets and looks like this is still an open bug (rather
than an improvement as marked in JIRA).

https://issues.apache.org/jira/browse/SPARK-15893
https://issues.apache.org/jira/browse/SPARK-15899

I am experimenting, but do you know of any solution off the top of your head?



On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha 
wrote:

> I am simply trying to do
> session.read().json("file:///C:/data/a.json");
>
> in 2.0.0-preview it was working fine with
> sqlContext.read().json("C:/data/a.json");
>
>
> -Rohit
>
> On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen  wrote:
>
>> Hm, file:///C:/... doesn't work? that should certainly be an absolute
>> URI with an absolute path. What exactly is your input value for this
>> property?
>>
>> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
>>  wrote:
>> > Hello Sean,
>> >
>> > I have tried both  file:/  and file:///
>> > But it does not work and gives the same error
>> >
>> > -Rohit
>> >
>> >
>> >
>> > On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen  wrote:
>> >>
>> >> IIRC that was fixed, in that this is actually an invalid URI. Use
>> >> file:/C:/... I think.
>> >>
>> >> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
>> >>  wrote:
>> >> > I upgraded from 2.0.0-preview to 2.0.0
>> >> > and I started getting the following error
>> >> >
>> >> > Caused by: java.net.URISyntaxException: Relative path in absolute
>> URI:
>> >> > file:C:/ibm/spark-warehouse
>> >> >
>> >> > Any ideas how to fix this
>> >> >
>> >> > -Rohit
>> >
>> >
>>
>
>


Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Carlo . Allocca
Solved!!
The solution is to use date_format with the “u” option.

Thank you very much.
Best,
Carlo
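
For readers hitting the same problem, a small self-contained sketch of that
approach (the sample timestamps and column names are made up, not from this
thread): date_format with the "u" pattern returns the day of week as a number
(1 = Monday ... 7 = Sunday), so no UDF is needed.

// Sketch only; the data below is a hypothetical stand-in for the
// ORD_time_window start values in the original code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder().appName("dow-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("2016-07-28 18:59:00", "2016-07-31 09:00:00").toDF("start_ts")

// "u" => day-of-week number, 1 = Monday ... 7 = Sunday
val withDow = df.withColumn("ORD_DAYOFWEEK",
  date_format(col("start_ts"), "u").cast("int"))

withDow.show()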

On 28 Jul 2016, at 18:59, carlo allocca 
> wrote:

Hi Mark,

Thanks for the suggestion.
I changed the maven entries as follows

spark-core_2.10
2.0.0

and
spark-sql_2.10
2.0.0

As a result, it worked when I removed the following line of code to compute 
DAYOFWEEK (Monday—>1 etc.):

Dataset tmp6=tmp5.withColumn("ORD_DAYOFWEEK", callUDF("computeDayOfWeek", 
tmp5.col("ORD_time_window_per_hour#3").getItem("start").cast(DataTypes.StringType)));

 this.spark.udf().register("computeDayOfWeek", new UDF1() {
@Override
  public Integer call(String myDate) throws Exception {
Date date = new SimpleDateFormat(dateFormat).parse(myDate);
Calendar c = Calendar.getInstance();
c.setTime(date);
int dayOfWeek = c.get(Calendar.DAY_OF_WEEK);
  return dayOfWeek;//myDate.length();
}
  }, DataTypes.IntegerType);



And the full stack is reported below.

Is there another way to compute DAYOFWEEK from a dateFormat="-MM-dd 
HH:mm:ss" by using built-in function? I have  checked date_format but it does 
not do it.

Any Suggestion?

Many Thanks,
Carlo




Test set: org.mksmart.amaretto.ml.DatasetPerHourVerOneTest
---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 32.658 sec <<< 
FAILURE!
testBuildDatasetNew(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time 
elapsed: 32.581 sec  <<< ERROR!
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:797)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:797)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:128)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
at 
org.mksmart.amaretto.ml.DatasetPerHourVerOneTest.testBuildDatasetNew(DatasetPerHourVerOneTest.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 

Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Rohit Chaddha
I am simply trying to do
session.read().json("file:///C:/data/a.json");

in 2.0.0-preview it was working fine with
sqlContext.read().json("C:/data/a.json");


-Rohit

On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen  wrote:

> Hm, file:///C:/... doesn't work? that should certainly be an absolute
> URI with an absolute path. What exactly is your input value for this
> property?
>
> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
>  wrote:
> > Hello Sean,
> >
> > I have tried both  file:/  and file:///
> > But it does not work and gives the same error
> >
> > -Rohit
> >
> >
> >
> > On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen  wrote:
> >>
> >> IIRC that was fixed, in that this is actually an invalid URI. Use
> >> file:/C:/... I think.
> >>
> >> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
> >>  wrote:
> >> > I upgraded from 2.0.0-preview to 2.0.0
> >> > and I started getting the following error
> >> >
> >> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> >> > file:C:/ibm/spark-warehouse
> >> >
> >> > Any ideas how to fix this
> >> >
> >> > -Rohit
> >
> >
>


Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Hatim Diab
I’m not familiar with Windows, but for unix, if the path is /data/zxy
then it’ll be 
file:///data/zxy
so I’d assume 
file://C:/

> On Jul 28, 2016, at 2:33 PM, Sean Owen  wrote:
> 
> Hm, file:///C:/... doesn't work? that should certainly be an absolute
> URI with an absolute path. What exactly is your input value for this
> property?
> 
> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
>  wrote:
>> Hello Sean,
>> 
>> I have tried both  file:/  and file:///
>> But it does not work and gives the same error
>> 
>> -Rohit
>> 
>> 
>> 
>> On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen  wrote:
>>> 
>>> IIRC that was fixed, in that this is actually an invalid URI. Use
>>> file:/C:/... I think.
>>> 
>>> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
>>>  wrote:
 I upgraded from 2.0.0-preview to 2.0.0
 and I started getting the following error
 
 Caused by: java.net.URISyntaxException: Relative path in absolute URI:
 file:C:/ibm/spark-warehouse
 
 Any ideas how to fix this
 
 -Rohit
>> 
>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Sean Owen
Hm, file:///C:/... doesn't work? that should certainly be an absolute
URI with an absolute path. What exactly is your input value for this
property?

On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
 wrote:
> Hello Sean,
>
> I have tried both  file:/  and file:///
> But it does not work and gives the same error
>
> -Rohit
>
>
>
> On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen  wrote:
>>
>> IIRC that was fixed, in that this is actually an invalid URI. Use
>> file:/C:/... I think.
>>
>> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
>>  wrote:
>> > I upgraded from 2.0.0-preview to 2.0.0
>> > and I started getting the following error
>> >
>> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
>> > file:C:/ibm/spark-warehouse
>> >
>> > Any ideas how to fix this
>> >
>> > -Rohit
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Rohit Chaddha
Hello Sean,

I have tried both  file:/  and file:///
But it does not work and gives the same error

-Rohit



On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen  wrote:

> IIRC that was fixed, in that this is actually an invalid URI. Use
> file:/C:/... I think.
>
> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
>  wrote:
> > I upgraded from 2.0.0-preview to 2.0.0
> > and I started getting the following error
> >
> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> > file:C:/ibm/spark-warehouse
> >
> > Any ideas how to fix this
> >
> > -Rohit
>


Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Sean Owen
IIRC that was fixed, in that this is actually an invalid URI. Use
file:/C:/... I think.

On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
 wrote:
> I upgraded from 2.0.0-preview to 2.0.0
> and I started getting the following error
>
> Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> file:C:/ibm/spark-warehouse
>
> Any ideas how to fix this
>
> -Rohit

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ClassTag variable in broadcast in spark 2.0 ? how to use

2016-07-28 Thread Rohit Chaddha
My bad. Please ignore this question.
I accidentally reverted to sparkContext causing the issue

On Thu, Jul 28, 2016 at 11:36 PM, Rohit Chaddha 
wrote:

> In Spark 2.0 there is an additional parameter of type ClassTag in the
> broadcast method of the SparkContext.
>
> What is this variable, and how do I broadcast now?
>
> Here is my existing code with 2.0.0-preview:
> Broadcast> b = jsc.broadcast(u.collectAsMap());
>
> What changes need to be made in 2.0 for this:
> Broadcast> b = jsc.broadcast(u.collectAsMap(), *??* );
>
> Please help
>
> Rohit
>


ClassTag variable in broadcast in spark 2.0 ? how to use

2016-07-28 Thread Rohit Chaddha
In Spark 2.0 there is an additional parameter of type ClassTag in the
broadcast method of the SparkContext.

What is this variable, and how do I broadcast now?

Here is my existing code with 2.0.0-preview:
Broadcast> b = jsc.broadcast(u.collectAsMap());

What changes need to be made in 2.0 for this:
Broadcast> b = jsc.broadcast(u.collectAsMap(), *??* );

Please help

Rohit
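
A minimal sketch of why this only shows up when calling the Scala
SparkContext directly (the lookup map below is a hypothetical stand-in for
u.collectAsMap()): in Scala the ClassTag is supplied implicitly by the
compiler, and from Java the JavaSparkContext.broadcast(...) method does not
expose a ClassTag at all, which is why switching back to the raw SparkContext
suddenly appears to require one.

// Sketch only; data is made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)
val b = sc.broadcast(lookup)   // ClassTag[Map[String, Int]] supplied implicitly

b.value("a") + b.value("b")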


Custom Image RDD and Sequence Files

2016-07-28 Thread jtgenesis
Hey all,

I was wondering what the best course of action is for processing an image
that has an involved internal structure (file headers, sub-headers, image
data, more sub-headers, more kinds of data etc). I was hoping to get some
insight on the approach I'm using and whether there is a better, more Spark
way of handling it.

I'm coming from a Hadoop approach where I convert the image to a sequence
file. Now, I'm new to both Spark and Hadoop, but I have a deeper
understanding of Hadoop, which is why I went with the sequence files. The
sequence file is chopped into key/value pairs that contain file and image
meta-data and separate key/value pairs that contain the raw image data. I
currently use a LongWritable for the key and a BytesWritable for the value.
This is a naive approach, but I plan to create a custom Writable key type that
contains information pertinent to the corresponding image data. The idea is
to create a custom Spark Partitioner, taking advantage of the key structure,
to reduce inter-cluster communication. For example, store all image tiles with
the same key.id property on the same node.

1.) Is converting the image to a Sequence File superfluous? Is it better to
do this pre-processing and create a custom key/value type another way,
through Spark or through Hadoop's Writable? It seems like Spark just
uses different flavors of Hadoop's InputFormat under the hood.

I see that Spark does have support for SequenceFiles, but I'm still not
fully clear on the extent of it.

2.)  When you read in a .seq file through sc.sequenceFIle(), it's using
SequenceFileInputFormat. This means that the number of partitions will be
determined by the number of splits, specified in the
SequenceFileInputFormat.getSplits. Do the input splits happen on key/value
boundaries? 

3.) The RDD created from Sequence Files will have the translated Scala
key/value type, but if I use a custom Hadoop Writable, will I have to do
anything on Spark/Scala side to understand it?

4.) Since I'm using a custom Hadoop Writable, is it best to register my
writable types with Kryo?

Thanks for any help!
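
Not an authoritative answer, but a minimal sketch touching points 2-4 (the
path is hypothetical; a custom Writable would be registered with Kryo the
same way as the built-in ones shown): sc.sequenceFile hands back the Writable
objects themselves, and Hadoop reuses those instances, so their contents are
copied out before caching.

// Sketch only; path and partitioning are assumptions.
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("seqfile-sketch")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LongWritable], classOf[BytesWritable]))

val sc = new SparkContext(conf)

val tiles = sc.sequenceFile("hdfs:///data/images.seq", classOf[LongWritable], classOf[BytesWritable])
  // Hadoop reuses the Writable instances, so copy the contents out before caching.
  .map { case (k, v) => (k.get, v.copyBytes) }
  .cache()

tiles.count()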



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Image-RDD-and-Sequence-Files-tp27426.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Carlo . Allocca
Hi Mark,

Thanks for the suggestion.
I changed the maven entries as follows

spark-core_2.10
2.0.0

and
spark-sql_2.10
2.0.0

As a result, it worked when I removed the following line of code to compute 
DAYOFWEEK (Monday—>1 etc.):

Dataset tmp6=tmp5.withColumn("ORD_DAYOFWEEK", callUDF("computeDayOfWeek", 
tmp5.col("ORD_time_window_per_hour#3").getItem("start").cast(DataTypes.StringType)));

 this.spark.udf().register("computeDayOfWeek", new UDF1() {
@Override
  public Integer call(String myDate) throws Exception {
Date date = new SimpleDateFormat(dateFormat).parse(myDate);
Calendar c = Calendar.getInstance();
c.setTime(date);
int dayOfWeek = c.get(Calendar.DAY_OF_WEEK);
  return dayOfWeek;//myDate.length();
}
  }, DataTypes.IntegerType);



And the full stack is reported below.

Is there another way to compute DAYOFWEEK from a dateFormat="-MM-dd 
HH:mm:ss" by using built-in function? I have  checked date_format but it does 
not do it.

Any Suggestion?

Many Thanks,
Carlo




Test set: org.mksmart.amaretto.ml.DatasetPerHourVerOneTest
---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 32.658 sec <<< 
FAILURE!
testBuildDatasetNew(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time 
elapsed: 32.581 sec  <<< ERROR!
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:797)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:797)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:128)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
at 
org.mksmart.amaretto.ml.DatasetPerHourVerOneTest.testBuildDatasetNew(DatasetPerHourVerOneTest.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at 

Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-28 Thread Rohit Chaddha
I upgraded from 2.0.0-preview to 2.0.0
and I started getting the following error

Caused by: java.net.URISyntaxException: Relative path in absolute URI:
file:C:/ibm/spark-warehouse

Any ideas how to fix this

-Rohit
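
One workaround discussed around the related JIRA tickets (a sketch under
assumptions, not a confirmed fix from this thread) is to set
spark.sql.warehouse.dir explicitly to a well-formed absolute file: URI when
building the session, so Spark does not construct the invalid
"file:C:/..." form itself. The path below is only an example.

// Sketch only; verify against your Spark 2.0 build and Windows paths.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-uri-workaround")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/ibm/spark-warehouse")
  .getOrCreate()

spark.read.json("file:///C:/data/a.json").show()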


Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Ofir Manor
BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
personally think both are great at this point).
But the original question was about Spark 2.0. Does anyone have insights
about Parquet-specific optimizations / limitations vs. ORC-specific
optimizations / limitations in pre-2.0 vs. 2.0? I've put one at the
beginning of the thread regarding Structured Streaming, but there was a
general claim that pre-2.0 Spark was missing many ORC optimizations, and
that some (all?) were added in 2.0.
I saw that a lot of related tickets were closed in 2.0, but it would be great
if someone close to the details could explain.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
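
Not a full answer to the question above, but these are the 2.0-era settings
most relevant to such a comparison; a sketch for experimentation (the paths
are hypothetical, and the behaviour/defaults noted in the comments should be
verified against your own build).

// Sketch only; paths and data are made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("format-comparison").getOrCreate()

spark.conf.set("spark.sql.parquet.filterPushdown", "true")          // Parquet predicate pushdown
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")  // vectorized Parquet reader in 2.0
spark.conf.set("spark.sql.orc.filterPushdown", "true")              // ORC predicate pushdown (off by default)

val parquetDF = spark.read.parquet("/data/events_parquet")
val orcDF     = spark.read.orc("/data/events_orc")

parquetDF.filter("event_type = 'click'").count()
orcDF.filter("event_type = 'click'").count()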

On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh 
wrote:

> Like anything else your mileage varies.
>
> ORC with Vectorised query execution
>  
> is
> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
> with columnar indexes. To me that is cool. Parquet has been around and has
> its use case as well.
>
> I guess there is no hard and fast rule which one to use all the time. Use
> the one that provides best fit for the condition.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 09:18, Jörn Franke  wrote:
>
>> I see it more as a process of innovation and thus competition is good.
>> Companies just should not follow these religious arguments but try
>> themselves what suits them. There is more than software when using software
>> ;)
>>
>> On 28 Jul 2016, at 01:44, Mich Talebzadeh 
>> wrote:
>>
>> And frankly this is becoming some sort of religious arguments now
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni 
>> wrote:
>>
>>> It depends on what you are doing; here is a recent comparison of ORC and
>>> Parquet
>>>
>>>
>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as
>>> the System of Record on our Cloudera HDFS cluster, and our experience so far is
>>> good.
>>>
>>> Parquet is backed by Cloudera, which has more Hadoop installations.
>>> ORC is backed by Hortonworks, so the file format battle continues...
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty 
>>> wrote:
>>>
>>> It seems like the Parquet format is comparatively better than ORC when the
>>> dataset is log data without nested structures. Is this a fair understanding?
>>> On Jul 27, 2016 1:30 PM, "Jörn Franke"  wrote:
>>>
 Kudu has, from my impression, been designed to offer something
 between HBase and Parquet for write-intensive loads - it is not faster for
 warehouse-type querying compared to Parquet (merely slower, because that
 is not its use case). I assume this is still its strategy.

 For some scenarios it could make sense together with parquet and Orc.
 However I am not sure what the advantage towards using hbase + parquet and
 Orc.

 On 27 Jul 2016, at 11:47, "u...@moosheimer.com " <
 u...@moosheimer.com > wrote:

 Hi Gourav,

 Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an
 in-memory db with data storage, while Parquet is "only" a columnar
 storage format.

 As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
 that's more a wish :-).

 Regards,
 Uwe

 Mit freundlichen Grüßen / best regards
 Kay-Uwe Moosheimer

 Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <
 

Re: RDD vs Dataset performance

2016-07-28 Thread Reynold Xin
The performance difference is coming from the need to serialize and
deserialize data to AnnotationText. The extra stage is probably very quick
and shouldn't impact much.

If you try to cache the RDD using a serialized storage level, it will slow down
a lot too.
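
A small self-contained sketch of that comparison (the case class fields and
generated data below are hypothetical stand-ins for the 10M-record dataset in
the thread): persisting the RDD with a serialized storage level forces a
deserialization on every pass, similar to the cost the Dataset path pays.

// Sketch only; data is made up.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

case class AnnotationText(docId: Long, text: String)

val spark = SparkSession.builder().appName("serialized-cache-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = (1L to 100000L)
  .map(i => AnnotationText(i, if (i % 100 == 0) "mutational" else "other"))
  .toDS()

// Serialized cache: each action now deserializes objects, like the Dataset path.
val aRDD = ds.rdd.repartition(4).persist(StorageLevel.MEMORY_ONLY_SER)
aRDD.count()

aRDD.filter(_.text == "mutational").count()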


On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath 
wrote:

> I started playing around with Datasets on Spark 2.0 this morning and I'm
> surprised by the significant performance difference I'm seeing between an
> RDD and a Dataset for a very basic example.
>
>
> I've defined a simple case class called AnnotationText that has a handful
> of fields.
>
>
> I create a Dataset[AnnotationText] with my data and repartition(4) this on
> one of the columns and cache the resulting dataset as ds (force the cache
> by executing a count action).  Everything looks good and I have more than
> 10M records in my dataset ds.
>
> When I execute the following:
>
> ds.filter(textAnnotation => textAnnotation.text ==
> "mutational".toLowerCase).count
>
> It consistently finishes in just under 3 seconds.  One of the things I
> notice is that it has 3 stages.  The first stage is skipped (as this had to
> do with creation ds and it was already cached).  The second stage appears
> to do the filtering (requires 4 tasks) but interestingly it shuffles
> output.  The third stage (requires only 1 task) appears to count the
> results of the shuffle.
>
> When I look at the cached dataset (on 4 partitions) it is 82.6MB.
>
> I then decided to convert the ds dataset to an RDD as follows,
> repartition(4) and cache.
>
> val aRDD = ds.rdd.repartition(4).cache
> aRDD.count
> So, I now have an RDD[AnnotationText]
>
> When I execute the following:
>
> aRDD.filter(textAnnotation => textAnnotation.text ==
> "mutational".toLowerCase).count
>
> It consistently finishes in just under half a second.  One of the things I
> notice is that it only has 2 stages.  The first stage is skipped (as this
> had to do with creation of aRDD and it was already cached).  The second
> stage appears to do the filtering and count(requires 4 tasks).
> Interestingly, there is no shuffle (or subsequently 3rd stage).
>
> When I look at the cached RDD (on 4 partitions) it is 2.9GB.
>
>
> I was surprised how significant the cached storage difference was between
> the Dataset (82.6MB) and the RDD (2.9GB) version of the same content.  Is
> this kind of difference to be expected?
>
> While I like the smaller size for the Dataset version, I was confused as
> to why the performance for the Dataset version was so much slower (2.5s vs
> .5s).  I suspect it might be attributed to the shuffle and third stage
> required by the Dataset version but I'm not sure. I was under the
> impression that Datasets should (would) be faster in many use cases (such
> as the one I'm using above).  Am I doing something wrong or is this to be
> expected?
>
> Thanks.
>
> Darin.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Unable to create a dataframe from json dstream using pyspark

2016-07-28 Thread Sunil Kumar Chinnamgari
Hi,



I am attempting to create a dataframe from JSON in a DStream, but the code below
does not seem to get the dataframe right:
import sys
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext

def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise IOError("Invalid usage; the correct format is:\nquadrant_count.py  ")

    # Initialize a SparkContext with a name
    spc = SparkContext(appName="jsonread")
    sqlContext = SQLContext(spc)

    # Create a StreamingContext with a batch interval of 2 seconds
    stc = StreamingContext(spc, 2)

    # Checkpointing feature
    stc.checkpoint("checkpoint")

    # Creating a DStream to connect to hostname:port (like localhost:)
    lines = stc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    lines.pprint()

    parsed = lines.map(lambda x: json.loads(x))

    def process(time, rdd):
        print("= %s =" % str(time))
        try:
            # Get the singleton instance of SQLContext
            sqlContext = getSqlContextInstance(rdd.context)
            # Convert RDD[String] to RDD[Row] to DataFrame
            rowRdd = rdd.map(lambda w: Row(word=w))
            wordsDataFrame = sqlContext.createDataFrame(rowRdd)
            # Register as table
            wordsDataFrame.registerTempTable("mytable")
            testDataFrame = sqlContext.sql("select summary from mytable")
            print(testDataFrame.show())
            print(testDataFrame.printSchema())
        except:
            pass

    parsed.foreachRDD(process)
    stc.start()
    # Wait for the computation to terminate
    stc.awaitTermination()
There are no errors, and when the script runs it does read the JSON from the
streaming context successfully; however, it does not print the values in
summary or the dataframe schema.
Example json I am attempting to read -
{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName": 
"cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0, 0], 
"reviewText": "Not much to write about here, but it does exactly what it's 
supposed to. filters out the pop sounds. now my recordings are much more crisp. 
it is one of the lowest prices pop filters on amazon so might as well buy it, 
they honestly work the same despite their pricing,", "overall": 5.0, "summary": 
"good", "unixReviewTime": 1393545600, "reviewTime": "02 28, 2014"}
I am an absolute newcomer to Spark Streaming and started working on pet projects
by reading the documentation. Any help and guidance is greatly appreciated.
Best Regards,
Sunil Kumar Chinnamgari

  

RDD vs Dataset performance

2016-07-28 Thread Darin McBeath
I started playing around with Datasets on Spark 2.0 this morning and I'm 
surprised by the significant performance difference I'm seeing between an RDD 
and a Dataset for a very basic example.


I've defined a simple case class called AnnotationText that has a handful of 
fields.


I create a Dataset[AnnotationText] with my data and repartition(4) this on one 
of the columns and cache the resulting dataset as ds (force the cache by 
executing a count action).  Everything looks good and I have more than 10M 
records in my dataset ds.

When I execute the following:

ds.filter(textAnnotation => textAnnotation.text == 
"mutational".toLowerCase).count 

It consistently finishes in just under 3 seconds.  One of the things I notice 
is that it has 3 stages.  The first stage is skipped (as this had to do with 
creation ds and it was already cached).  The second stage appears to do the 
filtering (requires 4 tasks) but interestingly it shuffles output.  The third 
stage (requires only 1 task) appears to count the results of the shuffle.  

When I look at the cached dataset (on 4 partitions) it is 82.6MB.

I then decided to convert the ds dataset to an RDD as follows, repartition(4) 
and cache.

val aRDD = ds.rdd.repartition(4).cache
aRDD.count
So, I now have an RDD[AnnotationText]

When I execute the following:

aRDD.filter(textAnnotation => textAnnotation.text == 
"mutational".toLowerCase).count

It consistently finishes in just under half a second.  One of the things I 
notice is that it only has 2 stages.  The first stage is skipped (as this had 
to do with creation of aRDD and it was already cached).  The second stage 
appears to do the filtering and count(requires 4 tasks).  Interestingly, there 
is no shuffle (or subsequently 3rd stage).   

When I look at the cached RDD (on 4 partitions) it is 2.9GB.


I was surprised how significant the cached storage difference was between the 
Dataset (82.6MB) and the RDD (2.9GB) version of the same content.  Is this kind 
of difference to be expected?

While I like the smaller size for the Dataset version, I was confused as to why 
the performance for the Dataset version was so much slower (2.5s vs .5s).  I 
suspect it might be attributed to the shuffle and third stage required by the 
Dataset version but I'm not sure. I was under the impression that Datasets 
should (would) be faster in many use cases (such as the one I'm using above).  
Am I doing something wrong or is this to be expected?

Thanks.

Darin.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
BTW, I also tried yarn. Same error. 

When I ran the script, I used the real credentials for S3, which are omitted
in this post. Sorry about that.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417p27425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Mark Hamstra
Don't use Spark 2.0.0-preview.  That was a preview release with known
issues, and was intended to be used only for early, pre-release testing
purposes.  Spark 2.0.0 is now released, and you should be using that.

On Thu, Jul 28, 2016 at 3:48 AM, Carlo.Allocca 
wrote:

> and, of course I am using
>
>  
> org.apache.spark
> spark-core_2.11
> 2.0.0-preview
> 
>
>
> 
> org.apache.spark
> spark-sql_2.11
> 2.0.0-preview
> jar
> 
>
>
> Is the below problem/issue related to the experimental version of SPARK
> 2.0.0?
>
> Many Thanks for your help and support.
>
> Best Regards,
> carlo
>
> On 28 Jul 2016, at 11:14, Carlo.Allocca  wrote:
>
> I have also found the following two related links:
>
> 1)
> https://github.com/apache/spark/commit/947b9020b0d621bc97661a0a056297e6889936d3
> 2) https://github.com/apache/spark/pull/12433
>
> which both explain why it happens but nothing about what to do to solve
> it.
>
> Do you have any suggestion/recommendation?
>
> Many thanks.
> Carlo
>
> On 28 Jul 2016, at 11:06, carlo allocca  wrote:
>
> Hi Rui,
>
> Thanks for the promptly reply.
> No, I am not using Mesos.
>
> Ok. I am writing code to build a suitable dataset for my needs, as in the
> following:
>
> == Session configuration:
>
>  SparkSession spark = SparkSession
> .builder()
> .master("local[6]") //
> .appName("DatasetForCaseNew")
> .config("spark.executor.memory", "4g")
> .config("spark.shuffle.blockTransferService", "nio")
> .getOrCreate();
>
>
> public Dataset buildDataset(){
> ...
>
> // STEP A
> // Join prdDS with cmpDS
> Dataset prdDS_Join_cmpDS
> = res1
>   .join(res2,
> (res1.col("PRD_asin#100")).equalTo(res2.col("CMP_asin")), "inner");
>
> prdDS_Join_cmpDS.take(1);
>
> // STEP B
> // Join prdDS with cmpDS
> Dataset prdDS_Join_cmpDS_Join
> = prdDS_Join_cmpDS
>   .join(res3,
> prdDS_Join_cmpDS.col("PRD_asin#100").equalTo(res3.col("ORD_asin")),
> "inner");
> prdDS_Join_cmpDS_Join.take(1);
> prdDS_Join_cmpDS_Join.show();
>
> }
>
>
> The exception is thrown when the computation reach the STEP B, until STEP
> A is fine.
>
> Is there anything wrong or missing?
>
> Thanks for your help in advance.
>
> Best Regards,
> Carlo
>
>
>
>
>
> === STACK TRACE
>
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 422.102
> sec <<< FAILURE!
> testBuildDataset(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time
> elapsed: 421.994 sec  <<< ERROR!
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:102)
> at
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35)
> at
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:565)
> at
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
> at
> 

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread Andy Davidson
Hi Freedafeng

Can you tell us a little more? I.e., can you paste your code and error message?

Andy

From:  freedafeng 
Date:  Thursday, July 28, 2016 at 9:21 AM
To:  "user @spark" 
Subject:  Re: spark 1.6.0 read s3 files error.

> The question is, what is the cause of the problem? and how to fix it? Thanks.
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-
> error-tp27417p27424.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 




Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread freedafeng
The question is: what is the cause of the problem, and how do I fix it? Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417p27424.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Gourav Sengupta
There is an option to join small files up. If you are unable to find it
just let me know.


Regards,
Gourav
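
In case it helps while looking for that option, one way to "join small files
up" on the writing side is sketched below (the bucket and paths are made up):
coalesce each batch's DataFrame down to a handful of partitions before
writing, so every interval produces a few larger files rather than many tiny
ones, at the cost of some write parallelism.

// Sketch only; bucket/paths are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-writes").getOrCreate()

// Hypothetical stand-in for one mini-batch's DataFrame.
val df = spark.read.json("s3a://my-bucket/raw/2016/07/28/")

df.coalesce(4)
  .write
  .mode("append")
  .json("s3a://my-bucket/compacted/2016/07/28/")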

On Thu, Jul 28, 2016 at 4:58 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Pedro
>
> Thanks for the explanation. I started watching your repo. In the short
> term I think I am going to try concatenating my small files into 64MB and
> using HDFS. My spark streaming app is implemented Java and uses data
> frames. It writes to s3. My batch processing is written in python It reads
> data into data frames.
>
> It's probably a lot of work to make your solution work in these other
> contexts.
>
> Here is another use case you might be interested in
> Writing multiple files to S3 is really slow. It causes a lot of problems
> for my streaming app. Bad things happen if your processing time exceeds
> your window length. Our streaming app must save all the input. For each
> mini batch we split the input into as many as 30 different data sets. Each
> one needs to be written to S3.
>
> As a temporary work around I use an executor service to try and get more
> concurrent writes. Ideally the spark frame work would provide support for
> async IO, and hopefully the S3 performance issue would be improved. Here is
> my code if you are interested
>
>
> public class StreamingKafkaGnipCollector {
>
> static final int POOL_SIZE = 30;
>
> static ExecutorService executor = Executors.newFixedThreadPool(
> POOL_SIZE);
>
> …
>
> private static void saveRawInput(SQLContext sqlContext,
> JavaPairInputDStream messages, String outputURIBase) {
>
> JavaDStream lines = messages.map(new Function String>, String>() {
>
> private static final long serialVersionUID = 1L;
>
>
> @Override
>
> public String call(Tuple2 tuple2) {
>
> //logger.warn("TODO _2:{}", tuple2._2);
>
> return tuple2._2();
>
> }
>
> });
>
>
> lines.foreachRDD(new VoidFunction2() {
>
> @Override
>
> public void call(JavaRDD jsonRDD, Time time) throws Exception {
> …
>
> // df.write().json("s3://"); is very slow
>
> // run saves concurrently
>
> List saveData = new ArrayList(100);
>
> for (String tag: tags) {
>
> DataFrame saveDF = activityDF.filter(activityDF.col(tagCol).equalTo(tag));
>
> String dirPath = createPath(outputURIBase, date, tag, milliSeconds);
>
> saveData.add(new SaveData(saveDF, dirPath));
>
> }
>
>
> saveImpl(saveData, executor); // concurrent writes to S3
>
> }
>
> private void saveImpl(List saveData, ExecutorService executor) {
>
> List runningThreads = new ArrayList(POOL_SIZE);
>
> for(SaveData data : saveData) {
>
> SaveWorker worker = new SaveWorker(data);
>
> Future f = executor.submit(worker);
>
> runningThreads.add(f);
>
> }
>
> // wait for all the workers to complete
>
> for (Future worker : runningThreads) {
>
> try {
>
> worker.get();
>
> logger.debug("worker completed");
>
> } catch (InterruptedException e) {
>
> logger.error("", e);
>
> } catch (ExecutionException e) {
>
> logger.error("", e);
>
> }
>
> }
>
> }
>
>
> static class SaveData {
>
> private DataFrame df;
>
> private String path;
>
>
> SaveData(DataFrame df, String path) {
>
> this.df = df;
>
> this.path = path;
>
> }
>
> }
>
> static class SaveWorker implements Runnable {
>
> SaveData data;
>
>
> public SaveWorker(SaveData data) {
>
> this.data = data;
>
> }
>
>
> @Override
>
> public void run() {
>
> if (data.df.count() >= 1) {
>
> data.df.write().json(data.path);
>
> }
>
> }
>
> }
>
> }
>
>
> From: Pedro Rodriguez 
> Date: Wednesday, July 27, 2016 at 8:40 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: performance problem when reading lots of small files created
> by spark streaming.
>
> There are a few blog posts that detail one possible/likely issue for
> example:
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
>
> TLDR: The hadoop libraries spark uses assumes that its input comes from a
>  file system (works with HDFS) however S3 is a key value store, not a file
> system. Somewhere along the line, this makes things very slow. Below I
> describe their approach and a library I am working on to solve this problem.
>
> (Much) Longer Version (with a shiny new library in development):
> So far in my reading of source code, Hadoop attempts to actually read from
> S3 which can be expensive particularly since it does so from a single
> driver core (different from listing files, actually reading them, I can
> find the source code and link it later if you would like). The concept
> explained above is to instead use the AWS sdk to list files then distribute
> the files names as a collection with sc.parallelize, then read them in
> parallel. I found this worked, but lacking in a few ways so I started this
> project: https://github.com/EntilZha/spark-s3
>
> This takes that idea further by:
> 1. Rather than sc.parallelize, implement the RDD interface where each

Re: Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-28 Thread Bryan Cutler
That's the correct fix.  I have this done along with a few other Java
examples that still use the old MLlib Vectors in this PR that's waiting for
review https://github.com/apache/spark/pull/14308
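
For anyone following along, a small sketch of the two routes around the
VectorUDT mismatch described in this thread (the sample rows and column names
are made up): either build the features column with the new ml.linalg vectors
from the start, or convert an existing mllib-based DataFrame with
MLUtils.convertVectorColumnsToML before fitting.

// Sketch only; sample values are hypothetical.
import org.apache.spark.ml.linalg.Vectors          // new ML vectors expected by 2.0 estimators
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aft-vectors").master("local[*]").getOrCreate()
import spark.implicits._

// Option 1: create the features column with ml.linalg from the start.
val train = Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346, 2.158))
).toDF("label", "censor", "features")

// Option 2: if a DataFrame was built with the old mllib vectors, convert it:
// val converted = MLUtils.convertVectorColumnsToML(oldMllibDF, "features")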

On Jul 28, 2016 5:14 AM, "Robert Goodman"  wrote:

> I changed import in the sample from
>
> import org.apache.spark.mllib.linalg.*;
>
> to
>
>import org.apache.spark.ml.linalg.*;
>
> and the sample now runs.
>
>Thanks
>  Bob
>
>
> On Wed, Jul 27, 2016 at 1:33 PM, Robert Goodman  wrote:
> > I tried to run the JavaAFTSurvivalRegressionExample on Spark 2.0 and the
> > example doesn't work. It looks like the problem is that the example is
> using
> > the MLLib Vector/VectorUDT to create the DataSet which needs to be
> converted
> > using MLUtils before using in the model. I haven't actually tried this
> yet.
> >
> > When I run the example (/bin/run-example
> > ml.JavaAFTSurvivalRegressionExample), I get the following stack trace
> >
> > Exception in thread "main" java.lang.IllegalArgumentException:
> requirement
> > failed: Column features must be of type
> > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
> > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
> > at scala.Predef$.require(Predef.scala:224)
> > at
> >
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
> > at
> >
> org.apache.spark.ml.regression.AFTSurvivalRegressionParams$class.validateAndTransformSchema(AFTSurvivalRegression.scala:106)
> > at
> >
> org.apache.spark.ml.regression.AFTSurvivalRegression.validateAndTransformSchema(AFTSurvivalRegression.scala:126)
> > at
> >
> org.apache.spark.ml.regression.AFTSurvivalRegression.fit(AFTSurvivalRegression.scala:199)
> > at
> >
> org.apache.spark.examples.ml.JavaAFTSurvivalRegressionExample.main(JavaAFTSurvivalRegressionExample.java:67)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> > at
> >
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
> > at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> >
> > Are you suppose to be able use the ML version of VectorUDT? The Spark 2.0
> > API docs for Java, don't show the class but I was able to import the
> class
> > into a java program.
> >
> >Thanks
> >  Bob
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Andy Davidson
Hi Pedro

Thanks for the explanation. I started watching your repo. In the short term
I think I am going to try concatenating my small files into 64MB and using
HDFS. My spark streaming app is implemented Java and uses data frames. It
writes to s3. My batch processing is written in python It reads data into
data frames.

It's probably a lot of work to make your solution work in these other
contexts.

Here is another use case you might be interested in
Writing multiple files to S3 is really slow. It causes a lot of problems for
my streaming app. Bad things happen if your processing time exceeds your
window length. Our streaming app must save all the input. For each mini
batch we split the input into as many as 30 different data sets. Each one
needs to be written to S3.

As a temporary work around I use an executor service to try and get more
concurrent writes. Ideally the spark frame work would provide support for
async IO, and hopefully the S3 performance issue would be improved. Here is
my code if you are interested


public class StreamingKafkaGnipCollector {

static final int POOL_SIZE = 30;

static ExecutorService executor =
Executors.newFixedThreadPool(POOL_SIZE);


…

private static void saveRawInput(SQLContext sqlContext,
JavaPairInputDStream messages, String outputURIBase) {

JavaDStream lines = messages.map(new Function, String>() {

private static final long serialVersionUID = 1L;



@Override

public String call(Tuple2 tuple2) {

//logger.warn("TODO _2:{}", tuple2._2);

return tuple2._2();

}

});



lines.foreachRDD(new VoidFunction2() {

@Override

public void call(JavaRDD jsonRDD, Time time) throws Exception {

…
// df.write().json("s3://"); is very slow

// run saves concurrently

List saveData = new ArrayList(100);

for (String tag: tags) {

DataFrame saveDF = activityDF.filter(activityDF.col(tagCol).equalTo(tag));

String dirPath = createPath(outputURIBase, date, tag, milliSeconds);

saveData.add(new SaveData(saveDF, dirPath));

}



saveImpl(saveData, executor); // concurrent writes to S3

}

private void saveImpl(List saveData, ExecutorService executor) {

List runningThreads = new ArrayList(POOL_SIZE);

for(SaveData data : saveData) {

SaveWorker worker = new SaveWorker(data);

Future f = executor.submit(worker);

runningThreads.add(f);

}

// wait for all the workers to complete

for (Future worker : runningThreads) {

try {

worker.get();

logger.debug("worker completed");

} catch (InterruptedException e) {

logger.error("", e);

} catch (ExecutionException e) {

logger.error("", e);

}

} 

}



static class SaveData {

private DataFrame df;

private String path;



SaveData(DataFrame df, String path) {

this.df = df;

this.path = path;

}

}

static class SaveWorker implements Runnable {

SaveData data;



public SaveWorker(SaveData data) {

this.data = data;

}



@Override

public void run() {

if (data.df.count() >= 1) {

data.df.write().json(data.path);

}

}

}

}



From:  Pedro Rodriguez 
Date:  Wednesday, July 27, 2016 at 8:40 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: performance problem when reading lots of small files created
by spark streaming.

> There are a few blog posts that detail one possible/likely issue for example:
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
> 
> TLDR: The hadoop libraries spark uses assumes that its input comes from a
> file system (works with HDFS) however S3 is a key value store, not a file
> system. Somewhere along the line, this makes things very slow. Below I
> describe their approach and a library I am working on to solve this problem.
> 
> (Much) Longer Version (with a shiny new library in development):
> So far in my reading of source code, Hadoop attempts to actually read from S3
> which can be expensive particularly since it does so from a single driver core
> (different from listing files, actually reading them, I can find the source
> code and link it later if you would like). The concept explained above is to
> instead use the AWS sdk to list files then distribute the files names as a
> collection with sc.parallelize, then read them in parallel. I found this
> worked, but lacking in a few ways so I started this project:
> https://github.com/EntilZha/spark-s3
> 
> This takes that idea further by:
> 1. Rather than sc.parallelize, implement the RDD interface where each
> partition is defined by the files it needs to read (haven't gotten to
> DataFrames yet)
> 2. At the driver node, use the AWS SDK to list all the files with their size
> (listing is fast), then run the Least Processing Time Algorithm to sift the
> files into roughly balanced partitions by size
> 3. API: S3Context(sc).textFileByPrefix("bucket", "file1",
> "folder2").regularRDDOperationsHere or import implicits and do
> sc.s3.textFileByPrefix
> 
> 

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Mich Talebzadeh
Like anything else your mileage varies.

ORC with Vectorised query execution

is
the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
with columnar indexes. To me that is cool. Parquet has been around and has
its use case as well.

I guess there is no hard and fast rule which one to use all the time. Use
the one that provides best fit for the condition.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 09:18, Jörn Franke  wrote:

> I see it more as a process of innovation and thus competition is good.
> Companies just should not follow these religious arguments but try
> themselves what suits them. There is more than software when using software
> ;)
>
> On 28 Jul 2016, at 01:44, Mich Talebzadeh 
> wrote:
>
> And frankly this is becoming some sort of religious arguments now
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 00:01, Sudhir Babu Pothineni 
> wrote:
>
>> It depends on what you are dong, here is the recent comparison of ORC,
>> Parquet
>>
>>
>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> Although from ORC authors, I thought fair comparison, We use ORC as
>> System of Record on our Cloudera HDFS cluster, our experience is so far
>> good.
>>
>> Perquet is backed by Cloudera, which has more installations of Hadoop.
>> ORC is by Hortonworks, so battle of file format continues...
>>
>> Sent from my iPhone
>>
>> On Jul 27, 2016, at 4:54 PM, janardhan shetty 
>> wrote:
>>
>> Seems like parquet format is better comparatively to orc when the dataset
>> is log data without nested structures? Is this fair understanding ?
>> On Jul 27, 2016 1:30 PM, "Jörn Franke"  wrote:
>>
>>> Kudu has been from my impression be designed to offer somethings between
>>> hbase and parquet for write intensive loads - it is not faster for
>>> warehouse type of querying compared to parquet (merely slower, because that
>>> is not its use case).   I assume this is still the strategy of it.
>>>
>>> For some scenarios it could make sense together with parquet and Orc.
>>> However I am not sure what the advantage towards using hbase + parquet and
>>> Orc.
>>>
>>> On 27 Jul 2016, at 11:47, "u...@moosheimer.com " <
>>> u...@moosheimer.com > wrote:
>>>
>>> Hi Gourav,
>>>
>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
>>> memory db with data storage while Parquet is "only" a columnar
>>> storage format.
>>>
>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
>>> that's more a wish :-).
>>>
>>> Regards,
>>> Uwe
>>>
>>> Mit freundlichen Grüßen / best regards
>>> Kay-Uwe Moosheimer
>>>
>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <
>>> gourav.sengu...@gmail.com>:
>>>
>>> Gosh,
>>>
>>> whether ORC came from this or that, it runs queries in HIVE with TEZ at
>>> a speed that is better than SPARK.
>>>
>>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>>> someone might just start saying that KUDA has difficult lineage as well.
>>> After all dynastic rules dictate.
>>>
>>> Personally I feel that if something stores my data compressed and makes
>>> me access it faster I do not care where it comes from or how difficult the
>>> child birth was :)
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>> sbpothin...@gmail.com> wrote:
>>>
 Just correction:

 ORC Java libraries from Hive are forked into Apache ORC. Vectorization
 default.

 Do not know If Spark leveraging this new repo?

 
  org.apache.orc
 orc
 1.1.2
 pom
 








Re: Guys is this some form of Spam or someone has left his auto-reply loose LOL

2016-07-28 Thread Mich Talebzadeh
He says he is enjoying his holidays. Do we want to disturb him? :)

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 16:38, Pedro Rodriguez  wrote:

> Same here, but maybe this is a really urgent matter we need to contact him
> about... or just make a filter
>
> On Thu, Jul 28, 2016 at 7:59 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>> -- Forwarded message --
>> From: Geert Van Landeghem [Napoleon Games NV] <
>> g.vanlandeg...@napoleongames.be>
>> Date: 28 July 2016 at 14:38
>> Subject: Re: Re: Is spark-1.6.1-bin-2.6.0 compatible with
>> hive-1.1.0-cdh5.7.1
>> To: Mich Talebzadeh 
>>
>>
>> Hello,
>>
>> I am enjoying holidays untill the end of august, for urgent matters
>> contact the BI department on 702 or 703 or send an email to
>> b...@napoleongames.be.
>>
>> For really urgent matters contact me on my mobile phone: +32 477 75 95 33
>> .
>>
>> kind regards
>> Geert
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: Guys is this some form of Spam or someone has left his auto-reply loose LOL

2016-07-28 Thread Pedro Rodriguez
Same here, but maybe this is a really urgent matter we need to contact him
about... or just make a filter

On Thu, Jul 28, 2016 at 7:59 AM, Mich Talebzadeh 
wrote:

>
> -- Forwarded message --
> From: Geert Van Landeghem [Napoleon Games NV] <
> g.vanlandeg...@napoleongames.be>
> Date: 28 July 2016 at 14:38
> Subject: Re: Re: Is spark-1.6.1-bin-2.6.0 compatible with
> hive-1.1.0-cdh5.7.1
> To: Mich Talebzadeh 
>
>
> Hello,
>
> I am enjoying holidays untill the end of august, for urgent matters
> contact the BI department on 702 or 703 or send an email to
> b...@napoleongames.be.
>
> For really urgent matters contact me on my mobile phone: +32 477 75 95 33.
>
> kind regards
> Geert
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Pls assist: need to create an udf that returns a LabeledPoint in pyspark

2016-07-28 Thread Marco Mistroni
hi all
 could anyone assist?
i need to create a udf function that returns a LabeledPoint
I read that in pyspark (1.6) LabeledPoint is not supported and i have to
create
a StructType

anyone can point me in some directions?

kr
  marco


Guys is this some form of Spam or someone has left his auto-reply loose LOL

2016-07-28 Thread Mich Talebzadeh
-- Forwarded message --
From: Geert Van Landeghem [Napoleon Games NV] <
g.vanlandeg...@napoleongames.be>
Date: 28 July 2016 at 14:38
Subject: Re: Re: Is spark-1.6.1-bin-2.6.0 compatible with
hive-1.1.0-cdh5.7.1
To: Mich Talebzadeh 


Hello,

I am enjoying holidays untill the end of august, for urgent matters contact
the BI department on 702 or 703 or send an email to b...@napoleongames.be.

For really urgent matters contact me on my mobile phone: +32 477 75 95 33.

kind regards
Geert


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Is spark-1.6.1-bin-2.6.0 compatible with hive-1.1.0-cdh5.7.1

2016-07-28 Thread Mich Talebzadeh
Ok does it create a derby database and comes back to prompt? For example
does spark-sql work OK.

If it cannot find the metastore it will create an empty derby database in
the same directory and at prompt you can  type show databases; and that
will only show default!

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 13:10, Mohammad Tariq  wrote:

> Hi Mich,
>
> Thank you so much for the prompt response!
>
> I do have a copy of hive-site.xml in spark conf directory.
>
>
> On Thursday, July 28, 2016, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> This line
>>
>> 2016-07-28 04:36:01,814] INFO Property hive.metastore.integral.jdo.pushdown
>> unknown - will be ignored (DataNucleus.Persistence:77)
>>
>> telling me that you do don't seem to have the softlink to hive-site.xml
>> in $SPARK_HOME/conf
>>
>> hive-site.xml -> /usr/lib/hive/conf/hive-site.xml
>>
>> I suggest you check that. That is the reason it cannot find you Hive
>> metastore
>>
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 July 2016 at 12:45, Mohammad Tariq  wrote:
>>
>>> Could anyone please help me with this? I have been using the same
>>> version of Spark with CDH-5.4.5 successfully so far. However after a recent
>>> CDH upgrade I'm not able to run the same Spark SQL module against
>>> hive-1.1.0-cdh5.7.1.
>>>
>>> When I try to run my program Spark tries to connect to local derby Hive
>>> metastore instead of the configured MySQL metastore. I have all the
>>> required jars along with hive-site.xml in place though. There is no change
>>> in the setup.
>>>
>>> This is the exception which I'm getting :
>>>
>>> [2016-07-28 04:36:01,207] INFO Initializing execution hive, version
>>> 1.2.1 (org.apache.spark.sql.hive.HiveContext:58)
>>>
>>> [2016-07-28 04:36:01,231] INFO Inspected Hadoop version: 2.6.0-cdh5.7.1
>>> (org.apache.spark.sql.hive.client.ClientWrapper:58)
>>>
>>> [2016-07-28 04:36:01,232] INFO Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>>> 2.6.0-cdh5.7.1 (org.apache.spark.sql.hive.client.ClientWrapper:58)
>>>
>>> [2016-07-28 04:36:01,520] INFO 0: Opening raw store with implemenation
>>> class:org.apache.hadoop.hive.metastore.ObjectStore
>>> (org.apache.hadoop.hive.metastore.HiveMetaStore:638)
>>>
>>> [2016-07-28 04:36:01,548] INFO ObjectStore, initialize called
>>> (org.apache.hadoop.hive.metastore.ObjectStore:332)
>>>
>>> [2016-07-28 04:36:01,814] INFO Property
>>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>>> (DataNucleus.Persistence:77)
>>>
>>> [2016-07-28 04:36:01,815] INFO Property datanucleus.cache.level2 unknown
>>> - will be ignored (DataNucleus.Persistence:77)
>>>
>>> [2016-07-28 04:36:02,417] WARN Retrying creating default database after
>>> error: Unexpected exception caught.
>>> (org.apache.hadoop.hive.metastore.HiveMetaStore:671)
>>>
>>> javax.jdo.JDOFatalInternalException: Unexpected exception caught.
>>>
>>> at
>>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
>>>
>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>>>
>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>>>
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:410)
>>>
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:439)
>>>
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:334)
>>>
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:290)
>>>
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>>>
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>>
>>> at
>>> 

Re: Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-28 Thread Robert Goodman
I changed import in the sample from

import org.apache.spark.mllib.linalg.*;

to

   import org.apache.spark.ml.linalg.*;

and the sample now runs.

   Thanks
 Bob


On Wed, Jul 27, 2016 at 1:33 PM, Robert Goodman  wrote:
> I tried to run the JavaAFTSurvivalRegressionExample on Spark 2.0 and the
> example doesn't work. It looks like the problem is that the example is using
> the MLLib Vector/VectorUDT to create the DataSet which needs to be converted
> using MLUtils before using in the model. I haven't actually tried this yet.
>
> When I run the example (/bin/run-example
> ml.JavaAFTSurvivalRegressionExample), I get the following stack trace
>
> Exception in thread "main" java.lang.IllegalArgumentException: requirement
> failed: Column features must be of type
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
> at scala.Predef$.require(Predef.scala:224)
> at
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
> at
> org.apache.spark.ml.regression.AFTSurvivalRegressionParams$class.validateAndTransformSchema(AFTSurvivalRegression.scala:106)
> at
> org.apache.spark.ml.regression.AFTSurvivalRegression.validateAndTransformSchema(AFTSurvivalRegression.scala:126)
> at
> org.apache.spark.ml.regression.AFTSurvivalRegression.fit(AFTSurvivalRegression.scala:199)
> at
> org.apache.spark.examples.ml.JavaAFTSurvivalRegressionExample.main(JavaAFTSurvivalRegressionExample.java:67)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> Are you suppose to be able use the ML version of VectorUDT? The Spark 2.0
> API docs for Java, don't show the class but I was able to import the class
> into a java program.
>
>Thanks
>  Bob

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-1.6.1-bin-2.6.0 compatible with hive-1.1.0-cdh5.7.1

2016-07-28 Thread Mohammad Tariq
Hi Mich,

Thank you so much for the prompt response!

I do have a copy of hive-site.xml in spark conf directory.

On Thursday, July 28, 2016, Mich Talebzadeh 
wrote:

> Hi,
>
> This line
>
> 2016-07-28 04:36:01,814] INFO Property hive.metastore.integral.jdo.pushdown
> unknown - will be ignored (DataNucleus.Persistence:77)
>
> telling me that you do don't seem to have the softlink to hive-site.xml
> in $SPARK_HOME/conf
>
> hive-site.xml -> /usr/lib/hive/conf/hive-site.xml
>
> I suggest you check that. That is the reason it cannot find you Hive
> metastore
>
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 12:45, Mohammad Tariq  > wrote:
>
>> Could anyone please help me with this? I have been using the same version
>> of Spark with CDH-5.4.5 successfully so far. However after a recent CDH
>> upgrade I'm not able to run the same Spark SQL module against
>> hive-1.1.0-cdh5.7.1.
>>
>> When I try to run my program Spark tries to connect to local derby Hive
>> metastore instead of the configured MySQL metastore. I have all the
>> required jars along with hive-site.xml in place though. There is no change
>> in the setup.
>>
>> This is the exception which I'm getting :
>>
>> [2016-07-28 04:36:01,207] INFO Initializing execution hive, version 1.2.1
>> (org.apache.spark.sql.hive.HiveContext:58)
>>
>> [2016-07-28 04:36:01,231] INFO Inspected Hadoop version: 2.6.0-cdh5.7.1
>> (org.apache.spark.sql.hive.client.ClientWrapper:58)
>>
>> [2016-07-28 04:36:01,232] INFO Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>> 2.6.0-cdh5.7.1 (org.apache.spark.sql.hive.client.ClientWrapper:58)
>>
>> [2016-07-28 04:36:01,520] INFO 0: Opening raw store with implemenation
>> class:org.apache.hadoop.hive.metastore.ObjectStore
>> (org.apache.hadoop.hive.metastore.HiveMetaStore:638)
>>
>> [2016-07-28 04:36:01,548] INFO ObjectStore, initialize called
>> (org.apache.hadoop.hive.metastore.ObjectStore:332)
>>
>> [2016-07-28 04:36:01,814] INFO Property
>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>> (DataNucleus.Persistence:77)
>>
>> [2016-07-28 04:36:01,815] INFO Property datanucleus.cache.level2 unknown
>> - will be ignored (DataNucleus.Persistence:77)
>>
>> [2016-07-28 04:36:02,417] WARN Retrying creating default database after
>> error: Unexpected exception caught.
>> (org.apache.hadoop.hive.metastore.HiveMetaStore:671)
>>
>> javax.jdo.JDOFatalInternalException: Unexpected exception caught.
>>
>> at
>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
>>
>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>>
>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:410)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:439)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:334)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:290)
>>
>> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>>
>> at
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>
>> at
>> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>>
>> at
>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:642)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:620)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:669)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:478)
>>
>> at
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78)
>>
>> at
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5903)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:198)
>>
>> at
>> 

Re: Is spark-1.6.1-bin-2.6.0 compatible with hive-1.1.0-cdh5.7.1

2016-07-28 Thread Mich Talebzadeh
Hi,

This line

2016-07-28 04:36:01,814] INFO Property hive.metastore.integral.jdo.pushdown
unknown - will be ignored (DataNucleus.Persistence:77)

telling me that you do don't seem to have the softlink to hive-site.xml in
$SPARK_HOME/conf

hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

I suggest you check that. That is the reason it cannot find you Hive
metastore



HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 12:45, Mohammad Tariq  wrote:

> Could anyone please help me with this? I have been using the same version
> of Spark with CDH-5.4.5 successfully so far. However after a recent CDH
> upgrade I'm not able to run the same Spark SQL module against
> hive-1.1.0-cdh5.7.1.
>
> When I try to run my program Spark tries to connect to local derby Hive
> metastore instead of the configured MySQL metastore. I have all the
> required jars along with hive-site.xml in place though. There is no change
> in the setup.
>
> This is the exception which I'm getting :
>
> [2016-07-28 04:36:01,207] INFO Initializing execution hive, version 1.2.1
> (org.apache.spark.sql.hive.HiveContext:58)
>
> [2016-07-28 04:36:01,231] INFO Inspected Hadoop version: 2.6.0-cdh5.7.1
> (org.apache.spark.sql.hive.client.ClientWrapper:58)
>
> [2016-07-28 04:36:01,232] INFO Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
> 2.6.0-cdh5.7.1 (org.apache.spark.sql.hive.client.ClientWrapper:58)
>
> [2016-07-28 04:36:01,520] INFO 0: Opening raw store with implemenation
> class:org.apache.hadoop.hive.metastore.ObjectStore
> (org.apache.hadoop.hive.metastore.HiveMetaStore:638)
>
> [2016-07-28 04:36:01,548] INFO ObjectStore, initialize called
> (org.apache.hadoop.hive.metastore.ObjectStore:332)
>
> [2016-07-28 04:36:01,814] INFO Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> (DataNucleus.Persistence:77)
>
> [2016-07-28 04:36:01,815] INFO Property datanucleus.cache.level2 unknown -
> will be ignored (DataNucleus.Persistence:77)
>
> [2016-07-28 04:36:02,417] WARN Retrying creating default database after
> error: Unexpected exception caught.
> (org.apache.hadoop.hive.metastore.HiveMetaStore:671)
>
> javax.jdo.JDOFatalInternalException: Unexpected exception caught.
>
> at
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
>
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>
> at
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:410)
>
> at
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:439)
>
> at
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:334)
>
> at
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:290)
>
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>
> at
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>
> at
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:642)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:620)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:669)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:478)
>
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78)
>
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5903)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:198)
>
> at
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>
> at
> 

reasons for introducing SPARK-9415 - disable group by on MapType

2016-07-28 Thread Tomasz Bartczak
Hello,

what were the reasons for disabling group-by and join on column of MapType?
It was done intentionally in
https://issues.apache.org/jira/browse/SPARK-9415

I understand that it would not be feasible to do order by MapType, but -
group by requires only equality check, which is possible for map of
primitive types.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reasons-for-introducing-SPARK-9415-disable-group-by-on-MapType-tp27423.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is spark-1.6.1-bin-2.6.0 compatible with hive-1.1.0-cdh5.7.1

2016-07-28 Thread Mohammad Tariq
Could anyone please help me with this? I have been using the same version
of Spark with CDH-5.4.5 successfully so far. However after a recent CDH
upgrade I'm not able to run the same Spark SQL module against
hive-1.1.0-cdh5.7.1.

When I try to run my program Spark tries to connect to local derby Hive
metastore instead of the configured MySQL metastore. I have all the
required jars along with hive-site.xml in place though. There is no change
in the setup.

This is the exception which I'm getting :

[2016-07-28 04:36:01,207] INFO Initializing execution hive, version 1.2.1
(org.apache.spark.sql.hive.HiveContext:58)

[2016-07-28 04:36:01,231] INFO Inspected Hadoop version: 2.6.0-cdh5.7.1
(org.apache.spark.sql.hive.client.ClientWrapper:58)

[2016-07-28 04:36:01,232] INFO Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
2.6.0-cdh5.7.1 (org.apache.spark.sql.hive.client.ClientWrapper:58)

[2016-07-28 04:36:01,520] INFO 0: Opening raw store with implemenation
class:org.apache.hadoop.hive.metastore.ObjectStore
(org.apache.hadoop.hive.metastore.HiveMetaStore:638)

[2016-07-28 04:36:01,548] INFO ObjectStore, initialize called
(org.apache.hadoop.hive.metastore.ObjectStore:332)

[2016-07-28 04:36:01,814] INFO Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
(DataNucleus.Persistence:77)

[2016-07-28 04:36:01,815] INFO Property datanucleus.cache.level2 unknown -
will be ignored (DataNucleus.Persistence:77)

[2016-07-28 04:36:02,417] WARN Retrying creating default database after
error: Unexpected exception caught.
(org.apache.hadoop.hive.metastore.HiveMetaStore:671)

javax.jdo.JDOFatalInternalException: Unexpected exception caught.

at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)

at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)

at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)

at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:410)

at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:439)

at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:334)

at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:290)

at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)

at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)

at
org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)

at
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:642)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:620)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:669)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:478)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)

at
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5903)

at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:198)

at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1491)

at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67)

at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)

at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2935)

at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2954)

at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:513)

at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)

at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)

at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)

at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)

at
org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)

at
org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)

at org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:40)

at org.apache.spark.sql.SQLContext.(SQLContext.scala:330)

at 

Re: Run times for Spark 1.6.2 compared to 2.1.0?

2016-07-28 Thread Colin Beckingham

On 27/07/16 16:31, Colin Beckingham wrote:
I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It 
calculates a logistic model using MLlib. I compiled the 2.1 today from 
source and took the version 1 as a precompiled version with Hadoop. 
The odd thing is that on 1.6.2 the project produces an answer in 350 
sec and the 2.1.0 takes 990 sec. Identical code using pyspark. I'm 
wondering if there is something in the setup params for 1.6 and 2.1, 
say number of executors or memory allocation, which might account for 
this? I'm using just the 4 cores of my machine as master and executors.
FWIW I have a bit more information. Watching the jobs as Spark runs I 
can see that when performing the logistic regression in Spark 1.6.2 the 
PySpark call "LogisticRegressionWithLBFGS.train()"  runs "treeAggregate 
at LBFGS.scala:218" but the same command in pyspark with Spark 2.1 runs 
"treeAggregate at LogisticRegression.scala:1092". This last command 
takes about 3 times longer to run than the LBFGS version, and there are 
way more of these calls, and the result is considerably less accurate 
than the LBFGS. The rest of the process seems to be pretty close. So 
Spark 2.1 does not seem to be running an optimized version of logistic 
regression algorithm?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Carlo . Allocca
and, of course I am using

 
org.apache.spark
spark-core_2.11
2.0.0-preview




org.apache.spark
spark-sql_2.11
2.0.0-preview
jar



Is the below problem/issue related to the experimental version of SPARK 2.0.0.

Many Thanks for your help and support.

Best Regards,
carlo

On 28 Jul 2016, at 11:14, Carlo.Allocca 
> wrote:

I have also found the following two related links:

1) 
https://github.com/apache/spark/commit/947b9020b0d621bc97661a0a056297e6889936d3
2) https://github.com/apache/spark/pull/12433

which both explain why it happens but nothing about what to do to solve it.

Do you have any suggestion/recommendation?

Many thanks.
Carlo

On 28 Jul 2016, at 11:06, carlo allocca 
> wrote:

Hi Rui,

Thanks for the promptly reply.
No, I am not using Mesos.

Ok. I am writing a code to build a suitable dataset for my needs as in the 
following:

== Session configuration:

 SparkSession spark = SparkSession
.builder()
.master("local[6]") //
.appName("DatasetForCaseNew")
.config("spark.executor.memory", "4g")
.config("spark.shuffle.blockTransferService", "nio")
.getOrCreate();


public Dataset buildDataset(){
...

// STEP A
// Join prdDS with cmpDS
Dataset prdDS_Join_cmpDS
= res1
  .join(res2, 
(res1.col("PRD_asin#100")).equalTo(res2.col("CMP_asin")), "inner");

prdDS_Join_cmpDS.take(1);

// STEP B
// Join prdDS with cmpDS
Dataset prdDS_Join_cmpDS_Join
= prdDS_Join_cmpDS
  .join(res3, 
prdDS_Join_cmpDS.col("PRD_asin#100").equalTo(res3.col("ORD_asin")), "inner");
prdDS_Join_cmpDS_Join.take(1);
prdDS_Join_cmpDS_Join.show();

}


The exception is thrown when the computation reach the STEP B, until STEP A is 
fine.

Is there anything wrong or missing?

Thanks for your help in advance.

Best Regards,
Carlo





=== STACK TRACE

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 422.102 sec <<< 
FAILURE!
testBuildDataset(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time 
elapsed: 421.994 sec  <<< ERROR!
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:102)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:565)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 

Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Carlo . Allocca
I have also found the following two related links:

1) 
https://github.com/apache/spark/commit/947b9020b0d621bc97661a0a056297e6889936d3
2) https://github.com/apache/spark/pull/12433

which both explain why it happens but nothing about what to do to solve it.

Do you have any suggestion/recommendation?

Many thanks.
Carlo

On 28 Jul 2016, at 11:06, carlo allocca 
> wrote:

Hi Rui,

Thanks for the promptly reply.
No, I am not using Mesos.

Ok. I am writing a code to build a suitable dataset for my needs as in the 
following:

== Session configuration:

 SparkSession spark = SparkSession
.builder()
.master("local[6]") //
.appName("DatasetForCaseNew")
.config("spark.executor.memory", "4g")
.config("spark.shuffle.blockTransferService", "nio")
.getOrCreate();


public Dataset buildDataset(){
...

// STEP A
// Join prdDS with cmpDS
Dataset prdDS_Join_cmpDS
= res1
  .join(res2, 
(res1.col("PRD_asin#100")).equalTo(res2.col("CMP_asin")), "inner");

prdDS_Join_cmpDS.take(1);

// STEP B
// Join prdDS with cmpDS
Dataset prdDS_Join_cmpDS_Join
= prdDS_Join_cmpDS
  .join(res3, 
prdDS_Join_cmpDS.col("PRD_asin#100").equalTo(res3.col("ORD_asin")), "inner");
prdDS_Join_cmpDS_Join.take(1);
prdDS_Join_cmpDS_Join.show();

}


The exception is thrown when the computation reach the STEP B, until STEP A is 
fine.

Is there anything wrong or missing?

Thanks for your help in advance.

Best Regards,
Carlo





=== STACK TRACE

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 422.102 sec <<< 
FAILURE!
testBuildDataset(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time 
elapsed: 421.994 sec  <<< ERROR!
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:102)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:565)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:38)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:304)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:343)

Re: Fail a batch in Spark Streaming forcefully based on business rules

2016-07-28 Thread Hemalatha A
Another usecase why I need to do this is, If Exception A is caught I should
just print it and ignore, but ifException B occurs, I have to end the
batch, fail it and stop processing the batch.
Is it possible to achieve this?? Any hints on this please.


On Wed, Jul 27, 2016 at 10:42 AM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Hello,
>
> I have a uescase where in, I have  to fail certain batches in my
> streaming batches, based on my application specific business rules.
> Ex: If in a batch of 2 seconds, I don't receive 100 message, I should fail
> the batch and move on.
>
> How to achieve this behavior?
>
> --
>
>
> Regards
> Hemalatha
>



-- 


Regards
Hemalatha


Re: create external table from partitioned avro file

2016-07-28 Thread Gourav Sengupta
Why avro?

Regards,
Gourav Sengupta

On Thu, Jul 28, 2016 at 8:15 AM, Yang Cao  wrote:

> Hi,
>
> I am using spark 1.6 and I hope to create a hive external table based on
> one partitioned avro file. Currently, I don’t find any build-in api to do
> this work. I tried the write.format().saveAsTable, with format
> com.databricks.spark.avro. it returned error can’t file Hive serde for
> this. Also, same problem with function createExternalTable(). Spark seems
> can recognize avro format. Need help for this task. Welcome any suggestion.
>
> THX
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Carlo . Allocca
Hi Rui,

Thanks for the promptly reply.
No, I am not using Mesos.

Ok. I am writing a code to build a suitable dataset for my needs as in the 
following:

== Session configuration:

 SparkSession spark = SparkSession
.builder()
.master("local[6]") //
.appName("DatasetForCaseNew")
.config("spark.executor.memory", "4g")
.config("spark.shuffle.blockTransferService", "nio")
.getOrCreate();


public Dataset buildDataset(){
...

// STEP A
// Join prdDS with cmpDS
Dataset prdDS_Join_cmpDS
= res1
  .join(res2, 
(res1.col("PRD_asin#100")).equalTo(res2.col("CMP_asin")), "inner");

prdDS_Join_cmpDS.take(1);

// STEP B
// Join prdDS with cmpDS
Dataset prdDS_Join_cmpDS_Join
= prdDS_Join_cmpDS
  .join(res3, 
prdDS_Join_cmpDS.col("PRD_asin#100").equalTo(res3.col("ORD_asin")), "inner");
prdDS_Join_cmpDS_Join.take(1);
prdDS_Join_cmpDS_Join.show();

}


The exception is thrown when the computation reach the STEP B, until STEP A is 
fine.

Is there anything wrong or missing?

Thanks for your help in advance.

Best Regards,
Carlo





=== STACK TRACE

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 422.102 sec <<< 
FAILURE!
testBuildDataset(org.mksmart.amaretto.ml.DatasetPerHourVerOneTest)  Time 
elapsed: 421.994 sec  <<< ERROR!
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:102)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:565)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:38)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:304)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:343)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 

Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Sun Rui
Are you using Mesos? if not , https://issues.apache.org/jira/browse/SPARK-16522 
  is not relevant

You may describe more information about your Spark environment, and the full 
stack trace.
> On Jul 28, 2016, at 17:44, Carlo.Allocca  wrote:
> 
> Hi All, 
> 
> I am running SPARK locally, and when running d3=join(d1,d2) and d5=(d3, d4) 
> am getting  the following exception "org.apache.spark.SparkException: 
> Exception thrown in awaitResult”. 
> Googling for it, I found that the closed is the answer reported 
> https://issues.apache.org/jira/browse/SPARK-16522 
>  which mention that it is 
> bug of the SPARK 2.0.0. 
> 
> Is it correct or am I missing anything? 
> 
> Many Thanks for your answer and help. 
> 
> Best Regards,
> Carlo
> 
> -- The Open University is incorporated by Royal Charter (RC 000391), an 
> exempt charity in England & Wales and a charity registered in Scotland (SC 
> 038302). The Open University is authorised and regulated by the Financial 
> Conduct Authority.



Re: Spark Standalone Cluster: Having a master and worker on the same node

2016-07-28 Thread Chanh Le
Hi Jestin,
I saw most of setup usually setup along master and slave in a same node.
Because I think master doesn't do as much job as slave does and resource is
expensive we need to use it.
BTW In my setup I setup along master and slave.
I have  5 nodes and 3 of which are master and slave running alongside.
Hope it can help!.


Regards,
Chanh





On Thu, Jul 28, 2016 at 12:19 AM, Jestin Ma 
wrote:

> Hi, I'm doing performance testing and currently have 1 master node and 4
> worker nodes and am submitting in client mode from a 6th cluster node.
>
> I know we can have a master and worker on the same node. Speaking in terms
> of performance and practicality, is it possible/suggested to have another
> working running on either the 6th node or the master node?
>
> Thank you!
>
>


Re: saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-28 Thread Ascot Moss
Hi,

Thanks for your reply.

permissions (access) is not an issue in my case, it is because this issue
only happened when the bigger input file was used to generate the model,
i.e. with smaller input(s) all worked well.   It seems to me that ".save"
cannot save big file.

Q1: Any idea about the size  limit that ".save" can handle?
Q2: Any idea about how to check the size model that will be saved vis
".save" ?

Regards



On Thu, Jul 28, 2016 at 4:19 PM, Spico Florin  wrote:

> Hi!
>   There are many reasons that your task is failed. One could be that you
> don't have proper permissions (access) to  hdfs with your user. Please
> check your user rights to write in hdfs. Please have a look also :
>
> http://stackoverflow.com/questions/27427042/spark-unable-to-save-in-hadoop-permission-denied-for-user
> I hope it jelps.
>  Florin
>
>
> On Thu, Jul 28, 2016 at 3:49 AM, Ascot Moss  wrote:
>
>>
>> Hi,
>>
>> Please help!
>>
>> When saving the model, I got following error and cannot save the model to
>> hdfs:
>>
>> (my source code, my spark is v1.6.2)
>> my_model.save(sc, "/my_model")
>>
>> -
>> 16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose
>> tasks have all completed, from pool
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: ResultStage 69 (saveAsTextFile at
>> treeEnsembleModels.scala:447) finished in 0.901 s
>>
>> 16/07/28 08:36:19 INFO DAGScheduler: Job 38 finished: saveAsTextFile at
>> treeEnsembleModels.scala:447, took 2.513396 s
>>
>> Killed
>> -
>>
>>
>> Q1: Is there any limitation on saveAsTextFile?
>> Q2: or where to find the error log file location?
>>
>> Regards
>>
>>
>>
>>
>>
>


SPARK Exception thrown in awaitResult

2016-07-28 Thread Carlo . Allocca
Hi All,

I am running SPARK locally, and when running d3=join(d1,d2) and d5=(d3, d4) am 
getting  the following exception "org.apache.spark.SparkException: Exception 
thrown in awaitResult”.
Googling for it, I found that the closed is the answer reported 
https://issues.apache.org/jira/browse/SPARK-16522 which mention that it is bug 
of the SPARK 2.0.0.

Is it correct or am I missing anything?

Many Thanks for your answer and help.

Best Regards,
Carlo

-- The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England & Wales and a charity registered in Scotland (SC 038302). 
The Open University is authorised and regulated by the Financial Conduct 
Authority.


Re:Re: Spark 2.0 on YARN - Dynamic Resource Allocation Behavior change?

2016-07-28 Thread LONG WANG
Thanks for your reply, I have tried your suggestion. it works now.

At 2016-07-28 16:18:01, "Sun Rui"  wrote:
Yes, this is a change in Spark 2.0.  you can take a look at 
https://issues.apache.org/jira/browse/SPARK-13723


In the latest Spark On Yarn documentation for Spark 2.0, there is updated 
description for --num-executors:
| spark.executor.instances | 2 | The number of executors for static allocation. 
Withspark.dynamicAllocation.enabled, the initial set of executors will be at 
least this large. |
You can disable the dynamic allocation for an application by specifying “--conf 
spark.dynamicAllocation.enabled=false” in the command line.


On Jul 28, 2016, at 15:44, LONG WANG  wrote:


Hi Spark Experts,


  Today I tried Spark 2.0 on YARN and also enabled 
Dynamic Resource Allocation feature, I just find that no matter I specify 
--num-executor in spark-submit command or not, the Dynamic Resource Allocation 
is used, but I remember when I specify --num-executor option in spark-submit 
command in Spark 1.6, the Dynamic Resource Allocation feature will not be 
used/effect for that job. And I can see below log in Spark 1.6 .


<截图1.png>
   
  Is this a behavior change in Spark 2.0? And How can I 
disable Dynamic Resource Allocation for a specific job submission temporarily 
as before? 



 

 邮件带有附件预览链接,若您转发或回复此邮件时不希望对方预览附件,建议您手动删除链接。
共有 1 个附件
截图1.png(23K)极速下载 在线预览



Re: Spark Thrift Server 2.0 set spark.sql.shuffle.partitions not working when query

2016-07-28 Thread Chanh Le
Thank you Takeshi, it works fine now.

Regards,
Chanh


> On Jul 28, 2016, at 2:03 PM, Takeshi Yamamuro  wrote:
> 
> Hi,
> 
> you need to set the value when you start the server.
> 
> // maropu
> 
> On Thu, Jul 28, 2016 at 3:59 PM, Chanh Le  > wrote:
> Hi everyone,
> 
> I set spark.sql.shuffle.partitions=10 after starting the Spark Thrift Server, but it 
> seems not to be working.
> 
> ./sbin/start-thriftserver.sh --master 
> mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf 
> spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars 
> /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar
>  --total-executor-cores 35 spark-internal --hiveconf 
> hive.server2.thrift.port=1 --hiveconf 
> hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
> hive.metastore.metadb.dir=/user/hive/metadb
> 
> Has anyone had the same issue as me?
> 
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Materializing mapWithState .stateSnapshot() after ssc.stop

2016-07-28 Thread Ben Teeuwen
Hi all,

I've posted a question regarding sessionizing events using Scala and 
mapWithState at 
http://stackoverflow.com/questions/38541958/materialize-mapwithstate-statesnapshots-to-database-for-later-resume-of-spark-st

The context is to be able to pause and resume a streaming app without losing 
the state information. So I want to save and reload (initialize) the state 
snapshot. Has any of you already been able to do this?

Thanks,

Ben
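For reference, a minimal sketch of one way to approach this (the key/value types, paths and mapping function are assumptions, not taken from the Stack Overflow post): persist stateSnapshots() on every batch, and seed the next run through StateSpec.initialState.

import org.apache.spark.streaming.{State, StateSpec}

// Assumed session state: key -> running count.
def trackState(key: String, value: Option[Long], state: State[Long]): Option[(String, Long)] = {
  val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
  state.update(sum)
  Some((key, sum))
}

val snapshotDir = "hdfs:///tmp/state-snapshots"   // assumed location

// On restart, seed the state from the most recent snapshot written by the previous run
// (replace <latest-batch-time> with the newest directory under snapshotDir).
val initial = ssc.sparkContext.objectFile[(String, Long)](snapshotDir + "/<latest-batch-time>")
val spec = StateSpec.function(trackState _).initialState(initial)

// events is an assumed DStream[(String, Long)].
val stateful = events.mapWithState(spec)

// Write a snapshot every batch so the app can be stopped and resumed later.
stateful.stateSnapshots().foreachRDD { (rdd, time) =>
  rdd.saveAsObjectFile(s"$snapshotDir/${time.milliseconds}")
}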

Re: Spark 2.0 on YARN - Dynamic Resource Allocation Behavior change?

2016-07-28 Thread Sun Rui
Yes, this is a change in Spark 2.0. You can take a look at 
https://issues.apache.org/jira/browse/SPARK-13723 


In the latest Spark on YARN documentation for Spark 2.0, there 
is an updated description for --num-executors:
> spark.executor.instances  2   The number of executors for static 
> allocation. With spark.dynamicAllocation.enabled, the initial set of executors 
> will be at least this large.
You can disable dynamic allocation for an application by specifying "--conf 
spark.dynamicAllocation.enabled=false" on the command line.
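As a sketch of the two options (values are illustrative; the same settings can equally be passed with --conf on spark-submit):

import org.apache.spark.SparkConf

// Option 1: turn dynamic allocation off for this job and fix the executor count.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "4")          // illustrative value

// Option 2: keep dynamic allocation on and treat --num-executors /
// spark.executor.instances as the initial (minimum) number of executors.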

> On Jul 28, 2016, at 15:44, LONG WANG  wrote:
> 
> Hi Spark Experts,
> 
>   Today I tried Spark 2.0 on YARN with the Dynamic Resource 
> Allocation feature enabled. I found that no matter whether I specify 
> --num-executors in the spark-submit command or not, Dynamic Resource Allocation 
> is used. But I remember that when I specified the --num-executors option in the 
> spark-submit command in Spark 1.6, the Dynamic Resource Allocation feature would 
> not take effect for that job, and I could see the log below in Spark 1.6.
> 
> [attachment: 截图1.png (Spark 1.6 log screenshot)]
>
>   Is this a behavior change in Spark 2.0? And how can 
> I disable Dynamic Resource Allocation temporarily for a specific job submission 
> as before? 
> 
> 
>  


Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Jörn Franke
I see it more as a process of innovation, and thus competition is good. 
Companies just should not follow these religious arguments but should try for 
themselves what suits them. There is more to it than software when using software ;)
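In that spirit of trying both, a minimal sketch for benchmarking the two formats on your own data (paths, the DataFrame name and the column names are assumptions):

// df is an assumed DataFrame holding a representative sample of your data.
// Note: in Spark 2.0, ORC support requires a Hive-enabled SparkSession (enableHiveSupport()).
df.write.mode("overwrite").parquet("hdfs:///tmp/fmt-test/parquet")
df.write.mode("overwrite").orc("hdfs:///tmp/fmt-test/orc")

// Compare on-disk size and the run time of a typical query against each copy.
val parquetDf = spark.read.parquet("hdfs:///tmp/fmt-test/parquet")
val orcDf     = spark.read.orc("hdfs:///tmp/fmt-test/orc")
parquetDf.filter("rating >= 4").groupBy("movie_id").count().count()   // illustrative query
orcDf.filter("rating >= 4").groupBy("movie_id").count().count()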

> On 28 Jul 2016, at 01:44, Mich Talebzadeh  wrote:
> 
> And frankly this is becoming some sort of religious argument now
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni  
>> wrote:
>> It depends on what you are doing; here is a recent comparison of ORC and 
>> Parquet:
>> 
>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>> 
>> Although it is from the ORC authors, I thought it was a fair comparison. We use ORC 
>> as the system of record on our Cloudera HDFS cluster, and our experience so far is good.
>> 
>> Parquet is backed by Cloudera, which has more installations of Hadoop. ORC 
>> is backed by Hortonworks, so the battle of file formats continues...
>> 
>> Sent from my iPhone
>> 
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty  
>>> wrote:
>>> 
>>> Seems like the Parquet format is better compared to ORC when the dataset 
>>> is log data without nested structures? Is this a fair understanding?
>>> 
 On Jul 27, 2016 1:30 PM, "Jörn Franke"  wrote:
 Kudu has, from my impression, been designed to offer something between 
 HBase and Parquet for write-intensive loads - it is not faster for 
 warehouse-type querying compared to Parquet (if anything slower, because 
 that is not its use case). I assume this is still its strategy.
 
 For some scenarios it could make sense together with Parquet and ORC. 
 However, I am not sure what the advantage is over using HBase + Parquet and 
 ORC.
 
> On 27 Jul 2016, at 11:47, "u...@moosheimer.com"  
> wrote:
> 
> Hi Gourav,
> 
> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an 
> in-memory db with data storage, while Parquet is "only" a columnar storage 
> format.
> 
> As I understand it, Kudu is a BI db meant to compete with Exasol or Hana (ok ... 
> that's more a wish :-).
> 
> Regards,
> Uwe
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta 
>> :
>> 
>> Gosh,
>> 
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at 
>> a speed that is better than SPARK.
>> 
>> Has anyone heard of Kudu? It's better than Parquet. But I think that 
>> someone might just start saying that Kudu has a difficult lineage as well. 
>> After all, dynastic rules dictate.
>> 
>> Personally, I feel that if something stores my data compressed and makes 
>> me access it faster, I do not care where it comes from or how difficult 
>> the childbirth was :)
>> 
>> 
>> Regards,
>> Gourav
>> 
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>>>  wrote:
>>> Just a correction:
>>> 
>>> The ORC Java libraries from Hive have been forked into Apache ORC, with 
>>> vectorization on by default.
>>> 
>>> Does anyone know if Spark is leveraging this new repo?
>>> 
>>> 
>>> <dependency>
>>>   <groupId>org.apache.orc</groupId>
>>>   <artifactId>orc</artifactId>
>>>   <version>1.1.2</version>
>>>   <type>pom</type>
>>> </dependency>
>>> 
>>> Sent from my iPhone
 On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
 
>>> 
 parquet was inspired by dremel but written from the ground up as a 
 library with support for a variety of big data systems (hive, pig, 
 impala, cascading, etc.). it is also easy to add new support, since 
 its a proper library.
 
 orc has been enhanced while deployed at facebook in hive and at yahoo 
 in hive. just hive. it didn't really exist by itself. it was part of 
 the big java soup that is called hive, without an easy way to extract 
 it. hive does not expose proper java apis. it never cared for that.
 
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>  wrote:
> Interesting opinion, thank you
> 
> Still, on the website parquet is basically inspired by Dremel 
> (Google) [1] and part of orc 

Re: Any reference of performance tuning on SparkSQL?

2016-07-28 Thread Sonal Goyal
I found some references at

http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-tuning-in-Spark-SQL-td21871.html

HTH

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Thu, Jul 28, 2016 at 12:40 PM, Linyuxin  wrote:

> Hi ALL
>
>  Is there any reference for performance tuning of Spark SQL?
>
> I can only find material about tuning Spark core on http://spark.apache.org/
>
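For what it's worth, a few of the knobs covered in the tuning guide linked above, sketched with Spark 2.0's SparkSession API (the values and the table name are illustrative):

// Number of partitions used for shuffles in joins and aggregations (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "200")

// Tables smaller than this threshold (in bytes) are broadcast in joins; -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Cache a frequently queried table in memory (compressed, columnar).
spark.catalog.cacheTable("events")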


Spark 2.0 on YARN - Dynamic Resource Allocation Behavior change?

2016-07-28 Thread LONG WANG
Hi Spark Experts,


  Today I tried Spark 2.0 on YARN with the Dynamic Resource 
Allocation feature enabled. I found that no matter whether I specify --num-executors 
in the spark-submit command or not, Dynamic Resource Allocation is used. But I 
remember that when I specified the --num-executors option in the spark-submit 
command in Spark 1.6, the Dynamic Resource Allocation feature would not take effect 
for that job, and I could see the log below in Spark 1.6.



   
  Is this a behavior change in Spark 2.0? And how can I 
disable Dynamic Resource Allocation temporarily for a specific job submission, 
as before?

create external table from partitioned avro file

2016-07-28 Thread Yang Cao
Hi,

I am using Spark 1.6 and I hope to create a Hive external table based on one 
partitioned Avro file. Currently, I can't find any built-in API to do this 
work. I tried write.format().saveAsTable with the format 
com.databricks.spark.avro; it returned an error that it can't find a Hive serde for 
this. I have the same problem with the function createExternalTable(). Spark seems 
to be able to recognize the Avro format. I need help with this task. Any suggestion is welcome.

THX
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
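One possible workaround, sketched under the assumption that the Avro data sits in a partitioned directory layout and that Hive 0.14+ is available (the table name, columns and paths are illustrative, not from the original message): create the table with Hive DDL through the HiveContext, let Hive's built-in Avro serde handle the format, then register the existing partitions.

// hiveContext is an assumed org.apache.spark.sql.hive.HiveContext (Spark 1.6).
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (
    id BIGINT,
    payload STRING
  )
  PARTITIONED BY (dt STRING)
  STORED AS AVRO
  LOCATION 'hdfs:///data/events_avro'
""")

// Pick up the existing partition directories (e.g. .../dt=2016-07-28/).
hiveContext.sql("MSCK REPAIR TABLE events")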



Any reference of performance tuning on SparkSQL?

2016-07-28 Thread Linyuxin
Hi ALL,
 Is there any reference for performance tuning of Spark SQL?
I can only find material about tuning Spark core on http://spark.apache.org/


Re: Possible to push sub-queries down into the DataSource impl?

2016-07-28 Thread Takeshi Yamamuro
Hi,

Have you seen this ticket?
https://issues.apache.org/jira/browse/SPARK-12449

// maropu

On Thu, Jul 28, 2016 at 2:13 AM, Timothy Potter 
wrote:

> I'm not looking for a one-off solution for a specific query that can
> be solved on the client side as you suggest, but rather a generic
> solution that can be implemented within the DataSource impl itself
> when it knows a sub-query can be pushed down into the engine. In other
> words, I'd like to intercept the query planning process to be able to
> push-down computation into the engine when it makes sense.
>
> On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
>  wrote:
> > Why don't you create a filtered DataFrame, register it as a temporary table and
> > then use it in your query? You can also cache it, if multiple queries on the
> > same inner query are requested.
> >
> >
> > On Wednesday, 27 July 2016, Timothy Potter  wrote:
> >>
> >> Take this simple join:
> >>
> >> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
> >> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
> >> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
> >> solr.movie_id = m.movie_id ORDER BY aggCount DESC
> >>
> >> I would like the ability to push the inner sub-query aliased as "solr"
> >> down into the data source engine, in this case Solr as it will
> >> greatly reduce the amount of data that has to be transferred from
> >> Solr into Spark. I would imagine this issue comes up frequently if the
> >> underlying engine is a JDBC data source as well ...
> >>
> >> Is this possible? Of course, my example is a bit cherry-picked so
> >> determining if a sub-query can be pushed down into the data source
> >> engine is probably not a trivial task, but I'm wondering if Spark has
> >> the hooks to allow me to try ;-)
> >>
> >> Cheers,
> >> Tim
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Ing. Marco Colombo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro
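For completeness, a sketch of the client-side workaround Marco describes in the quoted thread (not the planner-level push-down Timothy is after); it assumes "ratings" and "movies" are already registered as tables in a Spark 2.0 session:

// Pre-compute the inner aggregation once, register it, and cache it for reuse.
val topMovies = spark.sql(
  """SELECT movie_id, COUNT(*) AS aggCount
    |FROM ratings
    |WHERE rating >= 4
    |GROUP BY movie_id
    |ORDER BY aggCount DESC
    |LIMIT 10""".stripMargin)

topMovies.createOrReplaceTempView("top_movies")
topMovies.cache()   // useful if several outer queries reuse the same inner result

val joined = spark.sql(
  """SELECT m.title AS title, t.aggCount AS aggCount
    |FROM movies m
    |JOIN top_movies t ON t.movie_id = m.movie_id
    |ORDER BY aggCount DESC""".stripMargin)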